In this workshop we continue learning about hypothesis testing and we start learning about measures of linear relationships.
1 Workshop Directions
You have to work on Google Colab for all your workshops. In Google Colab, you MUST LOG IN with your @tec.mx account and then create a Google Colab document for each workshop.
You must share each Colab document (workshop) with the following accounts:
cdorante.tec@gmail.com
cdorante@tec.mx
You must give Edit privileges to these accounts.
You have to follow this workshop in class to learn about the topics. You have to write your own Markdown notes/document for every workshop we cover.
Rename your Notebook as “W1-Statistics-AI-YourFirstName-YourLastname”.
You must submit your workshop before we start with the next workshop. What do you have to write in your workshop? You have to:
REPLICATE and RUN all the Python code, and
DO ALL CHALLENGES stated in the sections. These challenges can be Python code or answering QUESTIONS in your own words and in CAPITAL LETTERS. You have to WRITE CLEARLY so that I can see your LINE OF THINKING!
The submission of your workshops is a REQUISITE for grading your final deliverable document of the Statistics Module.
I strongly recommend you write your OWN NOTES about the topics, as if this were your study NOTEBOOK.
2 Hypothesis testing - comparing the mean of 2 groups
In the last workshop we learned about hypothesis testing for the mean of one group. This test is usually named the one-sample t-test.
Now we will do a hypothesis testing to compare the means of two groups. This test is usually named two-sample t-test.
In the case of the two-sample t-test we try to check whether the mean of a group is greater than the mean of another group.
Imagine we have two random variables X and Y and we take a random sample of each variable to check whether the mean of X is greater than the mean of Y.
We start writing the null and alternative hypothesis as follows:
H0:\mu_{x}=\mu_{y}
Ha:\mu_{x}\neq\mu_{y}
We do simple algebra to leave a number on the right-hand side of the equality and a random variable on the left-hand side of the equation. Then, we re-write these hypotheses as:
H0:(\mu_{x}-\mu_{y})=0
Ha:(\mu_{x}-\mu_{y})\neq0
The Greek letter \mu is used to represent the population mean of a variable.
To test this hypothesis we take a random sample of X and Y and calculate their means.
In this case, the variable of study is the difference of the 2 means! We can name this variable of study diff:
diff = (\mu_{x}-\mu_{y})
Since we use sample means instead of population means, we can re-define this difference as:
diff = (\bar{X}-\bar{Y})
The steps for all hypothesis tests are basically the same. What changes is the calculation of the standard deviation of the variable of study, which is usually named the standard error.
For the case of one-sample t-test, the standard error was calculated as \frac{SD}{\sqrt{N}}, where SD is the individual sample standard deviation of the variable, and N is the sample size.
In the case of two-sample t-test, the standard error SE can be calculated using different formulas depending on the assumptions of the test. In this workshop, we will assume that the population variances of both groups are NOT EQUAL, and the sample size of both groups is the same (N). For these assumptions, the formula is the following:
SD(diff)=SE=\sqrt{\frac{Var(X)+Var(Y)}{N}}
But, where does this formula come from?
We can easily derive this formula by applying basic probability rules to the variance of a difference of 2 means. Let’s do so.
The variances of each group of X and Y might be different, so we can estimate the variance of the DIFFERENCE as:
Var(\bar{X}-\bar{Y})=Var(\bar{X})+Var(\bar{Y})
This is true if and only if \bar{X} and \bar{Y} are independent. We will assume that both random variables are independent of each other. This might not apply for certain real-world problems, but we will assume it for simplicity. If there is dependence, we need to subtract another term equal to 2 times the covariance between both variables: Var(\bar{X}-\bar{Y})=Var(\bar{X})+Var(\bar{Y})-2Cov(\bar{X},\bar{Y}).
Why is the variance of a difference of 2 random variables the SUM of their variances? This sounds counter-intuitive, but it is correct. The intuition is that when we take the difference we do not know in which direction each random variable will deviate from its mean. If the deviation of \bar{Y} is negative, then we end up ADDING deviations instead of subtracting them, so the uncertainties add up!
As we learned in the CLT, the variance of the mean of a random variable is reduced according to its sample size: Var(\bar{X})=\frac{Var(X)}{N}. Then:

Var(\bar{X}-\bar{Y})=\frac{Var(X)}{N}+\frac{Var(Y)}{N}=\frac{Var(X)+Var(Y)}{N}

Taking the square root of this variance gives the standard error formula shown above.
Then, the method for hypothesis testing is the same we did in the case of one-sample t-test. We just need to use this formula as the denominator of the t-statistic.
Then, the t-statistic for the two-sample t-test is calculated as:

t=\frac{(\bar{X}-\bar{Y})-0}{SE}=\frac{\bar{X}-\bar{Y}}{\sqrt{\frac{Var(X)+Var(Y)}{N}}}
Remember that the value of t is the # of standard deviations of the variable of study (in this case, the difference of the 2 means) that the empirical difference we got from the data is away from the hypothetical value, zero.
The rule of thumb we have used is that if |t|>2 we have statistical evidence at least at the 95% confidence level to reject the null hypothesis (or to support our alternative hypothesis).
3 Confidence level, Type I Error and pvalue
The confidence level of a test is related to the error level of the test. For a confidence level of 95%, there is a 5% probability of mistakenly rejecting the null hypothesis. In other words, for a 95% confidence level, we can end up with a mistaken conclusion 5% of the time. This error is also called the Type I Error.
The pvalue of the test is actually the exact probability of making a Type I Error after we calculate the exact t-statistic. In other words, the pvalue is the probability that we will be wrong if we reject the null hypothesis (and support our hypothesis).
For each value of a t-statistic, there is a corresponding pvalue. We can relate both values in the following figure of the t-Student PDF:
Illustrating t-Statistics vs pvalue
For a 95% confidence level and a 2-tailed pvalue, the critical t value is close to 2 (it is not exactly 2); it changes according to N, the # of observations in the sample.
When the sample size N>30, the t-Student distribution approximates the Z normal distribution. In the above figure we can see that when N>30 and t=2, the approximate pvalues are: 1-tailed pvalue = 2.5%, and 2-tailed pvalue = 5%.
Then, what are the 1-tailed and 2-tailed pvalues? The 2-tailed pvalue will always be twice the 1-tailed pvalue, since the t-Student distribution is symmetric.
We always want a very small pvalue in order to reject H0. Then, the 1-tailed pvalue seems to be the one to use. However, the 2-tailed pvalue is a more conservative value (the skeptical "diablito" will feel ok with this value). Most statistical software and computer languages report the 2-tailed pvalue.
Then, which pvalue is the right one to use? It depends on the context. When there is a theory that supports the alternative hypothesis, we can use the 1-tailed pvalue. For now, we can be conservative and use the 2-tailed pvalue for our t-tests.
Then, we can define the p-value of a t-test (in terms of the confidence level of the test) as:
pvalue=(1-ConfidenceLevel)
In the case of 1-tailed pvalue and a 95% confidence level, the critical t-value is less than 2; it is approximately 1.65:
Illustrating 1-tailed t and pvalues
The pvalue cannot be calculated with an analytic formula since the integral of the Z normal or t-Student PDF has no closed-form solution. We would need to use statistical tables. Fortunately, all statistical software and most computer languages can easily calculate pvalues for any hypothesis test.
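For example, here is a minimal sketch (with hypothetical values of t and N, just to illustrate) of how the 1-tailed and 2-tailed pvalues can be computed with scipy:

from scipy import stats as st

# Hypothetical example: a t-statistic of 2.1 from a sample of N = 40 observations
t = 2.1
N = 40

# st.t.sf gives the area in the right tail of the t-Student distribution (1-tailed pvalue):
pvalue_1tail = st.t.sf(abs(t), df=N-1)
# The 2-tailed pvalue is twice the 1-tailed pvalue:
pvalue_2tail = 2 * pvalue_1tail
print(pvalue_1tail, pvalue_2tail)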
3.1 CHALLENGE - IS AMD MEAN RETURN HIGHER THAN INTEL MEAN RETURN?
Do a t-test to check whether the mean monthly cc return of AMD (AMD) is greater than the mean monthly cc return of Intel (INTC). Use data from Jan 2017 to date.
Code
import pandas as pd
import numpy as np
import yfinance as yf

# Getting price data and selecting adjusted price columns:
sprices = yf.download(tickers="AMD INTC", start="2017-01-01", interval="1mo")
sprices = sprices['Adj Close']
Code
# Calculating returns:
sr = np.log(sprices) - np.log(sprices.shift(1))
# Deleting the first month with NAs:
sr = sr.dropna()
Code
# Stating the hypotheses:
# H0: (mean(rAMD) - mean(rINTEL)) = 0
# Ha: (mean(rAMD) - mean(rINTEL)) <> 0

# Calculating the standard error of the difference of the means:
N = sr['AMD'].count()
amdvar = sr['AMD'].var()
intelvar = sr['INTC'].var()
sediff = np.sqrt((1/N) * (amdvar + intelvar))

# Calculating the t-Statistic:
t = (sr['AMD'].mean() - sr['INTC'].mean()) / sediff
t
1.666429623285142
Code
# Calculating the pvalue from the t-Statistic:
from scipy import stats as st

# The st.t.sf function calculates the 1-tailed pvalue, so we multiply it by 2 to get the 2-tailed pvalue.
# The degrees of freedom for the 2-independent-means t-test is calculated with the following formula:
df = ( ((N-1) / N**2) * (amdvar + intelvar)**2 / ( (amdvar/N)**2 + (intelvar/N)**2 ) )

# Now we calculate the pvalue with the t and df:
pvalue = 2 * st.t.sf(np.abs(t), df)
pvalue
0.09764755281471679
Code
# Using the ttest_ind function from scipy stats:
st.ttest_ind(sr['AMD'], sr['INTC'], equal_var=False)
# We got the same result as above!
# With this function we avoid calculating all steps of the hypothesis test!
import researchpy as rp
# Using the ttest function from researchpy:
rp.ttest(sr['AMD'], sr['INTC'], equal_variances=False)
# We got the same result as above!
# With this function we avoid calculating all steps of the hypothesis test!
( Variable N Mean SD SE 95% Conf. Interval
0 AMD 91.0 0.028587 0.158339 0.016598 -0.004388 0.061563
1 INTC 91.0 -0.004453 0.103446 0.010844 -0.025996 0.017091
2 combined 182.0 0.012067 0.134394 0.009962 -0.007589 0.031724,
Satterthwaite t-test results
0 Difference (AMD - INTC) = 0.0330
1 Degrees of freedom = 154.9895
2 t = 1.6664
3 Two side test p value = 0.0976
4 Difference < 0 p value = 0.9512
5 Difference > 0 p value = 0.0488
6 Cohen's d = 0.2470
7 Hedge's g = 0.2460
8 Glass's delta1 = 0.2087
9 Point-Biserial r = 0.1327)
4 Measures of linear relationship
We might be interested in learning whether there is a pattern of movement of a random variable when another random variable moves up or down. An important pattern we can measure is the linear relationship. The main two measures of linear relationship between 2 random variables are:
Covariance and
Correlation
Let’s start with an example. Imagine we want to see whether there is a relationship between the S&P500 and Microsoft stock.
The S&P500 is an index that represents the 500 biggest US companies, which is a good representation of the US financial market. We will use monthly data for the last 3-4 years.
Let’s download the price data and do the corresponding return calculation. We will use the yfinance package to download online data from Yahoo Finance from January 2019 to July 2024:
import numpy as np
import pandas as pd
import yfinance as yf
import matplotlib
import matplotlib.pyplot as plt

# We download price data for Microsoft and the S&P500 index:
prices = yf.download(tickers="MSFT ^GSPC", start="2019-01-01", end="2024-07-31", interval="1mo")

# We select Adjusted closing prices and drop any row with NA values:
adjprices = prices['Adj Close'].dropna()
GSPC stands for Global Standard & Poors Composite, which is the S&P500 index.
Now we will do some informative plots to start learning about the possible relationship between GSPC and MSFT.
Unfortunately, the range of stock prices and market indexes can vary a lot, so it is difficult to compare price movements in one plot. For example, if we plot the MSFT prices and the S&P500:
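A minimal sketch of this plot (using the adjprices dataframe and the matplotlib import from above):

# Plotting the adjusted prices of MSFT and the S&P500 index in the same figure:
adjprices.plot(y=['MSFT','^GSPC'])
plt.show()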
It looks like the GSPC has had a better performance, but this is misleading since both investments have very different price ranges.
When comparing the performance of 2 or more stock prices and/or indexes, it is a good idea to generate an index for each series, so that we can emulate how much $1.00 invested in each stock/index would have moved over time. We can divide the stock price of any month by the stock price of the first month to get a growth factor:
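A sketch of this calculation (the column names assume the adjprices dataframe from above):

# Dividing each monthly price by the price of the first month to get a growth factor:
index = adjprices / adjprices.iloc[0]
# Plotting how much $1.00 invested in each instrument would have changed over time:
index.plot(y=['MSFT','^GSPC'])
plt.show()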
Now we have a much better picture of which instrument has had a better performance over time. The line of each instrument represents how much $1.00 invested in the instrument would have changed over time.
Now we calculate continuously compounded monthly returns. With pandas, most data management functions work element-wise across the dataframe. In other words, operations are performed on all columns, row by row:
r = np.log(adjprices) - np.log(adjprices.shift(1))

# Dropping rows with NA values (the first month will have NAs):
r = r.dropna()

# Selecting only 2 columns (out of the 4 columns):
r = r[['MSFT','^GSPC']]

# Renaming the column names:
r.columns = ['MSFT','GSPC']
Now the r dataframe will have 2 columns for both cc historical returns:
r.head()
                MSFT      GSPC
Date
2019-02-01  0.070250  0.029296
2019-03-01  0.055671  0.017766
2019-04-01  0.101964  0.038560
2019-05-01 -0.054442 -0.068041
2019-06-01  0.083539  0.066658
To learn about the possible relationship between the GSPC and MSFT we can look at their prices and also we can look at their returns.
We start with a scatter plot to see whether there is a linear relationship between the MSFT prices and the GSPC index:
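A sketch of both scatter plots (one using the price/index levels, and one using the cc returns calculated above):

# Scatter plot of MSFT prices vs the GSPC index levels:
plt.scatter(adjprices['^GSPC'], adjprices['MSFT'])
plt.xlabel('GSPC index')
plt.ylabel('MSFT price')
plt.show()

# Scatter plot of MSFT returns vs GSPC returns:
plt.scatter(r['GSPC'], r['MSFT'])
plt.xlabel('GSPC cc return')
plt.ylabel('MSFT cc return')
plt.show()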
What do you see? Which plot conveys a stronger linear relationship?
The scatter plot using the prices conveys an apparently stronger linear relationship compared to the scatter plot using returns.
Stock returns are variables that usually do NOT grow over time; they look like a plot of heartbeats:
plt.clf()
r.plot(y=['MSFT','GSPC'])
plt.show()
Stock returns behave like a stationary variable since they do not have a growing or declining trend over time. A stationary variable is a variable that has a similar average and standard deviation in any time period.
Stock prices (and indexes) are variables that usually grow over time (sooner or later). These variables are called non-stationary variables. A non-stationary variable usually changes its mean depending on the time period.
In statistics, we have to be very careful when looking at linear relationships when using non-stationary variables, like stock prices. It is very likely that we end up with spurious measures of linear relationships when we use non-stationary variables. To learn more about the risk of estimating spurious relationships, we will cover this issue in the topic of time-series regression models (covered in a more advanced module).
Then, in this case it is better to look at linear relationship between stock returns (not prices).
4.1 Covariance
The Covariance between 2 random variables, X and Y, is a measure of linear relationship.
The Covariance is the average of product deviations between X and Y from their corresponding means.
For 2 random variables X and Y with N observations each, we can calculate the population covariance as:

Cov(X,Y)=\frac{1}{N}\sum_{i=1}^{N}(X_{i}-\bar{X})(Y_{i}-\bar{Y})

and the sample covariance as:

Cov(X,Y)=\frac{1}{N-1}\sum_{i=1}^{N}(X_{i}-\bar{X})(Y_{i}-\bar{Y})
Why divide by N-1 instead of N? In Statistics, we assume that we work with samples and never have access to the population, so when calculating a sample measure, we always miss data. The sample formula gives a slightly more conservative (larger in magnitude) value than the population formula. That is the reason why we use N-1 degrees of freedom instead of N.
Sample covariance will always be a little bit greater in magnitude than population covariance, but they will be similar. When N is large (N>30), population and sample covariance values will be almost the same. The sample covariance formula is the default formula in all statistical software.
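For instance, a small sketch (with two short hypothetical series, just to illustrate) showing that pandas uses the sample formula (dividing by N-1) by default, while the population formula divides by N:

import pandas as pd

# Two short hypothetical series just to illustrate:
x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
y = pd.Series([2.0, 1.0, 4.0, 3.0, 6.0])

# Sample covariance (divides by N-1); this is the pandas default:
sample_cov = x.cov(y)
# Population covariance (divides by N):
N = len(x)
pop_cov = ((x - x.mean()) * (y - y.mean())).sum() / N
print(sample_cov, pop_cov)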
If Cov(X,Y)>0, we can say that, on average, there is a positive linear relationship between X and Y. If Cov(X,Y)<0, we can say that there is a negative relationship between X and Y.
A positive linear relationship between X and Y means that if X increases, it is likely that Y will also increase; and if X decreases, it is likely that Y will also decrease.
A negative linear relationship value between X and Y means that if X increases, it is likely that Y will decrease; and if X decreases, it is likely that Y will increase.
If we want to test whether Cov(X,Y) is positive and significant, we need to do a hypothesis test. If the pvalue<0.05 and Cov(X,Y) is positive, then we can say with 95% confidence that there is a positive linear relationship.
There is no constraint in the possible values of Cov(X,Y) that we can get:
-\infty<Cov(X,Y)<\infty
We can interpret the sign of covariance, but we CANNOT interpret its magnitude. Fortunately, correlation is a very practical measure of linear relationship since we can interpret both its sign and its magnitude: the possible values of correlation go from -1 to +1 and represent the degree of linear relationship.
Actually, the correlation between X and Y is a standardized measure of the covariance.
4.2 Correlation
Correlation is a very practical measure of linear relationship between 2 random variables. It is actually a scaled version of the Covariance:
Corr(X,Y)=\frac{Cov(X,Y)}{SD(X)SD(Y)}
If we divide Cov(X,Y) by the product of the standard deviations of X and Y, we get the correlation, which can have values only between -1 and +1.
-1<=Corr(X,Y)<=1
If Corr(X,Y) = +1, that means that X moves exactly in the same way as Y, so Y is proportional to X (in the same direction); actually, Y would be an exact positive linear function of X.
If Corr(X,Y) = -1, that means that Y moves exactly proportionally to X, but in the opposite direction.
If Corr(X,Y) = 0, that means that the movements of Y are not linearly related to the movements of X; in this case, there is no clear linear pattern of how Y moves when X moves.
If 0<Corr(X,Y)<1, there is a positive linear relationship between X and Y. The strength of this relationship is given by the magnitude of the correlation. For example, if Corr(X,Y) = 0.50, there is a moderate positive relationship: when X increases, Y will usually (but not always) increase as well.
If -1<Corr(X,Y)<0, there is a negative linear relationship between X and Y. The strength of this relationship is given by the magnitude of the correlation. For example, if Corr(X,Y) = -0.50, there is a moderate negative relationship: when X increases, Y will usually (but not always) decrease (and vice versa).
If we want to test that Corr(X,Y) is positive and significant, we need to do a hypothesis test. The formula for the standard error (standard deviation of the correlation) is:
SD(corr)=\sqrt{\frac{(1-corr^{2})}{(N-2)}}
Then, the t-Statistic for this hypothesis test will be:
t=\frac{corr}{\sqrt{\frac{(1-corr^{2})}{(N-2)}}}
If Corr(X,Y)>0 and t>2 (its pvalue will be <0.05), then we can say that we have a 95% confidence that there is a positive linear relationship; in other words, that the correlation is positive and statistically significant (significantly greater than zero).
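A minimal sketch of this test using the formulas above (the values of corr and N are hypothetical, just to illustrate):

import numpy as np
from scipy import stats as st

# Hypothetical correlation and sample size:
corr = 0.40
N = 36

# Standard error of the correlation:
se_corr = np.sqrt((1 - corr**2) / (N - 2))
# t-Statistic and its 2-tailed pvalue (t-Student with N-2 degrees of freedom):
t = corr / se_corr
pvalue = 2 * st.t.sf(abs(t), df=N-2)
print(t, pvalue)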
4.3 Calculating covariance and correlation
We can program the covariance of 2 variables according to the formula:
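A sketch of this calculation, first applying the sample covariance formula manually to the MSFT and GSPC returns, and then using numpy's cov function, which returns the covariance matrix (named covm here and used below):

# Manual sample covariance between MSFT and GSPC returns:
N = r['MSFT'].count()
cov_manual = ((r['MSFT'] - r['MSFT'].mean()) * (r['GSPC'] - r['GSPC'].mean())).sum() / (N - 1)

# Covariance matrix with numpy (the sample formula is the default):
covm = np.cov(r['MSFT'], r['GSPC'])
covm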
The cov function calculates the covariance matrix using both return series. We can find the covariance in the non-diagonal elements, which have the same value since the covariance matrix is symmetric.
The diagonal values have the variances of each return since the covariance of one variable with itself is actually its variance (Cov(X,X) = Var(X) ) .
Then, to extract the covariance between MSFT and GSPC returns we can extract the element in the row 1 and column 2 of the matrix:
cov = covm[0,1]
cov
0.002374822954620712
This value is exactly the same we calculated manually.
We can use the corrcoef function of numpy to calculate the correlation matrix:
corr = np.corrcoef(r['MSFT'], r['GSPC'])
corr
array([[1. , 0.74473065],
[0.74473065, 1. ]])
The correlation matrix will have +1 in its diagonal since the correlation of one variable with itself is +1. The non-diagonal value will be the actual correlation between the corresponding 2 variables (the one in the row, and the one in the column).
We could also manually calculate correlation using the previous covariance:
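A sketch of that manual calculation, dividing the covariance computed above by the product of the standard deviations of both returns:

# Correlation = covariance / (SD(X) * SD(Y)):
corr_manual = cov / (r['MSFT'].std() * r['GSPC'].std())
corr_manual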
We can use the scipy pearsonr function to calculate correlation and also the 2-tailed pvalue to see whether the correlation is statistically different than zero:
from scipy.stats import pearsonr
corr2 = pearsonr(r['MSFT'], r['GSPC'])
corr2
The pvalue is almost zero. MSFT and GSPC returns have a positive and very significant correlation (at the 99.9999…% confidence level).
5 Introduction to Linear Regression
Up to now we have learned about:
Descriptive Statistics
The Histogram
The Central Limit Theorem
Hypothesis Testing
Covariance and Correlation
Without the idea of summarizing data with descriptive statistics, we cannot conceive the histogram. Without the idea of the histogram we cannot conceive the CLT, and without the CLT we cannot make inferences for hypothesis testing. We can apply hypothesis testing to test claims about random variables. These random variables can be one mean, difference of 2 means, correlation, and also coefficients of the linear regression model. But what is the linear regression model?
We learned that covariance and correlation are measures of linear relationship between 2 random variables, X and Y. The simple regression model also measures the linear relationship between 2 random variables (X and Y), but the difference is that X is supposed to explain the movements of Y, so Y depends on the movements of X, the independent variable. In addition, the regression model estimates a linear equation (the regression line) that represents how much Y (on average) moves with movements of X, and what the expected value of Y is when X=0.
The simple linear regression model is used to understand the linear relationship between two variables assuming that one variable, the independent variable (IV), can be used as a predictor of the other variable, the dependent variable (DV).
Besides using linear regression models to better understand how the dependent variable moves or changes according to changes in the independent variable, linear regression models are also used for prediction or forecasting of the dependent variable.
The simple regression model considers only one independent variable, while the multiple regression model can include more than one independent variable. But both models only consider one dependent variable. Then, we can use regression models for:
• Understanding the relationship between a dependent variable and one or more independent variables - also called explanatory variables
• Predicting or estimating the expected value of the dependent variable according to specific values of the independent variables
In regression models it is assumed that there is a linear relationship between the dependent variable and the independent variables. It might be possible that in reality, the relation is not linear. A linear regression model does not capture non-linear relationships unless we do specific mathematical transformations of the variables.
In this workshop I illustrate the simple regression model with an example from Finance: the Market Regression Model.
The Market Model states that the expected return of a stock is given by its alpha coefficient (α) plus its market beta coefficient (β) times the market return. In mathematical terms:
E[R_i] = α + β(R_M)
We can express the same equation using β_0 as alpha, and β_1 as market beta:
E[R_i] = β_0 + β_1(R_M)
We can estimate the β_0 and β_1 coefficients by running a simple linear regression model specifying that the market return is the independent variable and the stock return is the dependent variable. It is strongly recommended to use continuously compounded returns instead of simple returns to estimate the market regression model.
The market regression model can be expressed as:
r_{(i,t)} = b_0 + b_1*r_{(M,t)} + ε_t
Where:
ε_t is the error at time t. Thanks to the Central Limit Theorem, this error behaves like a Normally distributed random variable ∼ N(0, σ_ε) since the error is actually a linear combination of random variables; the error term ε_t is expected to have mean=0 and a specific standard deviation σ_ε.
r_{(i,t)} is the return of the stock i at time t.
r_{(M,t)} is the market return at time t
b_0 and b_1 are called regression coefficients
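As an illustration (a sketch that is not part of the original workshop code), we could estimate the market regression model for MSFT using the monthly cc returns calculated earlier (the r dataframe) and numpy's polyfit function:

import numpy as np

# Estimating the market model: MSFT cc returns (dependent) vs market (GSPC) cc returns (independent):
b1, b0 = np.polyfit(r['GSPC'], r['MSFT'], 1)
print('b0 (alpha) =', b0)
print('b1 (market beta) =', b1)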
6 Interesting facts from history
One of the most common methods to estimate linear regression models is called ordinary least squares, which was first developed by mathematicians to predict planets’ orbits. On January 1st, 1801, the Italian priest and astronomer Giuseppe Piazzi discovered a small planetoid (asteroid) in our solar system, which he named Ceres. Piazzi observed and recorded 22 Ceres positions during 42 days, but then Ceres was lost in the glare of the Sun. Most European astronomers started looking for a way to predict Ceres’ orbit. The great German mathematician Carl Friedrich Gauss successfully predicted Ceres’ orbit using a least squares method he had developed in 1796, when he was 18 years old. Gauss applied his least squares method using the 22 Ceres observations and 6 explanatory variables. Gauss did not publish his least squares method until 1809 (Gauss 1809); interestingly, the French mathematician Adrien-Marie Legendre was the first to publish the least squares method, in 1805 (Legendre 1805).
About 70 years later, the English anthropologist Francis Galton and the English mathematician Karl Pearson - leading founders of the Statistics discipline - used the foundations of the least squares method to develop the linear regression model. Galton developed the conceptual foundation of regression models when he was studying the inherited characteristics of sweet peas. Pearson further developed Galton’s ideas with rigorous mathematical treatment.
Pearson used to work in Galton’s laboratory. When Galton died, Pearson wrote Galton’s biography. In this biography (Pearson 1930), Pearson described how Galton came up with the idea of regression. In 1875 Galton gave sweet pea seeds to seven friends. All sweet pea seeds had uniform weights. His friends harvested the sweet peas and returned the plants to Galton. He made a graph comparing the size of each plant with its respective parent’s size, and he found that all of them had parents of larger size. When graphing the offspring’s size on the Y axis and the parents’ size on the X axis, he tried to manually draw a line that could represent this relationship, and he found that the line’s slope was less than 1.0. He concluded that the size of these plants in their generation was “regressing” to the supposed mean of this species (considering several generations).
Two research articles by Galton (Galton 1886) and Pearson (Pearson 1930), written in 1886 and 1903 respectively, further developed the foundations of regression models. They examined why sons of very tall fathers are usually shorter than their fathers, while sons of very short fathers are usually taller than their fathers. After collecting and analyzing data from hundreds of families, they concluded that the height of an individual in a community or population tends to “regress” to the average height of the population where they were born. If the father is very tall, then his sons’ heights will “regress” to the average height of such population. If the father is very short, then his sons’ heights will also “regress” to such average height. They named their model the “regression” model. Nowadays the interpretation of regression models is not quite the same as “regressing” to a specific average value. Nowadays regression models are used to examine linear relationships between a dependent variable and a set of independent variables.
7 Types of data structures
The market model is a time-series regression model. In this model we looked at the relationship between 2 variables representing one feature or attribute (returns) of two “subjects” over time: a stock and a market index. The market model is an example of a regression model where the data structure, or type of data used, is time-series data. This type of regression model is called a pooled time-series regression.
There are basically three types of data used in regression models:
Time-series: each observation represents one period, and each column represents one or more variables, which are characteristics of one or more subjects. Then, we have one or more variables measured in several time periods.
Cross-sectional: each observation represents one subject in only one time period, and each column represents variables or characteristics of the subjects.
Panel data: this is a combination of time-series with cross-sectional structure.
Then, we can consider the market model as a pooled “time-series” regression model.
Another way to classify regression models is based on the number of independent variables. If the regression model considers only one independent variable, then the model is known as a simple regression model. If the model considers more than one independent variable, the model is known as a multiple regression model.
8 The OLS method
Ordinary Least Squares (OLS) is an optimization method to estimate the best values of the beta coefficients and their standard errors in a linear regression model.
In the case of the simple regression model, we need to estimate the best values of the b_0 coefficient (the intercept) and the b_1 coefficient (the slope of the regression line). If Y is the dependent variable and X the independent variable, the regression equation is as follows:
y_i = b_0 + b_1(x_i) + \epsilon_i
The regression line is given by the expected value of Y (also called \hat{Y}):
E[y_i] = b_0 + b_1(x_i) = \hat{y}_i
In this case, i goes from 1 to N, the total number of observations of the sample.
If we plot all pairs of (x_i,y_i) we can first visualize whether there is a linear relationship between Y and X. In a scatter plot, each pair of (x_i,y_i) is one point in the plot.
The purpose of OLS is to find the regression line that best represents all the points (x_i,y_i). The b_0 and b_1 coefficients define the regression line. If b_0 changes, the intercept of the line moves. If b_1 changes, the slope of the line changes.
To find the best regression line (the best b_0 and b_1), the OLS method minimizes the sum of squared errors of the regression equation. Then, OLS is an optimization method with the following objective:
Minimize \sum_{i=1}^{N}(y_{i}-\hat{y}_{i})^{2}
If we replace \hat{y}_{i} by the regression equation we get:

Minimize \sum_{i=1}^{N}\left(y_{i}-(b_{0}+b_{1}x_{i})\right)^{2}
Then, this sum can be seen as a function of b_0 and b_1. If b_0 changes, this sum changes; the same with b_1. Then we can re-write the optimization objective as:

\min_{b_{0},b_{1}}\;f(b_{0},b_{1})=\sum_{i=1}^{N}\left(y_{i}-b_{0}-b_{1}x_{i}\right)^{2}
This is a quadratic function of 2 variables. If we get its 2 first partial derivatives (with respect to b_0 and b_1) and make them equal to zero we will get 2 equations and 2 unknowns, and the solution will be the optimal values of b_0 and b_1 that minimizes this sum of squared errors. The set of these 2 partial derivatives is also called the gradient of the function.
If you remember basic geometry, if we imagine possible values of b_0 in the X axis and values of b_1 in the Y axis, then this function is actually a single-curved area (surface). The optimal point (b_0, b_1) will be the lowest point of this curved surface.
Let’s do an example with a dataset of only 4 observations:
X    Y
1    2
2   -1
7   10
8    7
If I do a scatter plot and a line that fits the points:
Code
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import numpy as np

data = {'x': [1,2,7,8], 'y': [2,-1,10,7]}
df = pd.DataFrame(data)
b1, b0 = np.polyfit(df.x, df.y, 1)
df['yhat'] = b0 + b1*df['x']

#plt.clf()
plt.scatter(df.x, df.y)
plt.plot(df.x, df.yhat, c="orange")
plt.xticks(np.arange(-4,14,1))
plt.yticks(np.arange(-2,11,1))
for i in range(4):
    x = df.x.iloc[i]
    ymin = df.y.iloc[i]
    ymax = df.yhat.iloc[i]
    if (ymin > ymax):
        temp = ymax
        ymax = ymin
        ymin = temp
    plt.vlines(x=x, ymin=ymin, ymax=ymax, color='r')
plt.axhline(y=0)
plt.axvline(x=0)
plt.xlabel("X")
plt.ylabel("Y")
plt.grid()
plt.show()
The error of each point is the red vertical line, which is the distance between the point and the prediction of the regression line. This distance is given by the difference between the specific y_i value and its predicted value:
\epsilon_i=(y_i - \hat{y}_i)
In this example, 2 errors are positive and 2 are negative.
The purpose of OLS is to find the values of b_0 and b_1 such that the sum of the squares of all errors is minimized. Mathematically, there is only one solution for a specific set of X and Y values.
Let’s return to the objective function, which is determined by both coefficients, b_0 and b_1.
We can do a 3D plot for this function to have a better idea. It is a 3D plot since the function depends on 2 values: b_0 and b_1:
Code
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt
import collections

# I define a function to get the sum of squared errors given specific b0 and b1 coefficients:
def sumsqerrors2(b1, b0, df):
    return sum( ( df.y - (b0 + b1*df.x) ) **2 )
# Note that df is a dataframe, so this line of code performs a row-wise operation to avoid
# writing a loop to sum each squared error for each observation

# Create the plot:
fig = plt.figure()
ax = fig.add_subplot(1, 1, 1, projection='3d')

# I create 20 possible values of beta0 and beta1:
# beta1 will move between -1 and 3
b1s = np.linspace(-1, 3.0, 20)
# beta0 will move between -2 and 2:
b0s = np.linspace(-2, 2, 20)

# I create a grid with all possible combinations of beta0 and beta1 using the meshgrid function:
# M will be all the b1s values, and B the beta0 values:
M, B = np.meshgrid(b1s, b0s)

# I calculate the sum of squared errors with all possible pairs of beta0 and beta1 of the previous grid:
zs = np.array([sumsqerrors2(mp, bp, df) for mp, bp in zip(np.ravel(M), np.ravel(B))])
# I reshape the zs (squared errors) from a vector to a grid of the same size as M (20x20)
Z = zs.reshape(M.shape)

ax.plot_surface(M, B, Z, rstride=1, cstride=1, color='b', alpha=0.5)
ax.set_xlabel('b1')
ax.set_ylabel('b0')
ax.set_zlabel('sum sq.errors')
plt.show()
We see that this function is single-curved. The sum of squared errors changes with different paired values (b_0,b_1). The lowest point of this surface will be the optimal values of (b_0, b_1) where the sum of squared error is the minimum of all. Then, how can we calculate this optimal point?
Remembering a little bit of Calculus and simultaneous equations, we can do the following:
Take the partial derivative of the function with respect to b_0 and set it equal to zero:

\frac{\partial f}{\partial b_{0}}=-2\sum_{i=1}^{N}\left[y_{i}-(b_{0}+b_{1}x_{i})\right]=0\qquad\text{(Equation 1)}
Since \left[y_{i}-(b_{0}+b_{1}x_{i})\right] is the error for the i-th observation, Equation 1 states that the sum of all errors must be equal to zero.
Let’s do the same for the other partial derivative, with respect to b_1:

\frac{\partial f}{\partial b_{1}}=-2\sum_{i=1}^{N}x_{i}\left[y_{i}-(b_{0}+b_{1}x_{i})\right]=0\qquad\text{(Equation 2)}
Now we have 2 equations with 2 unknowns. In this case, the unknown variables are b_0 and b_1, since the X and Y values are given from the start of the problem.
Solve the system of equations
We can further simplify both equations by applying the sum operator to each term. Starting with Equation 1:

\sum_{i=1}^{N}y_{i}-Nb_{0}-b_{1}\sum_{i=1}^{N}x_{i}=0
Since \bar{y}=\frac{1}{N}\sum_{i=1}^Ny_{i}, then \sum_{i=1}^{N}y_{i}=N*\bar{y}, and the same for the x variable, then:
N\bar{y}-Nb_{0}-b_{1}N\bar{x}=0
Dividing both sides by N:
\bar{y}-b_0-b_1(\bar{x})=0
This is another way to express Equation 1. This form of the equation indicates that the point (\bar{x},\bar{y}) must be on the regression line, since the error at this point is equal to zero!
From previous equation, then we get that:
b_0=\bar{y} - b_1\bar{x}
After applying the sum operator to each term of Equation 2 and substituting b_0=\bar{y}-b_{1}\bar{x}, we can solve for b_1:

b_{1}=\frac{\sum_{i=1}^{N}(x_{i}-\bar{x})(y_{i}-\bar{y})}{\sum_{i=1}^{N}(x_{i}-\bar{x})^{2}}=\frac{Cov(x,y)}{Var(x)}
Interestingly, from this final formula for b_1, which is the slope of the regression line, we can see that b_1 is how much y covaries with x relative to the variability of x. This is actually the concept of the slope of a line! It is the sensitivity of how much y changes when x changes. Then, b_1 can be seen as the expected rate of change of y with respect to x, which is a derivative!
How are the b_1 coefficient and the correlation between x and y related? We learned that correlation measures the linear relationship between 2 variables. The b_1 coefficient also measures the linear relationship between 2 variables; however, there is an important difference. Let’s look at the correlation formula:
Corr(x,y)=\frac{Cov(x,y)}{SD(x)SD(y)}
We can express Covariance in terms of Correlation:
Cov(x,y)=Corr(x,y)SD(x)SD(y)
If we plug this formula in the b_1 formula:
b_1=Corr(x,y)*\frac{SD(y)}{SD(x)}
Then, we can see that b_1 is a type of scaled correlation: it is the correlation scaled by the ratio of both standard deviations. Therefore, b_1 provides information not only about the relationship, but also about the sensitivity of how much y changes in magnitude for each 1-unit change in x!
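As a quick check, here is a sketch (using the 4-observation example from above) that computes b_1 and b_0 with these formulas and compares them with the np.polyfit estimates:

import numpy as np
import pandas as pd

# The same 4-observation example used above:
df = pd.DataFrame({'x': [1, 2, 7, 8], 'y': [2, -1, 10, 7]})

# b1 = Cov(x,y) / Var(x), and b0 = mean(y) - b1 * mean(x):
b1 = df.x.cov(df.y) / df.x.var()
b0 = df.y.mean() - b1 * df.x.mean()

# Compare with the OLS line estimated by np.polyfit:
b1_fit, b0_fit = np.polyfit(df.x, df.y, 1)
print(b1, b0)
print(b1_fit, b0_fit)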
9 Quizzes
Go to Canvas and do Quiz 1 and Quiz 2. You have 3 attempts for each; only the best attempt is considered.
Quizzes 1 and 2 will count for 50% of Workshop 2 and 3 respectively.
10 References
Galton, Francis. 1886. “Family Likeness in Stature.” Proceedings of the Royal Society 40: 42–72.
Gauss, Carl Friedrich. 1809. Theory of Motion of the Celestial Bodies Moving in Conic Sections Around the Sun.
Legendre, Adrien-Marie. 1805. Nouvelles Méthodes Pour La Détermination Des Orbites Des Comètes.
Pearson, Karl Lee. 1930. The Life, Letters and Labors of Francis Galton.