Lecture 2

Review: Linear Equations

In 1 dimension given by $y=mx+b$
x is the independent variable
y is the dependent variable
m is the slope
b is the y intercept
(-b/m) is the x intercept, but usually not important in econometrics

Linear Equations: Examples

y intercept?
Slope?
Equation?

Slope and intercept interpretation

You rent a car. You estimate the cost per day to rent and operate the car is given by the equation

$Price = 30 + 0.2 * miles$

Interpret the slope and y intercept in words.

NOTE: “Interpret” in this class always refers to a basic explanation of a formula or coefficient which directly uses the numeric values and that a (reasonably) normal person can understand
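As a rough illustration of what the slope and intercept mean here, the equation can be evaluated at a few mileages. This is a minimal Python sketch; the function name `rental_cost` is made up for the example, and the cost is in whatever currency the equation is stated in.

```python
def rental_cost(miles):
    """Daily cost implied by Price = 30 + 0.2 * miles."""
    return 30 + 0.2 * miles

# Intercept: the cost of a day in which the car is not driven at all.
cost_at_zero = rental_cost(0)

# Slope: the extra cost of driving one more mile, here going from 100 to 101 miles.
extra_mile = rental_cost(101) - rental_cost(100)

print(cost_at_zero)
print(round(extra_mile, 2))
```

In words: the day costs 30 before any driving, and each additional mile adds 0.2 to the cost.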

Linear Equations: Higher Dimensions

In multiple dimensions will be given by $y=\beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_k x_k$
- $\beta_0$ is the intercept, y is the dependent variable
- Now we have k independent variables, $x_1, ..., x_k$, each with its own slope $\beta_1, ..., \beta_k$

(Optional) Linear transformations as a matrix

In matrix notation: $Y=X\beta$
- $Y$ is an $n$ by 1 column vector (where n individuals are observed)
- $X$ is an $n$ by $k$ matrix
- $\beta$ is a $k$ by 1 column vector (where k parameters are used)
The definition of matrix multiplication makes this work out perfectly

(Optional) Matrix multiplication Example

$\begin{bmatrix} y_1 \\ y_2 \\ y_3 \end{bmatrix} = \begin{bmatrix}1 & x_{11} & x_{21} \\ 1 & x_{12} & x_{22}\\ 1 & x_{13} & x_{23}\end{bmatrix} * \begin{bmatrix}\beta_0 \\ \beta_1 \\ \beta_2\end{bmatrix}$

$\begin{bmatrix}y_1=\beta_0 + \beta_1 x_{11} + \beta_2 x_{21} \\ y_2 = \beta_0 + \beta_1 x_{12} + \beta_2 x_{22} \\ y_3 = \beta_0 + \beta_1 x_{13} + \beta_2 x_{23}\end{bmatrix}$

where, e.g. $x_{21}$ is $x_2$ for individual 1

Dropping subscripts: $y=\beta_0+\beta_1x_1+\beta_2x_2$
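The row-by-row expansion above can be checked numerically. The sketch below uses made-up numbers for three individuals and two regressors (none of these values come from the lecture) and a small hypothetical helper `mat_vec` written in plain Python.

```python
# Hypothetical data: 3 individuals, 2 regressors, plus a leading column of 1s
# so that beta_0 acts as the intercept.
X = [
    [1, 2.0, 5.0],
    [1, 3.0, 1.0],
    [1, 4.0, 0.0],
]
beta = [10.0, 2.0, -1.0]  # beta_0, beta_1, beta_2 (made-up values)

def mat_vec(matrix, vector):
    """Multiply an n-by-k matrix by a k-by-1 vector, one row at a time."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

y = mat_vec(X, beta)

# Each entry of y is beta_0 + beta_1 * x_1 + beta_2 * x_2 for one individual.
by_hand = [beta[0] + beta[1] * row[1] + beta[2] * row[2] for row in X]
print(y)
print(by_hand)
```

Both lists agree, which is all the matrix notation is saying: one linear equation per observed individual.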

Higher Dimensions: Example

Linear Equations: Question

What is the formula for this line?

Stats Review

Data types

The names/categories aren’t important to know, but the interpretation is. This will show up frequently in class:

Interval: a 1 unit change is the same everywhere (dollars)
Ordinal: higher numbers are better, but the difference isn’t easily interpretable (e.g. you code 0 for a HS dropout, 1 for HS degree, 2 for some college, 3 for bachelor’s degree, etc)
Categorical: the number is just an identifier with no numeric meaning (zip code)

A 10% increase in revenue has a very different interpretation from a 10% increase in education or zip code.

Data types - a binary variable

You have data on employees at your company. You find the following average relationship between gender (0 for male and 1 for female) and height (in cm):

$height = 176 - 14*gender$

Interpret the slope of -14 in this context

Data structures

Cross-sectional: A snapshot of data at a given time. Students’ grades in this classroom at end of semester.
Time Series: Study 1 unit over time. Your grade every week in this class.
Panel Data: Combination of the above two. Each student’s grade every week.

There are other distinctions based on whether the same individuals are followed over time
These will determine the subscripts of our regression equations later in the course

Data structures: Cross Section

Attendance vs grades, simulated from anonymized data

     student  baseline attendance     grade
  1:       1 0.9507893  0.9657565 110.46481
  2:       2 0.7784678  1.0628186  80.41383
  3:       3 1.0457980  1.0900670 112.30689
  4:       4 1.1704702  0.8975915  87.35327
  5:       5 0.9979116  0.3249460  72.55599
 ---                                       
205:     205 1.0530886  1.0166823  84.04216
206:     206 1.0585210  0.4560729  82.68328
207:     207 1.0844113  0.8926517  90.41293
208:     208 1.0179260  1.2323755  74.08424
209:     209 0.6371130  0.8295373  74.87772

Data Structures: Cross Section Plot

Data Structures: Time Series

           Date    Close
  1: 2021-11-01 402.8633
  2: 2021-11-02 390.6667
  3: 2021-11-03 404.6200
  4: 2021-11-04 409.9700
  5: 2021-11-05 407.3633
 ---                    
295: 2023-01-03 108.1000
296: 2023-01-04 113.6400
297: 2023-01-05 110.3400
298: 2023-01-06 113.0600
299: 2023-01-09 119.7700

Data Structures: Time Series Screenshot

Data Structures: Panel Screenshot

Data is simulated from actual anonymized data

      student week attendance    grade Roster           TA   Reason
   1:     147    9          0 24.21382      6      ThatGuy Canceled
   2:     110    5          1 24.58975      2 ThatOtherGuy         
   3:     209    2          1 20.23604      1      ThatGuy         
   4:      50   10          1 24.95760      5 ThatOtherGuy         
   5:     165   10          0 24.43024      6      ThatGuy         
  ---                                                              
2306:      82    8          1 18.87512      4      ThatGuy         
2307:     127    2          1 23.49707      2 ThatOtherGuy         
2308:     122   10          1 24.45239      3 ThatOtherGuy         
2309:       6    7          0 25.04620      5 ThatOtherGuy         
2310:      92    2          0 23.28823      1      ThatGuy

Panel Data

Populations vs Samples

When conducting research we need to define the population we’re interested in
- For average salary by educational attainment, do we include 16 year olds? 70 year olds? Unemployed individuals? Part-time?
For a population, a measure of interest is called a parameter (average income in the US)
When data is limited we need to estimate this parameter using a sample; the estimate is called a statistic

Populations vs Samples

Mean individual income in the US was $57,143 in 2021 (from FRED St Louis).
If we randomly sample 1000 individuals we may end up getting an average of $51,000 in this specific sample.
Here $\mu=57143, \hat\mu =51000$
- A hat is used to indicate an estimate
parameters are true values, statistics are estimates

Common Sample Statistics: mean

An individual observation of outcome x is labeled as $x_i,\ i=1,2,...,n$
You could also label them, e.g. $income_{sam}, income_{fred}$, etc.
Mean: $E[X]=\mu$ for population, $\bar x=\frac{x_1+x_2+...+x_n}{n}$ for sample
- E[] refers to the expectation, or weighted average, of a random variable
The mean is the average, or center, of the distribution

Sample Statistics Question:

$\bar x=?$

Common Sample Statistics: Variance

Variance: $\sigma^2=E[(x-\mu)^2]$ for population
- $=\frac{(x_1-\mu)^2+ (x_2-\mu)^2 + ... + (x_n-\mu)^2}{n}$
- Where $\mu$ is the population mean, calculated the same way as the mean before
This is the average squared distance from the mean
- Why squared?
This measures the dispersion, or spread, of a distribution, rather than the center

Common Sample Statistics: Variance

For samples: $s^2= \frac{(x_1-\bar x)^2+ (x_2-\bar x)^2 + ... + (x_n-\bar x)^2}{n-1}$
We divide by n-1 instead of n to make the result unbiased. The difference only matters for small samples.
Unbiased means that on average $s^2=\sigma^2$, i.e. $E[s^2]=\sigma^2$ if we repeat the experiment many times
Standard deviation: $\sigma=\sqrt{\sigma^2},\ s=\sqrt{s^2}$

Calculating Standard Deviation

Same dataset, but now we want to calculate standard deviation. Steps?

Calculating Standard Deviation

First calculate variance, the average (squared) distance from the mean:
- subtract $\bar x$ from each observation
- Square the result
- Take the mean of this result

Calculating Standard Deviation: Tabular Calculation

    x xbar difference squaredDifference
1:  1  5.2       -4.2             17.64
2:  5  5.2       -0.2              0.04
3: -7  5.2      -12.2            148.84
4: 12  5.2        6.8             46.24
5: 15  5.2        9.8             96.04

[1] 308.8

This is the sum of squared differences. To get the variance divide by $n=5$ if a population ($\sigma^2$), or $n-1=4$ if a sample ($s^2$)
- $\sigma^2=61.76, s^2=77.2$

Calculating Standard Deviation: Tabular Calculation

Finally, squared units are weird, so take the square root to get the standard deviation ($\sigma$ or $s$)
- $\sigma=7.86, s=8.79$
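The tabular calculation can be reproduced step by step. This is a minimal Python sketch using the five values of x from the table; the variable names are chosen for the example, and the `statistics` module is used only as a cross-check.

```python
import math
import statistics

x = [1, 5, -7, 12, 15]
n = len(x)

xbar = sum(x) / n                              # mean of the data
squared_diffs = [(xi - xbar) ** 2 for xi in x]  # squared distance from the mean
ss = sum(squared_diffs)                        # sum of squared differences

pop_var = ss / n            # divide by n for a population
sample_var = ss / (n - 1)   # divide by n - 1 for a sample
pop_sd = math.sqrt(pop_var)
sample_sd = math.sqrt(sample_var)

print(round(ss, 2), round(pop_var, 2), round(sample_var, 2))
print(round(pop_sd, 2), round(sample_sd, 2))

# Cross-check against the standard library's implementations.
print(math.isclose(pop_var, statistics.pvariance(x)))
print(math.isclose(sample_var, statistics.variance(x)))
```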

A note on higher moments

The information in the variance and standard deviation is captured in $E[X^2]$, the second moment.
$E[X^3]$ gives the skewness of a distribution
$E[X^4]$ gives the tail weight (kurtosis)
Knowing $E[X], E[X^2],...,E[X^n]$ for every n from 1 to $\infty$ fully specifies all “well-behaved” distributions
- Analogous to Taylor’s theorem

Random Variables

A random variable is used to represent a random event
The variable itself is typically denoted as a capital letter, and the outcome a lowercase one
- $X$ is the roll of a die. $X=5$ means a specific roll was a 5. $X=x$ means a specific roll was $x$
Random Variable Operations:
- Measure the probability of a specific event: $P(X=5)$
- Measure the average value $E[X]$
- Transform them: $X^2$ is the value of a squared dice roll. $X+Y$ is the sum of 2 dice

Random Variable Representation: Dice

Suppose we toss a six-sided die. How can we represent this event mathematically?
We can specify all possible outcomes and their associated probabilities
outcomes: {1,2,3,4,5,6}
probabilities: {$\frac{1}{6},\frac{1}{6},\frac{1}{6},\frac{1}{6},\frac{1}{6},\frac{1}{6}$}
How to visualize?

Random Variable Visualization: Histograms

Discrete Random Variables

If a random variable has a countable number of outcomes (like a dice roll or a coin flip) it is called discrete
- The function that generates the histogram is called the probability mass function, $p(x)$. $\sum p(x) = 1$
Two events are independent if their joint probability is the product of their individual probabilities
- $P(X \cap Y) = P(X) * P(Y)$
- Means that knowing the value of one random variable gives us no information about the other
- Extremely strong assumption!

Adding random variables: Dice

A roll of a 6 sided die is a random variable. $p(x)=\frac{1}{6},\ x \in \{1,2,...,6\}$
If X is the roll of 1 die, and Y of a second die, then $X+Y$ is the total when rolling two dice, and its distribution gives the probability of each total
What is the probability that $X+Y=7$?
Calculation is called a convolution. Notice the symmetry!
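For two fair, independent dice the convolution can be done by brute force: list every pair of faces and add up the probability of each total. The sketch below assumes fair six-sided dice and uses exact fractions; the names `p` and `sum_pmf` are just for illustration.

```python
from fractions import Fraction
from itertools import product

faces = range(1, 7)
p = {x: Fraction(1, 6) for x in faces}  # pmf of a single fair die

# Convolution: for every pair (x, y), add P(X=x) * P(Y=y) to the total x + y.
sum_pmf = {}
for x, y in product(faces, faces):
    sum_pmf[x + y] = sum_pmf.get(x + y, 0) + p[x] * p[y]

for total in sorted(sum_pmf):
    print(total, sum_pmf[total])

print("P(X+Y=7) =", sum_pmf[7])
```

The printed probabilities rise up to a total of 7 and fall symmetrically afterwards, which is the symmetry noted above.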

Adding random variables: histogram

Adding random variables: Expectation

Adding random variables is hard, but their mean is easy to calculate
- Weighted average: $E[X]=\sum xp(x)$
Single die has mean $\frac{1+2+3+4+5+6}{6}=3.5$
Expectation is linear: $E[\sum \alpha X_i] = \alpha \sum E[X_i]$
Two dice ($X_1+X_2$) has mean $3.5+3.5=7$
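As a quick check of linearity, the mean of two dice can be computed the long way (over all 36 pairs) and compared with the sum of the two single-die means. A minimal sketch, assuming fair independent dice:

```python
from fractions import Fraction

faces = range(1, 7)
pmf = {x: Fraction(1, 6) for x in faces}

# Weighted average for one die: sum of x * p(x).
mean_one = sum(x * px for x, px in pmf.items())

# Mean of the sum, computed directly over every pair of faces.
mean_two = sum((x + y) * pmf[x] * pmf[y] for x in faces for y in faces)

print(mean_one)             # mean of a single die
print(mean_two)             # mean of the sum of two dice
print(mean_one + mean_one)  # same value, obtained by linearity
```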

Adding random variables: Variance

For independent events, $var(X+Y)=\sigma_{x+y}^2=\sigma^2_x+\sigma_y^2$
but $\sigma_{x+y}\neq \sigma_x+\sigma_y$
- and $\sigma^2_{\alpha x} = \alpha^2 \sigma^2_x$
In general, $\sigma^2_{x+y} = \sigma_x^2 +\sigma_y^2 + 2\rho\sigma_x\sigma_y$
- $\rho$ is the correlation coefficient
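The variance rules can be verified on dice as well. The sketch below assumes two fair, independent dice (so $\rho=0$) and uses a hypothetical helper `variance` that works on a pmf stored as a dictionary; the scaling rule is checked with $\alpha=3$.

```python
import math
from fractions import Fraction

faces = range(1, 7)
p = Fraction(1, 6)

def variance(pmf):
    """Variance of a discrete random variable given as {value: probability}."""
    mean = sum(v * q for v, q in pmf.items())
    return sum((v - mean) ** 2 * q for v, q in pmf.items())

one_die = {x: p for x in faces}

# Distribution of the sum of two independent dice.
two_dice = {}
for x in faces:
    for y in faces:
        two_dice[x + y] = two_dice.get(x + y, 0) + p * p

# Distribution of 3 * X, to check the alpha^2 scaling rule.
tripled = {3 * x: p for x in faces}

var_one = variance(one_die)
var_sum = variance(two_dice)
var_tripled = variance(tripled)
sd_one = math.sqrt(var_one)
sd_sum = math.sqrt(var_sum)

print(var_one, var_sum, var_tripled)
print(sd_sum, 2 * sd_one)  # standard deviations do not simply add
```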

Continuous Random Variables

Variables that take uncountable values are called continuous random variables (e.g. height, distance)
- The function that generates the histogram is called the probability density function, $f(x)$. $\int_{-\infty}^{\infty} f(x)dx=1$
- the value at a single point does not give a probability. We need to take the area under the curve
The function $F(x)=P(X\le x)$ is called the cumulative distribution function (CDF). It is $\int_{-\infty}^xf(t)dt$

Adding random variables: Normal Distribution

General formula for adding two independent random variables is a nightmare (don’t memorize): $f_{x+y}(z)=\int_{-\infty}^{\infty}f_x(t)f_y(z-t)dt$
One distribution lets us easily calculate the mean (and sum) of random variables:
$f(x)=\frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{1}{2}(\frac{x-\mu}{\sigma})^2}$
- This is a normal random variable with mean $\mu$ and standard deviation $\sigma$

Adding random variables: Normal Distribution

If X is normal we write this as $X \sim N(\mu,\sigma)$
If $X_1,...,X_n$ are independent $N(\mu,\sigma)$, then $\frac{X_1+X_2+...+X_n}{n} \sim N(\mu,\frac{\sigma}{\sqrt{n}})$
- Almost no other distributions have this property (“closed under convolutions”)

Normal Distribution Usage

Given a normal distribution with mean 0 and standard deviation 1, what is the probability that $-1<x<1$?
Calculate using the area under the curve from -1 to 1
Can’t do by hand. Use either a table or the function pnorm in R ($\Phi(z)$)
68% of data lies between -1 and +1 standard deviation (=pnorm(1)-pnorm(-1))
95% of data is within 2 standard deviations of the mean, and 99.7% within 3 standard deviations
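The same areas can be computed outside of R. The sketch below uses Python's standard-library `statistics.NormalDist`, whose `cdf` method plays the role of `pnorm` for a standard normal; the dictionary name `within` is just for the example.

```python
from statistics import NormalDist

z = NormalDist(mu=0, sigma=1)  # standard normal

# Area under the curve between -k and +k standard deviations.
within = {k: z.cdf(k) - z.cdf(-k) for k in (1, 2, 3)}

for k, prob in within.items():
    print(k, round(prob, 4))
```

The three printed values are the basis of the 68%, 95%, and 99.7% rules of thumb.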

Normal distribution plot

Interpreting Mean and SD

The following is a normal distribution with mean 1 and standard deviation 1. What will the graph look like if instead we have mean 2 and standard deviation 1/2?

Random variables: Population vs Sampling: graph

https://mcfortran.shinyapps.io/sampling/

Sample Statistics as Random variables

Our sample statistic, e.g. $\bar x$ will change every time we use a different sample
Each individual observation, $x_i$ is a random variable. Our sample statistic is then a transformation of this vector of $x_i$s: $\bar x = (x_1+x_2 + ... + x_n)/n$
The sampling distribution described above will be different from the population distribution (each $x_i$ is just a random draw from the population distribution)

Sampling Moments

Sampling distributions have their own mean and variance (and other moments)
$E[\bar x]=\mu$
$var(\bar X) = \frac{var(X)}{n}=\frac{\sigma^2}{n}$
- $sd(\bar x)=\frac{\sigma}{\sqrt n}$
How to calculate the probability distribution?

Sampling Moments

as $n\to\infty$ $\bar x$ approaches a distribution with mean $\mu$ and variance 0
We can state this a bit more precisely using the central limit theorem
Caveat: we have assumed our samples are drawn independently. If observations are related to each other this is violated
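A small simulation can make the sampling moments concrete. The sketch below is an illustration under stated assumptions, not part of the course materials: it draws independent samples of $n=25$ fair die rolls, repeats this 4000 times with a fixed seed, and compares the mean and standard deviation of the resulting $\bar x$ values with $\mu$ and $\sigma/\sqrt n$.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

n = 25       # observations per sample (assumed for the example)
reps = 4000  # number of repeated samples (assumed for the example)

mu = 3.5                    # mean of a fair die
sigma = math.sqrt(35 / 12)  # standard deviation of a fair die

# Each entry is the sample mean of n independent die rolls.
sample_means = [
    statistics.fmean([random.randint(1, 6) for _ in range(n)])
    for _ in range(reps)
]

mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.pstdev(sample_means)

print(mean_of_means, mu)                   # should be close to each other
print(sd_of_means, sigma / math.sqrt(n))   # should be close to each other
```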

Administrative Miscellanea

Homework 2 due next Wednesday before class
Problem set 1 due Friday at midnight
Quiz 1 next Wednesday in class
Finish review, start with bivariate OLS today
I’ll skip programming lab 2, which was scheduled for Monday
- Just make sure you’re set up for the first problem set
- Please come to office hours (or schedule time) if having issues

Central Limit Theorem

If we have independent, identically distributed (iid) random samples from a population with finite variance the Central Limit Theorem applies:
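For large $n$, $\bar x$ is approximately distributed $N(\mu,\frac{\sigma}{\sqrt{n}})$; formally, $\frac{\bar x-\mu}{\sigma/\sqrt{n}}\to N(0,1)$ in distribution as $n\to\infty$
Once samples get large, our sampling distribution becomes approximately normal with declining variance, regardless of the shape of our population distribution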
Many consider this to be the most beautiful theorem in mathematics
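One way to see the theorem at work is to start from a clearly non-normal population and check whether sample means behave like a normal variable. The sketch below is an assumption-laden illustration: it uses an exponential population with rate 1 (so $\mu=\sigma=1$), samples of size 50, and 5000 repetitions with a fixed seed, then measures how often $\bar x$ lands within one standard error of $\mu$.

```python
import math
import random
import statistics

random.seed(1)  # fixed seed so the run is repeatable

n = 50       # observations per sample (assumed for the example)
reps = 5000  # number of repeated samples (assumed for the example)

mu = 1.0     # mean of an exponential distribution with rate 1
sigma = 1.0  # its standard deviation

# Sample means from a strongly right-skewed population.
means = [
    statistics.fmean([random.expovariate(1.0) for _ in range(n)])
    for _ in range(reps)
]

se = sigma / math.sqrt(n)  # standard deviation of the sample mean

# Share of sample means within one standard error of mu.
share_within_one = sum(abs(m - mu) <= se for m in means) / reps
print(share_within_one)
```

If the sample means are roughly normal, this share should be near the 68% that a normal distribution puts within one standard deviation of its mean.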

Central Limit Theorem - Illustration

https://mcfortran.shinyapps.io/sampling/

Central Limit Theorem - Intuition 1

When we added dice, we had natural symmetry. If our distributions are identical this will always arise
- And if they are independent then we just need to multiply the probabilities piecewise
- So i.i.d. distributions should result in this symmetry
- Outliers become exponentially more unlikely and smoothed around the center

Central Limit Theorem - Intuition 2

Moments ($E[X^n]$) had easy math. So calculate every moment and translate this back to probabilities
- This is done with a moment generating function: the Laplace transform of the probability function
- $M_X(t)=E[e^{tx}] \implies E[X^n]=\frac{\partial^n M_X(t)}{\partial t^n}|_0$
  - Far beyond this course, but a common technique in graduate level math
This is how you formally prove the central limit theorem - The third moments and higher all decline very quickly, leaving only the mean and variance

Central Limit Theorem - Intuition 3

The normal distribution stays normal when taking averages
- The normal distribution is the only finite variance distribution with this property
- So any non-normal distribution will end up gravitating towards a normal distribution
- Formula: $e^{-x^2}$ is its own Fourier transform (related to the Laplace transform)
  - The Fourier transform makes convolutions easier to calculate. See the convolution theorem.

Cross Moments: covariance and correlation

For two random variables (or columns of data) the average product of the two, $E[XY]$, captures important information
Like variance, we subtract out the mean to make it more interpretable

Cross Moments: covariance and correlation

We define $cov(X,Y)=E[(x-\mu_x)(y-\mu_y)]$
If X and Y are both above their mean, the product is positive. If they’re both below the mean it’s also positive
If one value is above the mean and the other is below their product is negative
This then gives a measure of how closely associated X and Y are. If they move in the same direction it is positive, opposite directions is negative. “Independent” is zero

Cross Moments: covariance and correlation

As long as $cov(X,Y)\neq 0$, if we learn the value of X, we also learn something about Y
- If X and Y are independent, the covariance is 0. The reverse is not true though!
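A standard counterexample for the reverse direction is $Y=X^2$ with X symmetric around zero. The sketch below assumes X takes the values -1, 0, 1 with probability 1/3 each (an illustrative choice, not from the lecture) and shows that the covariance is zero even though Y is completely determined by X.

```python
from fractions import Fraction

third = Fraction(1, 3)

# Joint distribution of (X, Y) with Y = X^2 and X uniform on {-1, 0, 1}.
joint = {(-1, 1): third, (0, 0): third, (1, 1): third}

mean_x = sum(x * q for (x, y), q in joint.items())
mean_y = sum(y * q for (x, y), q in joint.items())
cov_xy = sum((x - mean_x) * (y - mean_y) * q for (x, y), q in joint.items())

# Independence would require P(X=0 and Y=0) = P(X=0) * P(Y=0).
p_x0 = sum(q for (x, y), q in joint.items() if x == 0)
p_y0 = sum(q for (x, y), q in joint.items() if y == 0)
p_both = joint[(0, 0)]

print("cov(X, Y) =", cov_xy)
print("P(X=0, Y=0) =", p_both, "but P(X=0) * P(Y=0) =", p_x0 * p_y0)
```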

Cross moments: correlation

Standardize further: standardize covariance to be in the range $[-1,1]$ by dividing by $\sigma_x\sigma_y$. Called correlation
$\rho\equiv cov(X,Y)/(\sigma_x\sigma_y)$
The sample statistic for $\rho$ is usually called $r$ instead of $\hat\rho$

Correlation: visualization

https://mcfortran.shinyapps.io/correlation/

Correlation: Calculation

Is $r>0, r<0,$ or, $r=0$?

Correlation: Calculation

Steps to calculate?
$cov(x,y)=E[(x-\bar x)(y-\bar y)]$:
- Calculate the mean of x and of y
- subtract the mean from each observation
- multiply the two results
- take the average
Once we have covariance, standardize to get $r$ ( or $\rho$)
- $cov(x,y)/(\sigma_x\sigma_y)$
- $\sigma$ calculated as before: subtract out the mean from each observation, square it, take the average, and then take the square root

Correlation: Calculation

   x  y xbar ybar x-xbar y-ybar product
1: 1  3    3  7.6     -2   -4.6     9.2
2: 2  5    3  7.6     -1   -2.6     2.6
3: 3 10    3  7.6      0    2.4     0.0
4: 4  8    3  7.6      1    0.4     0.4
5: 5 12    3  7.6      2    4.4     8.8

cov(x,y):  4.2

   x  y xbar ybar x-xbar y-ybar product devx  devy
1: 1  3    3  7.6     -2   -4.6     9.2    4 21.16
2: 2  5    3  7.6     -1   -2.6     2.6    1  6.76
3: 3 10    3  7.6      0    2.4     0.0    0  5.76
4: 4  8    3  7.6      1    0.4     0.4    1  0.16
5: 5 12    3  7.6      2    4.4     8.8    4 19.36

SD x:  1.414214

SD y:  3.261901

r:  0.9104655
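The table above can be reproduced in a few lines. This is a minimal Python sketch using the five (x, y) pairs from the table; like the table, it divides by n (population formulas) for both the covariance and the standard deviations.

```python
import math

x = [1, 2, 3, 4, 5]
y = [3, 5, 10, 8, 12]
n = len(x)

xbar = sum(x) / n
ybar = sum(y) / n

# Covariance: average product of the deviations from the means.
cov_xy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / n

# Standard deviations: square root of the average squared deviation.
sd_x = math.sqrt(sum((xi - xbar) ** 2 for xi in x) / n)
sd_y = math.sqrt(sum((yi - ybar) ** 2 for yi in y) / n)

# Correlation: covariance standardized by both standard deviations.
r = cov_xy / (sd_x * sd_y)

print(round(cov_xy, 4), round(sd_x, 6), round(sd_y, 6), round(r, 7))
```

Dividing by n - 1 everywhere instead would change the covariance and standard deviations but leave r the same, since the factor cancels in the ratio.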

Correlation: Observation

Once we subtract out the mean of x and y, we are always on the standard x-y plane.
We can now calculate the contribution to covariance by looking at x*y for each point.
An observation at point (x,y) after subtracting $\bar x, \bar y$ will always contribute the same amount to the covariance
Points near y=x or y=-x have the largest impact, as do points that are further from the center

Correlation: Observation

More extreme observations (outliers) are highly influential
Points exactly on y=x may contribute little if close to (0,0), but they also contribute little to both $\sigma_x$ and $\sigma_y$, so $\rho$ may not be affected much