Lecture 2

Administrative Miscellanea

Sign up for iClicker (see Blackboard for link)
Datacamp available for free (optional but useful for learning R):
- See Blackboard for link
- I picked out some relevant courses that are “due” December 1st

Review: Linear Equations

In one dimension, a line is given by $y=mx+b$
x is the independent variable
y is the dependent variable
m is the slope
b is the y intercept
(-b/m) is the x intercept, but usually not important in econometrics

Linear Equations: Examples

Linear Equations: Higher Dimensions

In multiple dimensions, a linear equation is given by $y=\beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_n x_n$
- $\beta_0$ is the intercept, y is the dependent variable
- Now we have n independent variables, $x_1, ..., x_n$, each with its own slope $\beta_i$
  - If you go on into higher classes you’ll interpret each $x_i$ as a column vector, and $X$ as a matrix
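As a minimal sketch of the formula above (Python is used here only for illustration; the function name `predict` and the coefficient values are hypothetical), evaluating the equation is just an intercept plus a sum of slope-times-variable terms:

```python
def predict(beta0, betas, xs):
    """Evaluate y = beta0 + beta_1*x_1 + ... + beta_n*x_n."""
    if len(betas) != len(xs):
        raise ValueError("need one slope per independent variable")
    return beta0 + sum(b * x for b, x in zip(betas, xs))

# Hypothetical coefficients, chosen only for illustration.
y = predict(beta0=1.0, betas=[2.0, -0.5], xs=[3.0, 4.0])
print(y)  # 1 + 2*3 - 0.5*4 = 5.0
```

With a single independent variable this reduces to $y=mx+b$, with $\beta_0=b$ and $\beta_1=m$.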

Higher Dimensions: Example

Linear Equations: Question

What is the formula for this line?

Answer:

$y = mx + b$
$(0,7)$ gives y intercept: $b=7$
$(0,7) \to (1,5)$ gives slope: $m=-2$ (we can use any 2 points by taking $\frac{\Delta y}{\Delta x}$)
$y=-2x+7$
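The same steps can be sketched in a few lines of Python (the helper name `line_through` is an assumption made for this example): take $\frac{\Delta y}{\Delta x}$ for the slope, then solve for the intercept.

```python
def line_through(p1, p2):
    """Return (slope, intercept) of the line through two points with different x values."""
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2:
        raise ValueError("vertical line: slope is undefined")
    m = (y2 - y1) / (x2 - x1)   # slope = delta y / delta x
    b = y1 - m * x1             # solve y1 = m*x1 + b for b
    return m, b

m, b = line_through((0, 7), (1, 5))
print(f"y = {m}x + {b}")  # y = -2.0x + 7.0
```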

Stats Review

Data types

The names/categories aren’t important to know, but the interpretation is. This will show up frequently in class:

Interval: a 1 unit change is the same everywhere (dollars)
Ordinal: higher numbers rank higher, but the size of the difference isn’t easily interpretable (e.g. 8 vs 12 vs 16 years of school)
Categorical: the number is just an identifier (zip code)

A 10% increase in revenue has a very different interpretation from a 10% increase in education or 10% “increase” in zip code. This should always be context specific.

Data structures

Cross-sectional: A snapshot of data at a given time. Students’ grades in this class at the end of the semester.
Time Series: Study 1 unit over time. Your grade every week in this class.
Panel Data: Combination of the above two. A cross section of grades every week in this class.

There are other distinctions based on whether the same individuals are followed over time
These will determine the subscripts of our regression equations later in the course
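A rough sketch of how the three structures differ in their indexing, using made-up grades (all names and numbers below are hypothetical): a cross section is indexed by unit $i$, a time series by time $t$, and a panel by both.

```python
# Hypothetical grades, made up purely to show the indexing.
cross_section = {"ana": 91, "ben": 78, "cal": 85}   # y_i: many units, one time
time_series = {1: 70, 2: 74, 3: 81}                 # y_t: one unit, many weeks
panel = {("ana", 1): 88, ("ana", 2): 93,            # y_it: many units, many weeks
         ("ben", 1): 75, ("ben", 2): 80}

# Fixing the week in a panel gives a cross section; fixing the student gives a time series.
week_1 = {i: y for (i, t), y in panel.items() if t == 1}
ana = {t: y for (i, t), y in panel.items() if i == "ana"}
print(week_1)  # {'ana': 88, 'ben': 75}
print(ana)     # {1: 88, 2: 93}
```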

Data structures: Cross Section

Attendance vs grades, simulated from anonymized data

     student  baseline attendance     grade
  1:       1 0.9507893  0.9657565 110.46481
  2:       2 0.7784678  1.0628186  80.41383
  3:       3 1.0457980  1.0900670 112.30689
  4:       4 1.1704702  0.8975915  87.35327
  5:       5 0.9979116  0.3249460  72.55599
 ---                                       
205:     205 1.0530886  1.0166823  84.04216
206:     206 1.0585210  0.4560729  82.68328
207:     207 1.0844113  0.8926517  90.41293
208:     208 1.0179260  1.2323755  74.08424
209:     209 0.6371130  0.8295373  74.87772

Data Structures: Cross Section Plot

Data Structures: Time Series

           Date    Close
  1: 2021-11-01 402.8633
  2: 2021-11-02 390.6667
  3: 2021-11-03 404.6200
  4: 2021-11-04 409.9700
  5: 2021-11-05 407.3633
 ---                    
295: 2023-01-03 108.1000
296: 2023-01-04 113.6400
297: 2023-01-05 110.3400
298: 2023-01-06 113.0600
299: 2023-01-09 119.7700

Data Structures: Time Series Screenshot

Data Structures: Panel Screenshot

Data is simulated from actual anonymized data

      student week attendance      grade Roster           TA Reason
   1:      97   11          1 24.3964171      2 ThatOtherGuy       
   2:     181   10          1 24.5029410      6      ThatGuy       
   3:      83    5          1 25.0527665      5 ThatOtherGuy       
   4:      87    1          1 21.1215230      6      ThatGuy       
   5:     157    5          1 23.6417363      2 ThatOtherGuy       
  ---                                                              
2306:     115    3          1 21.6190862      6      ThatGuy       
2307:     150    5          1  0.6772831      4      ThatGuy       
2308:      31    7          1 24.4333309      1      ThatGuy       
2309:     175    9          1 21.5432014      4      ThatGuy       
2310:     109   11          0 25.1429089      2 ThatOtherGuy

Populations vs Samples

For a population, a measure of interest is called a parameter (average income in the US)
When data is limited we need to estimate this parameter using a sample; the estimate is called a statistic

e.g. mean individual income in the US was $57,143 in 2021 (from FRED St Louis). If we randomly sample 1000 individuals we may end up getting an average of $51,000 in this specific sample. Here the parameter is $\mu=57143$ and the estimate is $\hat\mu =51000$
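A small simulation can make the distinction concrete. This is only a sketch: the population below is generated from an arbitrary lognormal distribution and is not the FRED data, and the seed and sizes are assumptions.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical population of 100,000 incomes (made-up numbers, not FRED data).
population = [rng.lognormvariate(10.5, 0.8) for _ in range(100_000)]
mu = statistics.fmean(population)        # parameter: uses everyone

sample = rng.sample(population, 1000)    # random sample of 1000 individuals
mu_hat = statistics.fmean(sample)        # statistic: an estimate of mu

print(f"mu = {mu:,.0f}, mu_hat = {mu_hat:,.0f}")
```

Changing the seed changes `mu_hat` but not `mu`, which is the point: the parameter is fixed, the estimate varies from sample to sample.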

Common Sample Statistics: mean

An individual observation of outcome x is labeled as $x_i,\ i=1,2,...,n$
You could also label them, e.g. $income_{sam}, income_{fred}$, etc.
Mean: $E[X]=\mu$ for population, $\bar x=\frac{x_1+x_2+...+x_n}{n}$ for sample
- E[] refers to the expectation, or average, of a random variable
The mean is the average, or center, of the distribution

Sample Statistics Question:

$\bar x=?$

Sample Statistics Answer

$x_1=1,x_2=5,x_3=-7,x_4=12,x_5=15$
$\sum x_i = x_1+x_2+...+x_5=26$
$n=5 \implies \bar x=26/5=5.2$
- If this is a population we call this $\mu=E[X]$ instead of $\bar x$
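The arithmetic above, sketched in Python as a check (the variable names are chosen for this example):

```python
import statistics

x = [1, 5, -7, 12, 15]
total = sum(x)        # x_1 + x_2 + ... + x_5
n = len(x)
x_bar = total / n     # sample mean
print(total, n, x_bar)         # 26 5 5.2
print(statistics.mean(x))      # 5.2, same result from the standard library
```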

Common Sample Statistics: Variance

Variance: $\sigma^2=E[(X-\mu)^2]$ for population
- $=\frac{(x_1-\mu)^2+ (x_2-\mu)^2 + ... + (x_n-\mu)^2}{n}$
- Where $\mu$ is the mean of all n values in the population, calculated as before
This is the average squared distance from the mean
This measures the spread of a distribution, rather than the center

Common Sample Statistics: Variance

For samples: $s^2= \frac{(x_1-\bar x)^2+ (x_2-\bar x)^2 + ... + (x_n-\bar x)^2}{n-1}$
We divide by n-1 instead of n to make the result unbiased. This only matters for small samples.
Unbiased means that on average $s^2$ equals $\sigma^2$, i.e. $E[s^2]=\sigma^2$ (if we repeat an experiment many times)
Standard deviation: $\sigma=\sqrt{\sigma^2},s=\sqrt{s^2}$
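Unbiasedness can be checked exactly in a tiny case. The sketch below assumes a population made up of the six faces of a fair die and enumerates every equally likely sample of size 2 (drawn with replacement); the helper name `var` and its `ddof` argument are choices made for this example.

```python
from fractions import Fraction
from itertools import product

faces = [Fraction(k) for k in range(1, 7)]      # population: one fair die
mu = sum(faces) / 6
sigma2 = sum((x - mu) ** 2 for x in faces) / 6  # population variance

def var(sample, ddof):
    """Sum of squared deviations from the sample mean, divided by n - ddof."""
    n = len(sample)
    x_bar = sum(sample) / n
    return sum((x - x_bar) ** 2 for x in sample) / (n - ddof)

samples = list(product(faces, repeat=2))        # all 36 equally likely samples of size 2
avg_s2 = sum(var(s, 1) for s in samples) / len(samples)      # divide by n-1
avg_biased = sum(var(s, 0) for s in samples) / len(samples)  # divide by n

print(sigma2, avg_s2, avg_biased)  # 35/12 35/12 35/24
```

Averaged over all samples, dividing by n-1 recovers $\sigma^2$ exactly, while dividing by n comes out too small (here by half, because n=2).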

Calculating Standard Deviation

Same dataset, but now we want to calculate standard deviation. Steps?

Calculating Standard Deviation

First calculate variance, the average (squared) distance from the mean:
- subtract $\bar x$ from each observation
- Square the result
- Take the mean of this result

Calculating Standard Deviation: Tabular Calculation

    x xbar difference squaredDifference
1:  1  5.2       -4.2             17.64
2:  5  5.2       -0.2              0.04
3: -7  5.2      -12.2            148.84
4: 12  5.2        6.8             46.24
5: 15  5.2        9.8             96.04

[1] 308.8

This is the sum of squared differences. To get the variance divide by $n=5$ if a population ($\sigma^2$), or $n-1=4$ if a sample ($s^2$)
- $\sigma^2=61.76, s^2=77.2$
Finally, squared units are weird, so take the square root to get the standard deviation ($\sigma$ or $s$)
- $\sigma=7.86, s=8.79$
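The tabular calculation can be reproduced with a short Python sketch (variable names are assumptions for this example); the standard library functions `statistics.pstdev` and `statistics.stdev` give the population and sample versions directly.

```python
import math
import statistics

x = [1, 5, -7, 12, 15]
x_bar = sum(x) / len(x)                    # 5.2
ss = sum((xi - x_bar) ** 2 for xi in x)    # sum of squared differences, about 308.8

pop_var = ss / len(x)                      # sigma^2, about 61.76
samp_var = ss / (len(x) - 1)               # s^2, about 77.2
print(round(math.sqrt(pop_var), 2), round(math.sqrt(samp_var), 2))    # 7.86 8.79

# The standard library gives the same answers.
print(round(statistics.pstdev(x), 2), round(statistics.stdev(x), 2))  # 7.86 8.79
```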

A note on higher moments

The information in the variance and standard deviation is captured in $E[X^2]$ (together with $E[X]$): $\sigma^2=E[X^2]-E[X]^2$
$E[X^3]$ captures the skewness (asymmetry) of a distribution
$E[X^4]$ captures the tail weight (kurtosis)
Knowing $E[X^n]$ for every n from 1 to $\infty$ fully specifies a distribution
- There are technically pathological exceptions to this

Random variables

Our sample statistic, e.g. $\bar x$, will change every time we use a different sample. We call this a random variable.
More specifically, each individual observation, $x_i$, is a random variable. Our sample statistic is then a transformation of this vector of $x_i$s: $\bar x = (x_1+x_2 + ... + x_n)/n$
The distribution of $\bar x$ across samples (the sampling distribution) will be different from the population distribution (each $x_i$ is just a random draw from the population distribution)

Random Variable Representation: Dice

Suppose we toss a six-sided die. How can we represent this mathematically?
We can specify all possible outcomes and their associated probabilities
outcomes: {1,2,3,4,5,6}
probabilities: {$\frac{1}{6},\frac{1}{6},\frac{1}{6},\frac{1}{6},\frac{1}{6},\frac{1}{6}$}
How to visualize?

Random Variable Visualization: Histograms

Adding random variables: Dice

A roll of a 6 sided die is a random variable. It is specified by associating a probability of $\frac{1}{6}$ to the outcomes $X=1,X=2,...X=6$
If X is the result of rolling 1 die, and Y of a second die, then $X+Y$ is the random variable for the total when rolling two dice
What is the probability that $X+Y=7$?
Calculation is called a convolution. Notice the symmetry!
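A minimal sketch of that convolution, assuming the two dice are fair and independent (the names `die` and `sum_dist` are chosen for this example): for every pair of faces, multiply the two probabilities and add the result to the running total for that sum.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

die = {face: Fraction(1, 6) for face in range(1, 7)}

# Convolution: multiply the probabilities of each pair and accumulate by x + y.
sum_dist = Counter()
for (x, px), (y, py) in product(die.items(), repeat=2):
    sum_dist[x + y] += px * py

for total in sorted(sum_dist):
    print(total, sum_dist[total])
print("P(X+Y=7) =", sum_dist[7])  # P(X+Y=7) = 1/6
```

Six of the 36 equally likely pairs add to 7, and the probabilities fall off symmetrically on either side of it.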

Adding random variables: histogram

Adding random variables: Expectation

Adding random variables is hard, but their mean is easy to calculate
Single die has mean $\frac{1+2+3+4+5+6}{6}=3.5$
Two dice ($X_1+X_2$) have mean $E[X_1]+E[X_2]=3.5+3.5=7$
Similarly, $E[X^2]$ is easy to compute
- Variance/sd can be easily calculated given both $E[X]$ and $E[X^2]$
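These moments can be computed exactly for a fair die. The sketch below uses exact fractions; the variable names are assumptions for this example.

```python
from fractions import Fraction

faces = range(1, 7)
p = Fraction(1, 6)

e_x = sum(p * x for x in faces)        # E[X]   = 7/2
e_x2 = sum(p * x**2 for x in faces)    # E[X^2] = 91/6
var_x = e_x2 - e_x**2                  # Var(X) = E[X^2] - E[X]^2 = 35/12

e_sum = e_x + e_x                      # E[X1 + X2] = E[X1] + E[X2] = 7
print(e_x, e_x2, var_x, e_sum)         # 7/2 91/6 35/12 7
```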

Adding random variables: Normal Distribution

One type of random variable lets us easily calculate a mean (and sum) of random variables:
$f(x)=\frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{1}{2}(\frac{x-\mu}{\sigma})^2}$
- This is a normal random variable with mean $\mu$ and standard deviation $\sigma$
- Written as $X \sim N(\mu,\sigma)$
If $X_1, ..., X_n$ are independent draws from $N(\mu,\sigma)$, then $\frac{X_1+X_2+...+X_n}{n} \sim N(\mu,\frac{\sigma}{\sqrt{n}})$

Normal Distribution Usage

Given a normal distribution with mean 0 and standard deviation 1, what is the probability that $-1<x<1$?
Calculate using the area under the curve from -1 to 1
Can’t do by hand. Use either a table or the function pnorm in R ($\Phi(z)$)
About 68% of the probability lies within 1 standard deviation of the mean
About 95% lies within 2 standard deviations of the mean, and 99.7% within 3 standard deviations
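For readers without R at hand, the same areas can be sketched with Python's standard library, where `NormalDist.cdf` plays the role of `pnorm` ($\Phi(z)$). The helper name `prob_within` is an assumption for this example.

```python
from statistics import NormalDist

z = NormalDist(mu=0, sigma=1)   # standard normal

def prob_within(k):
    """P(-k < Z < k): area under the curve between -k and +k."""
    return z.cdf(k) - z.cdf(-k)

for k in (1, 2, 3):
    print(k, round(prob_within(k), 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973
```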

Normal distribution plot

Interpreting Mean and SD

The following is a normal distribution with mean 1 and standard deviation 1. What will the graph look like if instead we have mean 2 and standard deviation 1/2?

Interpreting Mean and SD - Result

Random variables: Population vs Sampling: graph

https://mcfortran.shinyapps.io/sampling/

Sampling Moments

Sampling distributions have their own mean and variance (and other moments)
$E[\bar x]=\mu$
$var(\bar X) = \frac{var(X)}{n}=\frac{\sigma^2}{n}$
- $sd(\bar x)=\frac{\sigma}{\sqrt n}$
How to calculate?
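One way to answer this, sketched under the assumption that the $X_i$ are independent draws with common mean $\mu$ and variance $\sigma^2$. It uses two facts: a constant comes out of a variance squared, and variances of independent variables add.

$E[\bar X]=\frac{1}{n}\sum_{i=1}^n E[X_i]=\frac{n\mu}{n}=\mu$
$var(\bar X)=var\left(\frac{1}{n}\sum_{i=1}^n X_i\right)=\frac{1}{n^2}\sum_{i=1}^n var(X_i)=\frac{n\sigma^2}{n^2}=\frac{\sigma^2}{n}$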

Sampling Moments

As $n\to\infty$, $\bar x$ approaches a distribution with mean $\mu$ and variance 0
We can state this a bit more precisely using the central limit theorem
Caveat: we have assumed our samples are drawn independently. If observations are related to each other this is violated

Central Limit Theorem

If we have independent, identically distributed (iid) random samples from a population with finite variance, the Central Limit Theorem applies:
$\frac{\bar x-\mu}{\sigma/\sqrt{n}} \to N(0,1)$ as $n\to\infty$, so for large n, $\bar x$ is approximately $N(\mu,\frac{\sigma}{\sqrt{n}})$
Once samples get large, our sampling distribution becomes approximately normal with declining variance, regardless of the shape of our population distribution
Many consider this to be the most beautiful theorem in mathematics
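A quick simulation sketch of the theorem, with assumptions stated up front: the population is exponential with mean 1 and standard deviation 1 (strongly skewed), and the sample size, number of repetitions, and seed are arbitrary choices.

```python
import random
import statistics

rng = random.Random(42)      # fixed seed so the run is repeatable
n, reps = 50, 5000           # sample size and number of repeated samples (arbitrary)

# Skewed population: exponential with mean 1 and standard deviation 1.
mu, sigma = 1.0, 1.0
means = [statistics.fmean(rng.expovariate(1.0) for _ in range(n))
         for _ in range(reps)]

se = sigma / n ** 0.5        # sd of x-bar predicted by theory
within_1 = sum(abs(m - mu) < se for m in means) / reps

print("mean of sample means:", round(statistics.fmean(means), 3))
print("sd of sample means:  ", round(statistics.stdev(means), 3), "vs", round(se, 3))
print("share within 1 se:   ", round(within_1, 3))
```

If the theorem is doing its job, the sample means should center on $\mu$, have a spread close to $\frac{\sigma}{\sqrt n}$, and land within one standard error roughly 68% of the time, even though the population itself is far from normal.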

Central Limit Theorem - Illustration

https://mcfortran.shinyapps.io/sampling/

Central Limit Theorem - Intuition 1

When we added dice, we had natural symmetry. If our distributions are identical this will always arise
- And if they are independent then we just need to multiply the probabilities piecewise
- So i.i.d. distributions should result in this symmetry

Central Limit Theorem - Intuition 2

Moments ($E[X^n]$) had easy math. So calculate every moment and translate this back to probabilities
- This is done with a moment generating function: the Laplace transform of the probability function
- $M_X(t)=E[e^{tX}] \implies E[X^n]=\frac{\partial^n M_X(t)}{\partial t^n}\Big|_{t=0}$
  - Far beyond this course, but a common technique in graduate level math
This is how you formally prove the central limit theorem

Central Limit Theorem - Intuition 3

The normal distribution stays normal when taking averages
- The normal distribution is the only finite variance distribution with this property
- So any non-normal distribution will end up gravitating towards a normal distribution
- Formula: with the usual symmetric scaling, $e^{-x^2/2}$ is its own Fourier transform (related to the Laplace transform)
  - The Fourier transform makes convolutions easier to calculate

Cross Moments: covariance and correlation

For two random variables (or columns of data) the average product of the two, $E[XY]$, captures important information
Like variance, we subtract out the mean to make it more interpretable

Cross Moments: covariance and correlation

We define $cov(X,Y)=E[(x-\mu_x)(y-\mu_y)]$
If X and Y are both above their mean, the product is positive. If they’re both below the mean it’s also positive
If one value is above the mean and the other is below their product is negative
This then gives a measure of how closely associated X and Y are. If they move in the same direction it is positive; if they move in opposite directions it is negative. If X and Y are independent it is zero

Cross moments: correlation

Standardize further: standardize covariance to be in the range $[-1,1]$ by dividing by $\sigma_x\sigma_y$. Called correlation
$\rho\equiv cov(X,Y)/(\sigma_x\sigma_y)$
The sample statistic for $\rho$ is usually called $r$ instead of $\hat\rho$

Correlation: visualization

https://mcfortran.shinyapps.io/correlation/

Correlation: Calculation

Is $r>0, r<0,$ or, $r=0$?

Correlation: Calculation

Steps to calculate?
$cov(x,y)=E[(x-\bar x)(y-\bar y)]$:
- Calculate the mean of x and of y
- subtract the mean from each observation
- multiply the two results
- take the average
Once we have covariance, standardize to get $r$ (or $\rho$)
- $cov(x,y)/(\sigma_x\sigma_y)$
- $\sigma$ calculated as before: subtract out the mean from each observation, square it, take the average, and then take the square root

Correlation: Calculation

   x  y xbar ybar x-xbar y-ybar product
1: 1  3    3  7.6     -2   -4.6     9.2
2: 2  5    3  7.6     -1   -2.6     2.6
3: 3 10    3  7.6      0    2.4     0.0
4: 4  8    3  7.6      1    0.4     0.4
5: 5 12    3  7.6      2    4.4     8.8

cov(x,y):  4.2

   x  y xbar ybar x-xbar y-ybar product devx  devy
1: 1  3    3  7.6     -2   -4.6     9.2    4 21.16
2: 2  5    3  7.6     -1   -2.6     2.6    1  6.76
3: 3 10    3  7.6      0    2.4     0.0    0  5.76
4: 4  8    3  7.6      1    0.4     0.4    1  0.16
5: 5 12    3  7.6      2    4.4     8.8    4 19.36

SD x:  1.414214

SD y:  3.261901

r:  0.9104655
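The table above can be reproduced with a short Python sketch. Like the table, it divides by n (population formulas); dividing by n-1 throughout would change the covariance and standard deviations but leave $r$ the same. Variable names are assumptions for this example.

```python
import math

x = [1, 2, 3, 4, 5]
y = [3, 5, 10, 8, 12]
n = len(x)

x_bar, y_bar = sum(x) / n, sum(y) / n                            # 3.0 and 7.6
cov = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / n   # average product
sd_x = math.sqrt(sum((a - x_bar) ** 2 for a in x) / n)
sd_y = math.sqrt(sum((b - y_bar) ** 2 for b in y) / n)
r = cov / (sd_x * sd_y)

print(round(cov, 4), round(sd_x, 6), round(sd_y, 6), round(r, 7))
# 4.2 1.414214 3.261901 0.9104655
```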

Correlation: Observation

Once we subtract out the mean of x and y, the data are always centered at the origin of the x-y plane.
We can now calculate the contribution to covariance by looking at x*y for each point.
An observation at point (x,y) after subtracting $\bar x, \bar y$ will always contribute the same amount to the covariance
Points near y=x or y=-x have the largest impact, as do points that are further from the center

Correlation: Observation

More extreme observations (outliers) are highly influential
Points exactly on y=x may contribute little if close to (0,0), but they also contribute little to both $\sigma_x$ and $\sigma_y$, so $\rho$ may not be affected much
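To see how much one extreme point can matter, here is a sketch that reuses the five (x, y) pairs from the calculation slide and appends a single hypothetical outlier at (10, 0). The outlier and the helper name `pearson_r` are assumptions made for this example.

```python
import math

def pearson_r(xs, ys):
    """Correlation: average product of deviations divided by both standard deviations."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    cov = sum((a - x_bar) * (b - y_bar) for a, b in zip(xs, ys)) / n
    sd_x = math.sqrt(sum((a - x_bar) ** 2 for a in xs) / n)
    sd_y = math.sqrt(sum((b - y_bar) ** 2 for b in ys) / n)
    return cov / (sd_x * sd_y)

x = [1, 2, 3, 4, 5]
y = [3, 5, 10, 8, 12]
r_base = pearson_r(x, y)

# Hypothetical outlier: far from the center and near y = -x after centering.
r_outlier = pearson_r(x + [10], y + [0])

print(round(r_base, 3), round(r_outlier, 3))
```

In this made-up case the one added point is enough to turn a strongly positive correlation negative.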