Lecture 2

Administrative Miscellanea

  • Sign up for iClicker (see blackboard for link)
  • Datacamp available for free (optional but useful for learning R):
    • See Blackboard for link
    • I picked out some relevant courses that are “due” December 1st

Review: Linear Equations

  • In 1 dimension given by \(y=mx+b\)
  • x is the independent variable
  • y is the dependent variable
  • m is the slope
  • b is the y intercept
  • (-b/m) is the x intercept, but usually not important in econometrics

Linear Equations: Examples

Linear Equations: Higher Dimensions

  • In multiple dimensions will be given by \(y=\beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_n x_n\)
    • \(\beta_0\) is the intercept, y is the dependent variable
    • Now we have n independent variables, \(x_1, ..., x_n\), each with its own slope \(\beta_i\)
      • If you go on into higher classes you’ll interpret each \(x_i\) as a column vector, and \(X\) as a matrix
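As a minimal illustration of the formula above, the Python sketch below evaluates \(y\) for a single observation with two independent variables. The coefficient and input values are made up for illustration and do not come from the course data.

```python
# Hypothetical coefficients: intercept beta_0 and slopes beta_1, beta_2
beta_0 = 1.0
betas = [2.0, -3.0]

# Hypothetical values of the independent variables x_1, x_2
x = [4.0, 5.0]

# y = beta_0 + beta_1*x_1 + beta_2*x_2
y = beta_0 + sum(b * xi for b, xi in zip(betas, x))
print(y)
```

Each slope multiplies its own variable, so adding a third variable only means adding one more entry to both lists.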

Higher Dimensions: Example

Linear Equations: Question

  • What is the formula for this line?

Answer:

  • \(y = mx + b\)
  • \((0,7)\) gives y intercept: \(b=7\)
  • \((0,7) \to (1,5)\) gives slope: \(m=-2\) (we can use any 2 points by taking \(\frac{\Delta y}{\Delta x}\))
  • \(y=-2x+7\)
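The two-point calculation above can be checked with a short Python sketch. The helper name `line_through` is hypothetical and is not part of the course materials.

```python
def line_through(p1, p2):
    """Return (slope, intercept) of the line through two points with distinct x values."""
    (x1, y1), (x2, y2) = p1, p2
    m = (y2 - y1) / (x2 - x1)  # slope = change in y over change in x
    b = y1 - m * x1            # solve y1 = m*x1 + b for b
    return m, b

m, b = line_through((0, 7), (1, 5))
print(m, b)  # slope and y intercept
```

Any other pair of points on the same line gives the same slope and intercept.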

Stats Review

Data types

The names/categories aren’t important to know, but the interpretation is. This will show up frequently in class:

  • Interval: a 1 unit change is the same everywhere (dollars)
  • Ordinal: higher numbers are better, but the difference isn’t easily interpretable (e.g. 8 vs 12 vs 16 years of school)
  • Categorical: the number is just an identifier (zip code)
  • A 10% increase in revenue has a very different interpretation from a 10% increase in education or 10% “increase” in zip code. This should always be context specific.

Data structures

  • Cross-sectional: A snapshot of data at a given time. Students' grades in this classroom at the end of the semester.
  • Time Series: Study 1 unit over time. Your grade every week in this class.
  • Panel Data: Combination of the above two. A cross section of grades every week in this class.
  • There are other distinctions based on whether the same individuals are followed over time
  • These will determine the subscripts of our regression equations later in the course

Data structures: Cross Section

Attendance vs grades, simulated from anonymized data

     student  baseline attendance     grade
  1:       1 0.9507893  0.9657565 110.46481
  2:       2 0.7784678  1.0628186  80.41383
  3:       3 1.0457980  1.0900670 112.30689
  4:       4 1.1704702  0.8975915  87.35327
  5:       5 0.9979116  0.3249460  72.55599
 ---                                       
205:     205 1.0530886  1.0166823  84.04216
206:     206 1.0585210  0.4560729  82.68328
207:     207 1.0844113  0.8926517  90.41293
208:     208 1.0179260  1.2323755  74.08424
209:     209 0.6371130  0.8295373  74.87772

Data Structures: Cross Section Plot

Data Structures: Time Series

           Date    Close
  1: 2021-11-01 402.8633
  2: 2021-11-02 390.6667
  3: 2021-11-03 404.6200
  4: 2021-11-04 409.9700
  5: 2021-11-05 407.3633
 ---                    
295: 2023-01-03 108.1000
296: 2023-01-04 113.6400
297: 2023-01-05 110.3400
298: 2023-01-06 113.0600
299: 2023-01-09 119.7700

Data Structures: Time Series Screenshot

Data Structures: Panel Screenshot

Data is simulated from actual anonymized data

      student week attendance      grade Roster           TA Reason
   1:      97   11          1 24.3964171      2 ThatOtherGuy       
   2:     181   10          1 24.5029410      6      ThatGuy       
   3:      83    5          1 25.0527665      5 ThatOtherGuy       
   4:      87    1          1 21.1215230      6      ThatGuy       
   5:     157    5          1 23.6417363      2 ThatOtherGuy       
  ---                                                              
2306:     115    3          1 21.6190862      6      ThatGuy       
2307:     150    5          1  0.6772831      4      ThatGuy       
2308:      31    7          1 24.4333309      1      ThatGuy       
2309:     175    9          1 21.5432014      4      ThatGuy       
2310:     109   11          0 25.1429089      2 ThatOtherGuy       

Populations vs Samples

  • For a population, a measure of interest is called a parameter (average income in the US)
  • When data is limited we need to estimate this parameter using a sample. The value calculated from the sample is called a statistic
  • e.g. mean individual income in the US is $57,143 in 2021 (from FRED St Louis). If we randomly sample 1000 individuals we may end up getting an average of $51,000 in this specific sample. Here \(\mu=57143, \hat\mu =51000\)

Common Sample Statistics: mean

  • An individual observation of outcome x is labeled as \(x_i,\ i=1,2,...,n\)
  • You could also label them, e.g. \(income_{sam}, income_{fred}\), etc.
  • Mean: \(E[X]=\mu\) for population, \(\bar x=\frac{x_1+x_2+...+x_n}{n}\) for sample
    • E[] refers to the expectation, or average, of a random variable
  • The mean is the average, or center, of the distribution

Sample Statistics Question:

    x
1:  1
2:  5
3: -7
4: 12
5: 15
  • \(\bar x=?\)

Sample Statistics Answer

  • \(x_1=1,x_2=5,x_3=-7,x_4=12,x_5=15\)
  • \(\sum x_i = x_1+x_2+...+x_5=26\)
  • \(n=5 \implies \bar x=26/5=5.2\)
    • If this is a population we call this \(\mu=E[X]\) instead of \(\bar x\)
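The same arithmetic as a minimal Python sketch, using the five values from the question:

```python
x = [1, 5, -7, 12, 15]

total = sum(x)       # x_1 + x_2 + ... + x_5
n = len(x)           # number of observations
x_bar = total / n    # sample mean

print(total, n, x_bar)
```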

Common Sample Statistics: Variance

  • Variance: \(\sigma^2=E[(x-\mu)^2]\) for population
    • \(=\frac{(x_1-\bar x)^2+ (x_2-\bar x)^2 + ... + (x_n-\bar x)^2}{(n)}\)
    • Where \(\bar x\) was calculated as before
  • This is the average squared distance from the mean
  • This measures the spread of a distribution, rather than the center

Common Sample Statistics: Variance

  • For samples: \(s^2= \frac{(x_1-\bar x)^2+ (x_2-\bar x)^2 + ... + (x_n-\bar x)^2}{n-1}\)
  • We divide by n-1 instead of n to make the result unbiased. The difference is only noticeable for small samples.
  • Unbiased means that on average \(s^2\) equals \(\sigma^2\), i.e. \(E[s^2]=\sigma^2\) (if we repeat an experiment many times)
  • Standard deviation: \(\sigma=\sqrt{\sigma^2},\ s=\sqrt{s^2}\)
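One way to see what "unbiased" means is to list every possible sample and average the results. The Python sketch below is an illustration under an assumption: it treats the five values from the earlier example as the entire population, and enumerates all 25 ordered samples of size 2 drawn with replacement.

```python
from fractions import Fraction
from itertools import product

population = [1, 5, -7, 12, 15]  # assumption: these five values are the whole population
N = len(population)
mu = Fraction(sum(population), N)
sigma2 = sum((Fraction(v) - mu) ** 2 for v in population) / N  # population variance

def sum_sq_dev(sample):
    """Sum of squared deviations from the sample mean."""
    x_bar = Fraction(sum(sample), len(sample))
    return sum((Fraction(v) - x_bar) ** 2 for v in sample)

n = 2
samples = list(product(population, repeat=n))  # every ordered sample of size n

# Average the two candidate estimators over all possible samples
avg_div_n_minus_1 = sum(sum_sq_dev(s) / (n - 1) for s in samples) / len(samples)
avg_div_n = sum(sum_sq_dev(s) / n for s in samples) / len(samples)

print(float(sigma2), float(avg_div_n_minus_1), float(avg_div_n))
```

Dividing by n-1 averages out to exactly the population variance, while dividing by n averages out to half of it when n=2. The gap shrinks as n grows.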

Calculating Standard Deviation

    x
1:  1
2:  5
3: -7
4: 12
5: 15
  • Same dataset, but now we want to calculate standard deviation. Steps?

Calculating Standard Deviation

  • First calculate variance, the average (squared) distance from the mean:
    • subtract \(\bar x\) from each observation
    • Square the result
    • Take the mean of this result

Calculating Standard Deviation: Tabular Calculation

    x xbar difference squaredDifference
1:  1  5.2       -4.2             17.64
2:  5  5.2       -0.2              0.04
3: -7  5.2      -12.2            148.84
4: 12  5.2        6.8             46.24
5: 15  5.2        9.8             96.04
[1] 308.8
  • This is the sum of squared differences. To get the variance divide by \(n=5\) if a population (\(\sigma^2\)), or \(n-1=4\) if a sample (\(s^2\))
    • \(\sigma^2=61.76, s^2=77.2\)
  • Finally, squared units are weird, so take the square root to get the standard deviation (\(\sigma\) or \(s\))
    • \(\sigma=7.86, s=8.79\)
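The tabular calculation can be reproduced with a short Python sketch. The final two lines use Python's standard `statistics` module only as a cross-check on the hand calculation.

```python
import math
import statistics

x = [1, 5, -7, 12, 15]
n = len(x)
x_bar = sum(x) / n

# Squared difference from the mean for each observation
squared_diffs = [(xi - x_bar) ** 2 for xi in x]
ss = sum(squared_diffs)        # sum of squared differences

pop_var = ss / n               # sigma^2: divide by n for a population
samp_var = ss / (n - 1)        # s^2: divide by n-1 for a sample

pop_sd = math.sqrt(pop_var)    # sigma
samp_sd = math.sqrt(samp_var)  # s

print(round(ss, 2), round(pop_var, 2), round(samp_var, 2))
print(round(pop_sd, 2), round(samp_sd, 2))

# Cross-check against the standard library
print(statistics.pstdev(x), statistics.stdev(x))
```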

A note on higher moments

  • The information in the variance and standard deviation is captured in \(E[X^2]\).
  • \(E[X^3]\) gives the skewness of a distribution
  • \(E[X^4]\) gives the tail weight (kurtosis)
  • Knowing \(E[X], E[X^2],...,E[X^n]\) for every n from 1 to \(\infty\) fully specifies a distribution
    • There are technically pathological exceptions to this

Random variables

  • Our sample statistic, e.g. \(\bar x\) will change every time we use a different sample. We call this a random variable.
  • More specifically, each individual observation, \(x_i\) is a random variable. Our sample statistic is then a transformation of this vector of \(x_i\)s: \(\bar x = (x_1+x_2 + ... x_n)/n\)
  • The sampling distribution described above will be different from the population distribution (each \(x_i\) is just a random draw from the population distribution)

Random Variable Representation: Dice

  • Suppose we toss a six-sided die. How can we represent this event mathematically?
  • We can specify all possible outcomes and their associated probabilities
  • outcomes: {1,2,3,4,5,6}
  • probabilities: {\(\frac{1}{6},\frac{1}{6},\frac{1}{6},\frac{1}{6},\frac{1}{6},\frac{1}{6}\)}
  • How to visualize?

Random Variable Visualization: Histograms

Adding random variables: Dice

  • A roll of a 6 sided die is a random variable. It is specified by associating a probability of \(\frac{1}{6}\) to each of the outcomes \(X=1,X=2,...,X=6\)
  • If X is the outcome of a roll of 1 die, and Y of a second die, then \(X+Y\) is the random variable for the total when rolling two dice
  • What is the probability that \(X+Y=7\)?
  • Calculating the distribution of \(X+Y\) is called a convolution. Notice the symmetry!
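A minimal Python sketch of the convolution for two independent dice follows. The function name `convolve` is hypothetical, and exact fractions are used so the probabilities are not rounded.

```python
from collections import defaultdict
from fractions import Fraction

# Distribution of one fair die: {outcome: probability}
die = {face: Fraction(1, 6) for face in range(1, 7)}

def convolve(px, py):
    """Distribution of X + Y for independent X and Y given as {outcome: probability}."""
    out = defaultdict(Fraction)
    for x, p in px.items():
        for y, q in py.items():
            out[x + y] += p * q  # independence: multiply the two probabilities
    return dict(out)

two_dice = convolve(die, die)
for total in sorted(two_dice):
    print(total, two_dice[total])
```

The probabilities rise from 2 up to 7 and then fall symmetrically to 12, with \(P(X+Y=7)=\frac{1}{6}\).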

Adding random variables: histogram

Adding random variables: Expectation

  • Adding random variables is hard, but their mean is easy to calculate
  • Single die has mean \(\frac{1+2+3+4+5+6}{6}=3.5\)
  • Two dice (\(X_1+X_2\)) has mean 7
  • Similarly, \(E[X^2]\) is easy to compute
    • Variance/sd can be easily calculated given both \(E[X]\) and \(E[X^2]\)
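The sketch below computes these moments for one die and for the sum of two dice, then gets the variance from the identity \(var(X)=E[X^2]-(E[X])^2\). It is a Python illustration with exact fractions, not part of the original slides.

```python
from fractions import Fraction
from itertools import product

faces = range(1, 7)
p = Fraction(1, 6)

# Single die: E[X], E[X^2], and the variance from the two moments
e_x = sum(p * f for f in faces)
e_x2 = sum(p * f ** 2 for f in faces)
var_x = e_x2 - e_x ** 2

# Two dice: enumerate all 36 equally likely pairs and repeat the calculation for the sum
pairs = list(product(faces, repeat=2))
e_sum = Fraction(sum(a + b for a, b in pairs), len(pairs))
e_sum2 = Fraction(sum((a + b) ** 2 for a, b in pairs), len(pairs))
var_sum = e_sum2 - e_sum ** 2

print(e_x, e_x2, var_x)
print(e_sum, var_sum)
```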

Adding random variables: Normal Distribution

  • One random variable lets us easily calculate a mean (and sum) of random variables:
  • \(f(x)=\frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}\)
    • This is a normal random variable with mean \(\mu\) and standard deviation \(\sigma\)
    • Written as \(X \sim N(\mu,\sigma)\)
  • If \(X_1,...,X_n\) are independent \(N(\mu,\sigma)\), then \(\frac{X_1+X_2+...+X_n}{n} \sim N(\mu,\frac{\sigma}{\sqrt{n}})\)
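As a sanity check on the density of a normal random variable with mean \(\mu\) and standard deviation \(\sigma\), the Python sketch below writes the density by hand and compares it with `NormalDist` from Python's standard `statistics` module. The function name `normal_pdf` and the values of \(\mu\) and \(\sigma\) are illustrative assumptions.

```python
import math
from statistics import NormalDist

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma), where sigma is the standard deviation."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

mu, sigma = 2.0, 0.5  # illustrative values
reference = NormalDist(mu, sigma)

for x in (1.0, 2.0, 2.5):
    print(x, normal_pdf(x, mu, sigma), reference.pdf(x))
```

The negative sign in the exponent and the \(\frac{1}{\sigma}\) factor in front are what make the curve peak at \(\mu\) and enclose a total area of 1.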

Normal Distribution Usage

  • Given a normal distribution with mean 0 and standard deviation 1, what is the probability that \(-1<x<1\)?
  • Calculate using the area under the curve from -1 to 1
  • Can’t do by hand. Use either a table or the function pnorm in R (\(\Phi(z)\))
  • About 68% of data lies within 1 standard deviation of the mean (here, between -1 and +1)
  • About 95% of data is within 2 standard deviations of the mean, and 99.7% within 3 standard deviations
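These areas can be computed numerically. The sketch below uses `NormalDist.cdf` from Python's standard `statistics` module, which plays the same role here as the `pnorm` function in R mentioned above.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)

# P(-k < X < k) = Phi(k) - Phi(-k) for k = 1, 2, 3
probs = {k: std_normal.cdf(k) - std_normal.cdf(-k) for k in (1, 2, 3)}

for k, p in probs.items():
    print(k, round(p, 4))
```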

Normal distribution plot

Interpreting Mean and SD

The following is a normal distribution with mean 1 and standard deviation 1. What will the graph look like if instead we have mean 2 and standard deviation 1/2?

Interpreting Mean and SD - Result

Random variables: Population vs Sampling: graph

https://mcfortran.shinyapps.io/sampling/

Sampling Moments

  • Sampling distributions have their own mean and variance (and other moments)
  • \(E[\bar x]=\mu\)
  • \(var(\bar X) = \frac{var(X)}{n}=\frac{\sigma^2}{n}\)
    • \(sd(\bar x)=\frac{\sigma}{\sqrt n}\)
  • How to calculate?
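One way to check these formulas is by simulation. The Python sketch below repeatedly draws samples of rolls of a fair die, records each sample mean, and compares the mean and variance of those sample means with \(\mu\) and \(\frac{\sigma^2}{n}\). The sample size, number of repetitions, and seed are arbitrary choices for illustration.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is reproducible

n = 10          # sample size (illustrative choice)
reps = 20000    # number of repeated samples (illustrative choice)

# Population: a fair six-sided die, with mu = 3.5 and sigma^2 = 35/12
mu = 3.5
sigma2 = 35 / 12

# Draw many samples and record the sample mean of each one
sample_means = [
    statistics.fmean([random.randint(1, 6) for _ in range(n)])
    for _ in range(reps)
]

mean_of_means = statistics.fmean(sample_means)
var_of_means = statistics.pvariance(sample_means)

print(mean_of_means, mu)             # should be close to mu
print(var_of_means, sigma2 / n)      # should be close to sigma^2 / n
```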

Sampling Moments

  • As \(n\to\infty\), \(\bar x\) approaches a distribution with mean \(\mu\) and variance 0
  • We can state this a bit more precisely using the central limit theorem
  • Caveat: we have assumed our samples are drawn independently. If observations are related to each other this is violated

Central Limit Theorem

  • If we have independent, identically distributed (iid) random samples from a population with finite variance the Central Limit Theorem applies:
  • \(\sqrt{n}(\bar x-\mu) \to N(0,\sigma)\) in distribution as \(n\to\infty\), so for large n, \(\bar x\) is approximately \(N(\mu,\frac{\sigma}{\sqrt{n}})\)
  • Once samples get large our sampling distribution becomes normally distributed with declining variance, regardless of the shape of our population distribution
  • Many consider this to be the most beautiful theorem in mathematics
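The theorem can be illustrated with a small simulation. The Python sketch below draws samples from a strongly skewed population (an exponential distribution, chosen here only as an example) and checks how often the sample mean lands within 1 and 2 standard errors of \(\mu\). If the sampling distribution is close to normal, these fractions should be near 68% and 95%.

```python
import math
import random

random.seed(1)  # fixed seed so the run is reproducible

n = 50        # sample size (illustrative choice)
reps = 5000   # number of samples (illustrative choice)

# Skewed population: exponential with rate 1, so mu = 1 and sigma = 1
mu, sigma = 1.0, 1.0
se = sigma / math.sqrt(n)  # standard deviation of the sample mean

def sample_mean():
    """Mean of one sample of n independent draws from the population."""
    return sum(random.expovariate(1.0) for _ in range(n)) / n

means = [sample_mean() for _ in range(reps)]

within_1 = sum(abs(m - mu) < se for m in means) / reps
within_2 = sum(abs(m - mu) < 2 * se for m in means) / reps

print(within_1, within_2)  # compare with roughly 0.68 and 0.95
```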

Central Limit Theorem - Illustration

https://mcfortran.shinyapps.io/sampling/

Central Limit Theorem - Intuition 1

  • When we added dice, we had natural symmetry. If our distributions are identical this will always arise
    • And if they are independent then we just need to multiply the probabilities piecewise
    • So i.i.d. distributions should result in this symmetry

Central Limit Theorem - Intuition 2

  • Moments (\(E[X^n]\)) had easy math. So calculate every moment and translate this back to probabilities
    • This is done with a moment generating function: essentially the Laplace transform of the probability function
    • \(M_X(t)=E[e^{tX}] \implies E[X^n]=\frac{\partial^n M_X(t)}{\partial t^n}|_{t=0}\)
      • Far beyond this course, but a common technique in graduate level math
  • This is how you formally prove the central limit theorem

Central Limit Theorem - Intuition 3

  • The normal distribution stays normal when taking averages
    • The normal distribution is the only finite variance distribution with this property
    • So any non-normal distribution will end up gravitating towards a normal distribution
    • Formula: \(e^{-x^2}\) is (up to scaling) its own Fourier transform (related to the Laplace transform)
      • The Fourier transform makes convolutions easier to calculate

Cross Moments: covariance and correlation

  • For two random variables (or columns of data) the average product of the two, \(E[XY]\), captures important information
  • Like variance, we subtract out the mean to make it more interpretable

Cross Moments: covariance and correlation

  • We define \(cov(X,Y)=E[(x-\mu_x)(y-\mu_y)]\)
  • If X and Y are both above their mean, the product is positive. If they’re both below the mean it’s also positive
  • If one value is above the mean and the other is below their product is negative
  • This then gives a measure of how closely associated X and Y are. If they move in the same direction it is positive, opposite directions is negative. If X and Y are independent it is zero (though zero covariance does not by itself guarantee independence)

Cross moments: correlation

  • Standardize further: standardize covariance to be in the range \([-1,1]\) by dividing by \(\sigma_x\sigma_y\). Called correlation
  • \(\rho\equiv cov(X,Y)/(\sigma_x\sigma_y)\)
  • The sample statistic for \(\rho\) is usually called \(r\) instead of \(\hat\rho\)

Correlation: visualization

https://mcfortran.shinyapps.io/correlation/

Correlation: Calculation

   x  y
1: 1  3
2: 2  5
3: 3 10
4: 4  8
5: 5 12
  • Is \(r>0, r<0,\) or, \(r=0\)?

Correlation: Calculation

  • Steps to calculate?
  • \(cov(x,y)=E[(x-\bar x)(y-\bar y)]\):
    • Calculate the mean of x and of y
    • subtract the mean from each observation
    • multiply the two results
    • take the average
  • Once we have covariance, standardize to get \(r\) (or \(\rho\))
    • \(cov(x,y)/(\sigma_x\sigma_y)\)
    • \(\sigma\) calculated as before: subtract out the mean from each observation, square it, take the average, and then take the square root

Correlation: Calculation

   x  y xbar ybar x-xbar y-ybar product
1: 1  3    3  7.6     -2   -4.6     9.2
2: 2  5    3  7.6     -1   -2.6     2.6
3: 3 10    3  7.6      0    2.4     0.0
4: 4  8    3  7.6      1    0.4     0.4
5: 5 12    3  7.6      2    4.4     8.8
cov(x,y):  4.2
   x  y xbar ybar x-xbar y-ybar product devx  devy
1: 1  3    3  7.6     -2   -4.6     9.2    4 21.16
2: 2  5    3  7.6     -1   -2.6     2.6    1  6.76
3: 3 10    3  7.6      0    2.4     0.0    0  5.76
4: 4  8    3  7.6      1    0.4     0.4    1  0.16
5: 5 12    3  7.6      2    4.4     8.8    4 19.36
SD x:  1.414214 
SD y:  3.261901 
r:  0.9104655
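The table can be reproduced step by step with a short Python sketch. Like the output above, it divides by n throughout. Dividing by n-1 everywhere instead would change the covariance and standard deviations but leave \(r\) the same, because the factor cancels.

```python
import math

x = [1, 2, 3, 4, 5]
y = [3, 5, 10, 8, 12]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

# Product of deviations for each observation (the "product" column)
products = [(xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)]
cov_xy = sum(products) / n

# Standard deviations from the squared deviations (the "devx" and "devy" columns)
sd_x = math.sqrt(sum((xi - x_bar) ** 2 for xi in x) / n)
sd_y = math.sqrt(sum((yi - y_bar) ** 2 for yi in y) / n)

r = cov_xy / (sd_x * sd_y)
print(round(cov_xy, 4), round(sd_x, 6), round(sd_y, 6), round(r, 7))
```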

Correlation: Observation

  • Once we subtract out the mean of x and y, the data are always centered at the origin of the x-y plane.
  • We can now calculate the contribution to covariance by looking at x*y for each point.
  • An observation at point (x,y) after subtracting \(\bar x, \bar y\) will always contribute the same amount to the covariance
  • Points near y=x or y=-x have the largest impact, as do points that are further from the center

Correlation: Observation

  • More extreme observations (outliers) are highly influential
  • Points exactly on y=x may contribute little if close to (0,0), but they also contribute little to both \(\sigma_x\) and \(\sigma_y\), so \(\rho\) may not be affected much
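To see how much a single extreme observation can matter, the Python sketch below recomputes the correlation for the five-point example after adding one far-away point. The helper name `pearson_r` and the added point (10, 0) are hypothetical, chosen only for illustration.

```python
import math

def pearson_r(x, y):
    """Correlation computed as cov(x, y) / (sd_x * sd_y), dividing by n throughout."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    cov = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / n
    sd_x = math.sqrt(sum((a - x_bar) ** 2 for a in x) / n)
    sd_y = math.sqrt(sum((b - y_bar) ** 2 for b in y) / n)
    return cov / (sd_x * sd_y)

x = [1, 2, 3, 4, 5]
y = [3, 5, 10, 8, 12]
r_original = pearson_r(x, y)

# Hypothetical outlier, far from the rest of the data
r_with_outlier = pearson_r(x + [10], y + [0])

print(r_original, r_with_outlier)
```

In this example one added point is enough to turn a strong positive correlation into a negative one.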