Administrative Miscellanea
- Sign up for iClicker (see Blackboard for link)
- Datacamp available for free (optional but useful for learning R):
- See Blackboard for link
- I picked out some relevant courses that are “due” December 1st
Review: Linear Equations
- In 1 dimension given by \(y=mx+b\)
- x is the independent variable
- y is the dependent variable
- m is the slope
- b is the y intercept
- (-b/m) is the x intercept, but usually not important in econometrics
Linear Equations: Examples
Linear Equations: Higher Dimensions
- In multiple dimensions will be given by \(y=\beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_n x_n\)
- \(\beta_0\) is the intercept, y is the dependent variable
- Now we have n independent variables, \(x_1, ..., x_n\), each with its own slope \(\beta_i\)
- If you go on into higher classes you’ll interpret each \(x_i\) as a column vector, and \(X\) as a matrix
Higher Dimensions: Example
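A small numeric illustration of the multi-variable formula, written in Python for convenience. The coefficients and inputs are made up for this sketch and do not come from the course data.

```python
# Hypothetical intercept, slopes, and inputs, chosen only for illustration.
beta0 = 1.0
betas = [2.0, -0.5, 3.0]   # beta_1, beta_2, beta_3
xs = [1.0, 4.0, 2.0]       # x_1, x_2, x_3

# y = beta_0 + beta_1*x_1 + beta_2*x_2 + beta_3*x_3
y = beta0 + sum(b * x for b, x in zip(betas, xs))
print(y)  # 1 + 2 - 2 + 6 = 7.0
```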
Linear Equations: Question
![]()
- What is the formula for this line?
Answer:
- \(y = mx + b\)
- \((0,7)\) gives y intercept: \(b=7\)
- \((0,7) \to (1,5)\) gives slope: \(m=-2\) (we can use any 2 points by taking \(\frac{\Delta y}{\Delta x}\))
- \(y=-2x+7\)
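The same answer can be checked with a few lines of Python. This is only a sketch; the helper name `line_through` is invented here.

```python
def line_through(p1, p2):
    """Return (slope, intercept) of the line through two points."""
    (x1, y1), (x2, y2) = p1, p2
    m = (y2 - y1) / (x2 - x1)   # slope = change in y / change in x
    b = y1 - m * x1             # solve y1 = m*x1 + b for b
    return m, b

m, b = line_through((0, 7), (1, 5))
print(m, b)  # -2.0 7.0
```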
Data types
The names/categories aren’t important to know, but the interpretation is. This will show up frequently in class:
- Interval: a 1 unit change is the same everywhere (dollars)
- Ordinal: higher numbers are better, but the difference isn’t easily interpretable (e.g. 8 vs 12 vs 16 years of school)
- Categorical: the number is just an identifier (zip code)
- A 10% increase in revenue has a very different interpretation from a 10% increase in education or 10% “increase” in zip code. This should always be context specific.
Data structures
- Cross-sectional: A snapshot of data at a given time. Students' grades in this classroom at the end of the semester.
- Time Series: Study 1 unit over time. Your grade every week in this class.
- Panel Data: Combination of the above two. A cross section of grades every week in this class.
- There are other distinctions based on whether the same individuals are followed over time
- These will determine the subscripts of our regression equations later in the course
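One way to picture the three structures is by what indexes each observation. The Python sketch below uses made-up grades and invented student labels purely to show the indexing: i for individuals, t for time, and (i, t) for a panel.

```python
# Made-up grades, only to show how each structure is indexed.

# Cross section: many students (index i), one point in time.
cross_section = {"student_1": 88, "student_2": 92, "student_3": 75}

# Time series: one student, many weeks (index t).
time_series = {1: 80, 2: 85, 3: 90}

# Panel: many students over many weeks (index i and t together).
panel = {
    ("student_1", 1): 80, ("student_1", 2): 85,
    ("student_2", 1): 91, ("student_2", 2): 92,
}

print(panel[("student_1", 2)])  # grade of student_1 in week 2: 85
```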
Data structures: Cross Section
Attendance vs grades, simulated from anonymized data
student baseline attendance grade
1: 1 0.9507893 0.9657565 110.46481
2: 2 0.7784678 1.0628186 80.41383
3: 3 1.0457980 1.0900670 112.30689
4: 4 1.1704702 0.8975915 87.35327
5: 5 0.9979116 0.3249460 72.55599
---
205: 205 1.0530886 1.0166823 84.04216
206: 206 1.0585210 0.4560729 82.68328
207: 207 1.0844113 0.8926517 90.41293
208: 208 1.0179260 1.2323755 74.08424
209: 209 0.6371130 0.8295373 74.87772
Data Structures: Cross Section Plot
Data Structures: Time Series
Date Close
1: 2021-11-01 402.8633
2: 2021-11-02 390.6667
3: 2021-11-03 404.6200
4: 2021-11-04 409.9700
5: 2021-11-05 407.3633
---
295: 2023-01-03 108.1000
296: 2023-01-04 113.6400
297: 2023-01-05 110.3400
298: 2023-01-06 113.0600
299: 2023-01-09 119.7700
Data Structures: Time Series Screenshot
Data Structures: Panel Screenshot
Data is simulated from actual anonymized data
student week attendance grade Roster TA Reason
1: 97 11 1 24.3964171 2 ThatOtherGuy
2: 181 10 1 24.5029410 6 ThatGuy
3: 83 5 1 25.0527665 5 ThatOtherGuy
4: 87 1 1 21.1215230 6 ThatGuy
5: 157 5 1 23.6417363 2 ThatOtherGuy
---
2306: 115 3 1 21.6190862 6 ThatGuy
2307: 150 5 1 0.6772831 4 ThatGuy
2308: 31 7 1 24.4333309 1 ThatGuy
2309: 175 9 1 21.5432014 4 ThatGuy
2310: 109 11 0 25.1429089 2 ThatOtherGuy
Populations vs Samples
- For a population, a measure of interest is called a parameter (average income in the US)
- When data is limited we need to estimate this parameter using a sample; the estimate is called a statistic
- e.g. mean individual income in the US is $57,143 in 2021 (from FRED St Louis). If we randomly sample 1000 individuals we may end up getting an average of $51,000 in this specific sample. Here \(\mu=57143, \hat\mu =51000\)
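The distinction between \(\mu\) and \(\hat\mu\) can be seen in a short simulation. The Python sketch below uses a simulated population (not real income data); the exponential shape and the 50,000 scale are assumptions made only for illustration.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical population of 100,000 simulated "incomes" (not real data).
population = [random.expovariate(1 / 50_000) for _ in range(100_000)]
mu = statistics.fmean(population)          # parameter: uses every member

sample = random.sample(population, 1000)   # random sample of 1,000 people
mu_hat = statistics.fmean(sample)          # statistic: our estimate of mu

print(round(mu), round(mu_hat))            # close, but not identical
```

Changing the seed gives a different \(\hat\mu\) each time while \(\mu\) stays essentially fixed.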
Common Sample Statistics: mean
- An individual observation of outcome x is labeled as \(x_i,\ i=1,2,...,n\)
- You could also label them, e.g. \(income_{sam}, income_{fred}\), etc.
- Mean: \(E[X]=\mu\) for population, \(\bar x=\frac{x_1+x_2+...+x_n}{n}\) for sample
- E[] refers to the expectation, or average, of a random variable
- The mean is the average, or center, of the distribution
Sample Statistics Question:
x
1: 1
2: 5
3: -7
4: 12
5: 15
Sample Statistics Answer
- \(x_1=1,x_2=5,x_3=-7,x_4=12,x_5=15\)
- \(\sum x_i = x_1+x_2+...+x_5=26\)
- \(n=5 \implies \bar x=26/5=5.2\)
- If this is a population we call this \(\mu=E[X]\) instead of \(\bar x\)
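The arithmetic above, as a minimal Python check:

```python
x = [1, 5, -7, 12, 15]

total = sum(x)      # 26
n = len(x)          # 5
x_bar = total / n   # sample mean
print(x_bar)        # 5.2
```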
Common Sample Statistics: Variance
- Variance: \(\sigma^2=E[(x-\mu)^2]\) for population
- \(=\frac{(x_1-\mu)^2+ (x_2-\mu)^2 + ... + (x_n-\mu)^2}{n}\) for a population with n members
- Where \(\mu\) is the population mean, calculated as before
- This is the average squared distance from the mean
- This measures the spread of a distribution, rather than the center
Common Sample Statistics: Variance
- For samples: \(s^2= \frac{(x_1-\bar x)^2+ (x_2-\bar x)^2 + ... + (x_n-\bar x)^2}{n-1}\)
- We divide by n-1 instead of n to make the result unbiased. The difference only matters for small samples.
- Unbiased means that on average \(s^2\) equals \(\sigma^2\), i.e. \(E[s^2]=\sigma^2\) (if we repeat an experiment many times)
- Standard deviation: \(\sigma=\sqrt{\sigma^2},\ s=\sqrt{s^2}\)
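The claim about unbiasedness can be checked by simulation. The Python sketch below assumes a standard normal population (true variance 1) and repeats a small experiment many times; the sample size and number of repetitions are arbitrary choices.

```python
import random

random.seed(1)
n, reps = 5, 20_000     # small samples, many repeated experiments

def sum_sq_dev(xs):
    """Sum of squared deviations from the sample mean."""
    x_bar = sum(xs) / len(xs)
    return sum((x - x_bar) ** 2 for x in xs)

avg_div_n = 0.0          # average of the "divide by n" estimator
avg_div_n_minus_1 = 0.0  # average of the "divide by n-1" estimator
for _ in range(reps):
    xs = [random.gauss(0, 1) for _ in range(n)]
    ss = sum_sq_dev(xs)
    avg_div_n += ss / n / reps
    avg_div_n_minus_1 += ss / (n - 1) / reps

# Dividing by n lands near (n-1)/n = 0.8; dividing by n-1 lands near 1.
print(round(avg_div_n, 2), round(avg_div_n_minus_1, 2))
```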
Calculating Standard Deviation
x
1: 1
2: 5
3: -7
4: 12
5: 15
- Same dataset, but now we want to calculate standard deviation. Steps?
Calculating Standard Deviation
- First calculate variance, the average (squared) distance from the mean:
- subtract \(\bar x\) from each observation
- Square the result
- Take the mean of this result
Calculating Standard Deviation: Tabular Calculation
x xbar difference squaredDifference
1: 1 5.2 -4.2 17.64
2: 5 5.2 -0.2 0.04
3: -7 5.2 -12.2 148.84
4: 12 5.2 6.8 46.24
5: 15 5.2 9.8 96.04
- The squared differences sum to 308.8. To get the variance divide by \(n=5\) if a population (\(\sigma^2\)), or \(n-1=4\) if a sample (\(s^2\))
- \(\sigma^2=61.76, s^2=77.2\)
- Finally, squared units are weird, so take the square root to get the standard deviation: \(\sigma\approx 7.86\) or \(s\approx 8.79\)
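The tabular calculation can be reproduced step by step. This is a plain-Python sketch of the same arithmetic; variable names are chosen here for readability.

```python
import math

x = [1, 5, -7, 12, 15]
n = len(x)
x_bar = sum(x) / n

ss = sum((xi - x_bar) ** 2 for xi in x)   # sum of squared differences
pop_var = ss / n                          # sigma^2: treat data as a population
samp_var = ss / (n - 1)                   # s^2: treat data as a sample

print(round(ss, 2), round(pop_var, 2), round(samp_var, 2))  # 308.8 61.76 77.2

pop_sd = math.sqrt(pop_var)     # about 7.86
samp_sd = math.sqrt(samp_var)   # about 8.79
```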
A note on higher moments
- The information in the variance and standard deviation is captured in \(E[X^2]\).
- \(E[X^3]\) gives the skewness of a distribution
- \(E[X^4]\) gives the tail weight (kurtosis)
- Knowing \(E[X], E[X^2], ..., E[X^n]\) for every n from 1 to \(\infty\) fully specifies a distribution
- There are technically pathological exceptions to this
Random variables
- Our sample statistic, e.g. \(\bar x\), will change every time we use a different sample. We call this a random variable.
- More specifically, each individual observation \(x_i\) is a random variable. Our sample statistic is then a transformation of this vector of \(x_i\)s: \(\bar x = (x_1+x_2 + ... + x_n)/n\)
- The sampling distribution described above will be different from the population distribution (each \(x_i\) is just a random draw from the population distribution)
Random Variable Representation: Dice
- Suppose we toss a six-sided die. How can we represent this event mathematically?
- We can specify all possible outcomes and their associated probabilities
- outcomes: {1,2,3,4,5,6}
- probabilities: {\(\frac{1}{6},\frac{1}{6},\frac{1}{6},\frac{1}{6},\frac{1}{6},\frac{1}{6}\)}
- How to visualize?
Random Variable Visualization: Histograms
Adding random variables: Dice
- A roll of a 6-sided die is a random variable. It is specified by associating a probability of \(\frac{1}{6}\) to each of the outcomes \(X=1,X=2,...,X=6\)
- If X is the result of rolling one die, and Y of a second die, then \(X+Y\) is the random variable for the total when rolling two dice
- What is the probability that \(X+Y=7\)?
- Calculation is called a convolution. Notice the symmetry!
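The convolution for two dice can be written out directly: loop over every pair of outcomes, multiply the probabilities (this uses independence of the two dice), and add the result to the total for that sum. A Python sketch using exact fractions:

```python
from collections import defaultdict
from fractions import Fraction

die = {face: Fraction(1, 6) for face in range(1, 7)}

total = defaultdict(Fraction)   # probability of each possible sum, starts at 0
for x, px in die.items():
    for y, py in die.items():
        total[x + y] += px * py   # independent dice: multiply probabilities

print(total[7])               # 1/6
print(total[2] == total[12])  # True: the distribution is symmetric around 7
```

Six of the 36 equally likely pairs sum to 7, which is where the 1/6 comes from.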
Adding random variables: histogram
Adding random variables: Expectation
- Adding random variables is hard, but their mean is easy to calculate
- Single die has mean \(\frac{1+2+3+4+5+6}{6}=3.5\)
- Two dice (\(X_1+X_2\)) has mean 7
- Similarly, \(E[X^2]\) is easy to compute
- Variance/sd can be easily calculated given both \(E[X]\) and \(E[X^2]\)
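These moments are quick to verify for a single die. The Python sketch below uses exact fractions and the identity \(var(X)=E[X^2]-E[X]^2\).

```python
from fractions import Fraction

faces = range(1, 7)
p = Fraction(1, 6)                      # each face is equally likely

e_x = sum(p * k for k in faces)         # E[X]   = 7/2
e_x2 = sum(p * k**2 for k in faces)     # E[X^2] = 91/6
var_x = e_x2 - e_x**2                   # var(X) = E[X^2] - E[X]^2 = 35/12

print(e_x, e_x2, var_x)   # 7/2 91/6 35/12
print(2 * e_x)            # mean of the sum of two dice: 7
```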
Adding random variables: Normal Distribution
- One random variable lets us easily calculate a mean (and sum) of random variables:
- \(f(x)=\frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}\)
- This is a normal random variable with mean \(\mu\) and standard deviation \(\sigma\)
- Written as \(X \sim N(\mu,\sigma)\)
- For independent \(X_i \sim N(\mu,\sigma)\): \(\frac{X_1+X_2+...+X_n}{n} \sim N(\mu,\frac{\sigma}{\sqrt{n}})\)
Normal Distribution Usage
- Given a normal distribution with mean 0 and standard deviation 1, what is the probability that \(-1<x<1\)?
- Calculate using the area under the curve from -1 to 1
- Can’t do by hand. Use either a table or the function pnorm in R (\(\Phi(z)\))
- About 68% of data lies within 1 standard deviation of the mean
- About 95% of data is within 2 standard deviations of the mean, and about 99.7% within 3 standard deviations
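These areas can be computed from the normal CDF \(\Phi\). The course uses `pnorm` in R for this; the sketch below does the same calculation with Python's standard-library `statistics.NormalDist`, whose `cdf` method plays the role of \(\Phi\).

```python
from statistics import NormalDist

z = NormalDist(mu=0, sigma=1)   # standard normal

# P(-k < X < k) = Phi(k) - Phi(-k) for k = 1, 2, 3 standard deviations
within = {k: z.cdf(k) - z.cdf(-k) for k in (1, 2, 3)}

for k, prob in within.items():
    print(k, round(prob, 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973
```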
Normal distribution plot
Interpreting Mean and SD
The following is a normal distribution with mean 1 and standard deviation 1. What will the graph look like if instead we have mean 2 and standard deviation 1/2?
Interpreting Mean and SD - Result
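A numeric check of the comparison, as a Python sketch: moving the mean from 1 to 2 shifts the center of the curve to the right, and halving the standard deviation makes it narrower and, because the area must still equal 1, twice as tall at its peak.

```python
from statistics import NormalDist

original = NormalDist(mu=1, sigma=1)
changed = NormalDist(mu=2, sigma=0.5)

peak_original = original.pdf(1)   # height of each curve at its own mean
peak_changed = changed.pdf(2)

print(round(peak_original, 3), round(peak_changed, 3))  # 0.399 0.798
```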
Random variables: Population vs Sampling: graph
https://mcfortran.shinyapps.io/sampling/
Sampling Moments
- Sampling distributions have their own mean and variance (and other moments)
- \(E[\bar x]=\mu\)
- \(var(\bar X) = \frac{var(X)}{n}=\frac{\sigma^2}{n}\)
- \(sd(\bar x)=\frac{\sigma}{\sqrt n}\)
- How to calculate?
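One way to see these formulas at work is to simulate. The Python sketch below assumes a fair six-sided die as the population (so \(\mu=3.5\) and \(\sigma=\sqrt{35/12}\)), draws many samples of size 25, and compares the spread of the sample means with \(\sigma/\sqrt n\). The sample size and repetition count are arbitrary.

```python
import random
import statistics

random.seed(2)
n, reps = 25, 10_000
mu = 3.5
sigma = (35 / 12) ** 0.5          # sd of a single fair die

# One sample mean per repeated experiment
means = [
    statistics.fmean([random.randint(1, 6) for _ in range(n)])
    for _ in range(reps)
]

mean_of_means = statistics.fmean(means)   # should be near mu
sd_of_means = statistics.pstdev(means)    # should be near sigma / sqrt(n)
predicted_sd = sigma / n ** 0.5

print(round(mean_of_means, 2), round(sd_of_means, 3), round(predicted_sd, 3))
```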
Sampling Moments
- As \(n\to\infty\), \(\bar x\) approaches a distribution with mean \(\mu\) and variance 0
- We can state this a bit more precisely using the central limit theorem
- Caveat: we have assumed our samples are drawn independently. If observations are related to each other this is violated
Central Limit Theorem
- If we have independent, identically distributed (iid) random samples from a population with finite variance the Central Limit Theorem applies:
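- \(\sqrt{n}\,\frac{\bar x-\mu}{\sigma}\to N(0,1)\) as \(n\to\infty\); equivalently, for large n, \(\bar x\) is approximately \(N(\mu,\frac{\sigma}{\sqrt{n}})\)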
- Once samples get large our sampling distribution becomes normally distributed with declining variance, regardless of the shape of our population distribution
- Many consider this to be the most beautiful theorem in mathematics
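A simulation sketch of the theorem in Python, under assumptions chosen for illustration: the population is exponential with rate 1 (strongly skewed, with mean 1 and standard deviation 1), and each sample has 50 observations. If the standardized sample means are close to \(N(0,1)\), about 68% of them should fall between -1 and 1.

```python
import random
import statistics

random.seed(3)
n, reps = 50, 10_000
mu, sigma = 1.0, 1.0   # mean and sd of an exponential population with rate 1

z_scores = []
for _ in range(reps):
    sample = [random.expovariate(1.0) for _ in range(n)]
    x_bar = statistics.fmean(sample)
    z_scores.append((x_bar - mu) / (sigma / n ** 0.5))   # standardize x_bar

share_within_1 = sum(abs(z) < 1 for z in z_scores) / reps
print(round(share_within_1, 2))   # close to the normal benchmark of about 0.68
```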
Central Limit Theorem - Illustration
https://mcfortran.shinyapps.io/sampling/
Central Limit Theorem - Intuition 1
- When we added dice, we had natural symmetry. If our distributions are identical this will always arise
- And if they are independent then we just need to multiply the probabilities piecewise
- So i.i.d. distributions should result in this symmetry
Central Limit Theorem - Intuition 2
- Moments (\(E[X^n]\)) had easy math. So calculate every moment and translate this back to probabilities
- This is done with a moment generating function: the Laplace transform of the probability function
- \(M_X(t)=E[e^{tX}] \implies E[X^n]=\left.\frac{d^n M_X(t)}{dt^n}\right|_{t=0}\)
- Far beyond this course, but a common technique in graduate level math
- This is how you formally prove the central limit theorem
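To make the derivative formula concrete, the Python sketch below builds the moment generating function of a fair die and approximates its first two derivatives at \(t=0\) with finite differences. The step size `h` is an arbitrary small number; the results should match \(E[X]=3.5\) and \(E[X^2]=91/6\approx 15.167\).

```python
import math

def mgf_die(t):
    """M_X(t) = E[e^{tX}] for a fair six-sided die."""
    return sum(math.exp(t * k) for k in range(1, 7)) / 6

h = 1e-4  # small step for numerical differentiation

# Central differences approximate the derivatives at t = 0
first = (mgf_die(h) - mgf_die(-h)) / (2 * h)                   # M'(0)  = E[X]
second = (mgf_die(h) - 2 * mgf_die(0) + mgf_die(-h)) / h ** 2  # M''(0) = E[X^2]

print(round(first, 3), round(second, 3))  # 3.5 15.167
```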
Central Limit Theorem - Intuition 3
- The normal distribution stays normal when taking averages
- The normal distribution is the only finite variance distribution with this property
- So any non-normal distribution will end up gravitating towards a normal distribution
- Formula: the Gaussian \(e^{-x^2/2}\) is its own Fourier transform, up to the normalization convention (related to the Laplace transform)
- The Fourier transform makes convolutions easier to calculate
Cross Moments: covariance and correlation
- For two random variables (or columns of data) the average product of the two, \(E[XY]\), captures important information
- Like variance, we subtract out the mean to make it more interpretable
Cross Moments: covariance and correlation
- We define \(cov(X,Y)=E[(x-\mu_x)(y-\mu_y)]\)
- If X and Y are both above their mean, the product is positive. If they’re both below the mean it’s also positive
- If one value is above the mean and the other is below their product is negative
- This then gives a measure of how closely associated X and Y are. If they move in the same direction it is positive; in opposite directions it is negative. If X and Y are independent it is zero (though zero covariance does not by itself imply independence)
Cross moments: correlation
- Standardize further: standardize covariance to be in the range \([-1,1]\) by dividing by \(\sigma_x\sigma_y\). Called correlation
- \(\rho\equiv cov(X,Y)/(\sigma_x\sigma_y)\)
- The sample statistic for \(\rho\) is usually called \(r\) instead of \(\hat\rho\)
Correlation: visualization
https://mcfortran.shinyapps.io/correlation/
Correlation: Calculation
x y
1: 1 3
2: 2 5
3: 3 10
4: 4 8
5: 5 12
![]()
- Is \(r>0\), \(r<0\), or \(r=0\)?
Correlation: Calculation
- Steps to calculate?
- \(cov(x,y)=E[(x-\bar x)(y-\bar y)]\):
- Calculate the mean of x and of y
- subtract the mean from each observation
- multiply the two results
- take the average
- Once we have covariance, standardize to get \(r\) ( or \(\rho\))
- \(cov(x,y)/(\sigma_x\sigma_y)\)
- \(\sigma\) calculated as before: subtract out the mean from each observation, square it, take the average, and then take the square root
Correlation: Calculation
x y xbar ybar x-xbar y-ybar product
1: 1 3 3 7.6 -2 -4.6 9.2
2: 2 5 3 7.6 -1 -2.6 2.6
3: 3 10 3 7.6 0 2.4 0.0
4: 4 8 3 7.6 1 0.4 0.4
5: 5 12 3 7.6 2 4.4 8.8
x y xbar ybar x-xbar y-ybar product devx devy
1: 1 3 3 7.6 -2 -4.6 9.2 4 21.16
2: 2 5 3 7.6 -1 -2.6 2.6 1 6.76
3: 3 10 3 7.6 0 2.4 0.0 0 5.76
4: 4 8 3 7.6 1 0.4 0.4 1 0.16
5: 5 12 3 7.6 2 4.4 8.8 4 19.36
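The tables stop just short of the final number, so here is a Python sketch that finishes the calculation from the same columns. The product column sums to 21, devx to 10, and devy to 53.2; because the same divisor (n or n-1) appears in the covariance and in both variances, it cancels and the sums can be used directly.

```python
import math

x = [1, 2, 3, 4, 5]
y = [3, 5, 10, 8, 12]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n   # 3.0 and 7.6

sum_prod = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))  # "product" column
sum_devx = sum((xi - x_bar) ** 2 for xi in x)                        # "devx" column
sum_devy = sum((yi - y_bar) ** 2 for yi in y)                        # "devy" column

# r = cov(x, y) / (sd_x * sd_y); the 1/n factors cancel
r = sum_prod / math.sqrt(sum_devx * sum_devy)
print(round(r, 3))  # 0.91
```

So \(r\approx 0.91\): a strong positive correlation, consistent with the upward pattern in the data.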
Correlation: Observation
- Once we subtract out the mean of x and y, the data are centered on the origin of the x-y plane.
- We can now calculate the contribution to covariance by looking at x*y for each point.
- An observation at point (x,y) after subtracting \(\bar x, \bar y\) will always contribute the same amount to the covariance
- Points near y=x or y=-x have the largest impact, as do points that are farther from the center
Correlation: Observation
![]()
- More extreme observations (outliers) are highly influential
- Points exactly on y=x may contribute little if they are close to (0,0), but they also contribute little to both \(\sigma_x\) and \(\sigma_y\), so \(\rho\) may not be affected much
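The influence of a single extreme observation can be checked numerically. The Python sketch below reuses the five-point dataset from the correlation calculation and adds one hypothetical outlier at (10, 0), a point invented for this illustration. The helper name `corr` is also made up here.

```python
import math

def corr(x, y):
    """Correlation from sums of centered products (the 1/n factors cancel)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    s_xx = sum((a - x_bar) ** 2 for a in x)
    s_yy = sum((b - y_bar) ** 2 for b in y)
    return s_xy / math.sqrt(s_xx * s_yy)

x = [1, 2, 3, 4, 5]
y = [3, 5, 10, 8, 12]
r_without = corr(x, y)

# Add one hypothetical outlier far from the center of the data
r_with = corr(x + [10], y + [0])

print(round(r_without, 2), round(r_with, 2))  # the sign of r flips
```

A single far-away point is enough to turn a strongly positive correlation negative in this small dataset.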