Review: Linear Equations
- In 1 dimension given by \(y=mx+b\)
- x is the independent variable
- y is the dependent variable
- m is the slope
- b is the y intercept
- (-b/m) is the x intercept, but usually not important in econometrics
Linear Equations: Examples
[Figure: graph of a line]
- y intercept?
- Slope?
- Equation?
Slope and intercept interpretation
You rent a car. You estimate the cost per day to rent and operate the car is given by the equation
- \(Price = 30 + 0.2 * miles\)
Interpret the slope and y intercept in words.
NOTE: In this class, “interpret” always means a plain-language explanation of a formula or coefficient that uses the numeric values directly and that a (reasonably) normal person can understand.
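As a minimal sketch of this kind of interpretation (written in Python as an assumption, since the slides themselves contain no code; `rental_price` is a hypothetical helper name), the intercept is the cost of a day with zero miles and the slope is the extra cost of one more mile:

```python
def rental_price(miles):
    """Daily cost from the slide's equation: Price = 30 + 0.2 * miles."""
    return 30 + 0.2 * miles

# Intercept: the cost of a day with zero miles driven.
intercept = rental_price(0)

# Slope: the extra cost of driving one additional mile.
slope = rental_price(101) - rental_price(100)

print(f"Cost with 0 miles: ${intercept:.2f}")
print(f"Extra cost per mile: ${slope:.2f}")
```

In words: renting the car costs $30 per day before driving anywhere, and each mile driven adds 20 cents.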
Linear Equations: Higher Dimensions
- In multiple dimensions will be given by \(y=\beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_k x_k\)
- \(\beta_0\) is the intercept, y is the dependent variable
- Now we have k independent variables, \(x_1, ..., x_k\), each with its own slope \(\beta_j\)
(Optional) Matrix multiplication Example
\(\begin{bmatrix} y_1 \\ y_2 \\ y_3 \end{bmatrix} = \begin{bmatrix}1 & x_{11} & x_{21} \\ 1 & x_{12} & x_{22}\\ 1 & x_{13} & x_{23}\end{bmatrix} * \begin{bmatrix}\beta_0 \\ \beta_1 \\ \beta_2\end{bmatrix}\)
\(\begin{bmatrix}y_1=\beta_0 + \beta_1 x_{11} + \beta_2 x_{21} \\ y_2 = \beta_0 + \beta_1 x_{12} + \beta_2 x_{22} \\ y_3 = \beta_0 + \beta_1 x_{13} + \beta_2 x_{23}\end{bmatrix}\)
where, e.g. \(x_{21}\) is \(x_2\) for individual 1
Dropping subscripts: \(y=\beta_0+\beta_1x_1+\beta_2x_2\)
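The matrix form above can be sketched in plain Python. The coefficient and data values below are hypothetical numbers chosen only for illustration; each row of `X` holds a constant 1 followed by one individual's \(x_1\) and \(x_2\):

```python
# Hypothetical numbers chosen only for illustration.
beta = [1.0, 2.0, -0.5]          # beta_0, beta_1, beta_2
X = [
    [1, 3.0, 4.0],               # individual 1: constant, x_1, x_2
    [1, 0.0, 2.0],               # individual 2
    [1, 5.0, 1.0],               # individual 3
]

# Each y_i is the dot product of row i of X with the coefficient vector,
# i.e. y_i = beta_0 + beta_1 * x_1 + beta_2 * x_2 for that individual.
y = [sum(x_ij * b_j for x_ij, b_j in zip(row, beta)) for row in X]

for i, y_i in enumerate(y, start=1):
    print(f"y_{i} = {y_i}")
```

The column of ones is what lets the intercept \(\beta_0\) be handled by the same multiplication as the slopes.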
Higher Dimensions: Example
Linear Equations: Question
[Figure: graph of a line]
- What is the formula for this line?
Data types
The names/categories aren’t important to know, but the interpretation is. This will show up frequently in class:
- Interval: a 1 unit change is the same everywhere (dollars)
- Ordinal: higher numbers are better, but the difference isn’t easily interpretable (e.g. you code 0 for a HS dropout, 1 for HS degree, 2 for some college, 3 for bachelor’s degree, etc)
- Categorical: the number is just an identifier with no numeric meaning (zip code)
- A 10% increase in revenue has a very different interpretation from a 10% increase in education or zip code.
Data types - a binary variable
You have data on employees at your company. You find the following average relationship between gender (0 for male and 1 for female) and height (in cm):
\(height = 176 - 14*gender\)
Interpret the slope of -14 in this context
Data structures
- Cross-sectional: A snapshot of data at a given time. Students’ grades in this classroom at end of semester.
- Time Series: Study 1 unit over time. Your grade every week in this class.
- Panel Data: Combination of the above two. Each student’s grade every week.
- There are other distinctions based on whether the same individuals are followed over time
- These will determine the subscripts of our regression equations later in the course
Data structures: Cross Section
Attendance vs grades, simulated from anonymized data
student baseline attendance grade
1: 1 0.9507893 0.9657565 110.46481
2: 2 0.7784678 1.0628186 80.41383
3: 3 1.0457980 1.0900670 112.30689
4: 4 1.1704702 0.8975915 87.35327
5: 5 0.9979116 0.3249460 72.55599
---
205: 205 1.0530886 1.0166823 84.04216
206: 206 1.0585210 0.4560729 82.68328
207: 207 1.0844113 0.8926517 90.41293
208: 208 1.0179260 1.2323755 74.08424
209: 209 0.6371130 0.8295373 74.87772
Data Structures: Cross Section Plot
Data Structures: Time Series
Date Close
1: 2021-11-01 402.8633
2: 2021-11-02 390.6667
3: 2021-11-03 404.6200
4: 2021-11-04 409.9700
5: 2021-11-05 407.3633
---
295: 2023-01-03 108.1000
296: 2023-01-04 113.6400
297: 2023-01-05 110.3400
298: 2023-01-06 113.0600
299: 2023-01-09 119.7700
Data Structures: Time Series Screenshot
Data Structures: Panel Screenshot
Data is simulated from actual anonymized data
student week attendance grade Roster TA Reason
1: 147 9 0 24.21382 6 ThatGuy Canceled
2: 110 5 1 24.58975 2 ThatOtherGuy
3: 209 2 1 20.23604 1 ThatGuy
4: 50 10 1 24.95760 5 ThatOtherGuy
5: 165 10 0 24.43024 6 ThatGuy
---
2306: 82 8 1 18.87512 4 ThatGuy
2307: 127 2 1 23.49707 2 ThatOtherGuy
2308: 122 10 1 24.45239 3 ThatOtherGuy
2309: 6 7 0 25.04620 5 ThatOtherGuy
2310: 92 2 0 23.28823 1 ThatGuy
Panel Data
Populations vs Samples
- When conducting research we need to define the population we’re interested in
- For average salary by educational attainment, do we include 16 year olds? 70 year olds? Unemployed individuals? Part-time?
- For a population, a measure of interest is called a parameter (average income in the US)
- When data is limited we need to estimate this parameter using a sample; the estimate is called a statistic
Populations vs Samples
- Mean individual income in the US was $57,143 in 2021 (from FRED St Louis).
- If we randomly sample 1000 individuals we may end up getting an average of $51,000 in this specific sample.
- Here \(\mu=57143, \hat\mu =51000\)
- A hat is used to indicate an estimate
- Parameters are true population values; statistics are estimates computed from a sample
Common Sample Statistics: mean
- An individual observation of outcome x is labeled as \(x_i,\ i=1,2,...,n\)
- You could also label them, e.g. \(income_{sam}, income_{fred}\), etc.
- Mean: \(E[X]=\mu\) for population, \(\bar x=\frac{x_1+x_2+...+x_n}{n}\) for sample
- E[] refers to the expectation, or weighted average, of a random variable
- The mean is the average, or center, of the distribution
Sample Statistics Question:
x
1: 1
2: 5
3: -7
4: 12
5: 15
Common Sample Statistics: Variance
- Variance: \(\sigma^2=E[(x-\mu)^2]\) for population
- \(=\frac{(x_1-\bar x)^2+ (x_2-\bar x)^2 + ... + (x_n-\bar x)^2}{n}\)
- Where \(\bar x\) was calculated as before
- This is the average squared distance from the mean
- This measures the dispersion, or spread, of a distribution, rather than the center
Common Sample Statistics: Variance
- For samples: \(s^2= \frac{(x_1-\bar x)^2+ (x_2-\bar x)^2 + ... + (x_n-\bar x)^2}{n-1}\)
- We divide by n-1 instead of n to make the result unbiased. This only matters for small samples.
- Unbiased means that on average \(s^2\) equals \(\sigma^2\), i.e. \(E[s^2]=\sigma^2\) (if we repeat an experiment many times)
- Standard deviation: \(\sigma=\sqrt{\sigma^2},\ s=\sqrt{s^2}\)
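The unbiasedness claim can be checked exactly in a small case. The sketch below is an illustration, not part of the original slides: it assumes a fair six-sided die as the population and an arbitrary sample size of 3, enumerates every equally likely sample, and averages the two candidate variance formulas using exact fractions.

```python
from fractions import Fraction
from itertools import product

faces = [1, 2, 3, 4, 5, 6]
n = 3  # sample size (an arbitrary small choice for illustration)

# Population variance of one fair die roll.
mu = Fraction(sum(faces), len(faces))
sigma2 = sum((x - mu) ** 2 for x in faces) / len(faces)

# Every equally likely sample of n rolls (6**n of them).
samples = list(product(faces, repeat=n))

def sum_sq_dev(sample):
    """Sum of squared deviations from the sample mean."""
    xbar = Fraction(sum(sample), len(sample))
    return sum((x - xbar) ** 2 for x in sample)

# Average each estimator over all possible samples.
avg_div_n_minus_1 = sum(sum_sq_dev(s) / (n - 1) for s in samples) / len(samples)
avg_div_n = sum(sum_sq_dev(s) / n for s in samples) / len(samples)

print("population variance:           ", sigma2)
print("average when dividing by n - 1:", avg_div_n_minus_1)
print("average when dividing by n:    ", avg_div_n)
```

Dividing by n - 1 reproduces the population variance on average, while dividing by n comes out too small by a factor of (n - 1)/n.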
Calculating Standard Deviation
x
1: 1
2: 5
3: -7
4: 12
5: 15
- Same dataset, but now we want to calculate standard deviation. Steps?
Calculating Standard Deviation
- First calculate variance, the average (squared) distance from the mean:
- subtract \(\bar x\) from each observation
- Square the result
- Take the mean of this result
Calculating Standard Deviation: Tabular Calculation
x xbar difference squaredDifference
1: 1 5.2 -4.2 17.64
2: 5 5.2 -0.2 0.04
3: -7 5.2 -12.2 148.84
4: 12 5.2 6.8 46.24
5: 15 5.2 9.8 96.04
- The squaredDifference column sums to 308.8, the sum of squared differences. To get the variance divide by \(n=5\) if a population (\(\sigma^2\)), or by \(n-1=4\) if a sample (\(s^2\))
- \(\sigma^2=61.76, s^2=77.2\)
Calculating Standard Deviation: Tabular Calculation
- Finally, squared units are weird, so take the square root to get the standard deviation (\(\sigma\) or \(s\))
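The tabular calculation can be double-checked with Python's standard `statistics` module (an assumption of convenience; the course itself appears to use R). `pvariance` and `pstdev` divide by n, while `variance` and `stdev` divide by n - 1:

```python
import statistics

x = [1, 5, -7, 12, 15]

xbar = statistics.mean(x)

# Population versions divide the sum of squared differences by n.
pop_var = statistics.pvariance(x)
pop_sd = statistics.pstdev(x)

# Sample versions divide by n - 1.
samp_var = statistics.variance(x)
samp_sd = statistics.stdev(x)

print("mean:", xbar)
print("population variance and sd:", pop_var, round(pop_sd, 3))
print("sample variance and sd:    ", samp_var, round(samp_sd, 3))
```

The standard deviations are back in the original units of x, which is why they are usually reported instead of the variance.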
A note on higher moments
- The information in the variance and standard deviation is captured in \(E[X^2]\), the second moment.
- \(E[X^3]\) gives the skewness of a distribution
- \(E[X^4]\) gives the tail weight (kurtosis)
- Knowing \(E[X^n]\) for every n from 1 to \(\infty\) fully specifies any “well-behaved” distribution
- Analogous to Taylor’s theorem
Random Variables
- A random variable is used to represent a random event
- The variable itself is typically denoted as a capital letter, and the outcome a lowercase one
- \(X\) is the roll of a die. \(X=5\) means a specific roll was a 5. \(X=x\) means a specific roll was \(x\)
- Random Variable Operations:
- Measure the probability of a specific event: \(P(X=5)\)
- Measure the average value \(E[X]\)
- Transform them: \(X^2\) is the value of a squared die roll. \(X+Y\) is the sum of 2 dice
Random Variable Representation: Dice
- Suppose we toss a six-sided die. How can we represent this event mathematically?
- We can specify all possible outcomes and their associated probabilities
- outcomes: {1,2,3,4,5,6}
- probabilities: {\(\frac{1}{6},\frac{1}{6},\frac{1}{6},\frac{1}{6},\frac{1}{6},\frac{1}{6}\)}
- How to visualize?
Random Variable Visualization: Histograms
Discrete Random Variables
- If a random variable has a countable number of outcomes (like a dice roll or a coin flip) it is called discrete
- The function that generates the histogram is called the probability mass function, \(p(x)\). \(\sum p(x) = 1\)
- Two events are independent if their joint probability is the product of their individual probabilities
- \(P(X \cap Y) = P(X) * P(Y)\)
- Means that knowing the value of one random variable gives us no information about the other
- Extremely strong assumption!
Adding random variables: Dice
- A roll of a 6 sided die is a random variable with \(p(x)=\frac{1}{6},\ x \in \{1,2,...,6\}\)
- If X is the roll of one die, and Y the roll of a second die, then \(X+Y\) is the random variable for the total shown by the two dice
- What is the probability that \(X+Y=7\)?
- Calculation is called a convolution. Notice the symmetry!
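A minimal sketch of this convolution for two fair dice, assuming the rolls are independent so that joint probabilities are products. It builds the full probability mass function of the sum by looping over all 36 pairs:

```python
from collections import Counter
from fractions import Fraction
from itertools import product

faces = range(1, 7)
p = Fraction(1, 6)  # probability of each face of a fair die

# Independence lets us multiply: P(X = a and Y = b) = P(X = a) * P(Y = b).
# Summing those products over all pairs with the same total is the convolution.
pmf_sum = Counter()
for a, b in product(faces, repeat=2):
    pmf_sum[a + b] += p * p

for total in sorted(pmf_sum):
    print(total, pmf_sum[total])
```

Six of the 36 pairs add to 7, so \(P(X+Y=7)=\frac{6}{36}=\frac{1}{6}\), and the probabilities fall off symmetrically on either side of 7.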
Adding random variables: histogram
Adding random variables: Expectation
- Adding random variables is hard, but their mean is easy to calculate
- Weighted average: \(E[X]=\sum xp(x)\)
- Single die has mean \(\frac{1+2+3+4+5+6}{6}=3.5\)
- Expectation is linear: \(E[\sum \alpha X_i] = \alpha \sum E[X_i]\)
- Two dice (\(X_1+X_2\)) has mean \(3.5+3.5=7\)
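As a quick check of linearity, the sketch below computes the mean of one fair die as a weighted average, then computes the mean of the sum of two dice by brute force over all 36 pairs without using linearity. Both routes should agree:

```python
from fractions import Fraction
from itertools import product

faces = range(1, 7)
p = Fraction(1, 6)  # probability of each face of a fair die

# Weighted average: E[X] = sum of x * p(x).
mean_one_die = sum(x * p for x in faces)

# Brute force over all 36 equally likely pairs, without using linearity.
mean_two_dice = sum((a + b) * p * p for a, b in product(faces, repeat=2))

print("E[X]       =", mean_one_die)
print("E[X1 + X2] =", mean_two_dice)
print("E[X1] + E[X2] =", mean_one_die + mean_one_die)
```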
Adding random variables: Variance
- For independent events, \(var(X+Y)=\sigma_{x+y}^2=\sigma^2_x+\sigma_y^2\)
- but \(\sigma_{x+y}\neq \sigma_x+\sigma_y\)
- and \(\sigma^2_{\alpha x} = \alpha^2 \sigma^2_x\)
- In general, \(\sigma^2_{x+y} = \sigma_x^2 +\sigma_y^2 + 2\rho\sigma_x\sigma_y\)
- \(\rho\) is the correlation coefficient
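These variance rules can be verified exactly with dice. The sketch below (the helper `variance` is a hypothetical name) compares one die, the sum of two independent dice, and a doubled die \(2X\), which is the same as adding two perfectly correlated rolls (\(\rho=1\)):

```python
from fractions import Fraction
from itertools import product
from math import sqrt

faces = range(1, 7)

def variance(values):
    """Population variance of a list of equally likely values."""
    values = list(values)
    mean = Fraction(sum(values), len(values))
    return sum((v - mean) ** 2 for v in values) / len(values)

var_one = variance(faces)                                       # one die
var_sum = variance(a + b for a, b in product(faces, repeat=2))  # X + Y, independent
var_double = variance(2 * a for a in faces)                     # 2X, i.e. rho = 1

sd_one = sqrt(var_one)
sd_sum = sqrt(var_sum)

print("var(X)     =", var_one)
print("var(X + Y) =", var_sum, "(variances add)")
print("var(2X)    =", var_double, "(alpha squared times var(X))")
print("sd(X + Y)  =", round(sd_sum, 4), "vs sd(X) + sd(Y) =", round(2 * sd_one, 4))
```

With \(\rho=1\) the general formula gives \(\sigma^2+\sigma^2+2\sigma\sigma=4\sigma^2\), matching `var_double`.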
Continuous Random Variables
- Variables that take uncountable values are called continuous random variables (e.g. height, distance)
- The function that generates the histogram is called the probability density function, \(f(x)\). \(\int_{-\infty}^{\infty} f(x)dx=1\)
- The value at a single point does not give a probability. We need to take the area under the curve
- The function \(F(x)=P(X\le x)\) is called the cumulative distribution function (CDF). It is \(\int_{-\infty}^xf(t)dt\)
Adding random variables: Normal Distribution
- The general formula for the density of a sum of two independent random variables is a nightmare (don’t memorize): \(f_{x+y}(z)=\int_{-\infty}^{\infty}f_x(t)f_y(z-t)dt\)
- One type of random variable lets us easily calculate the distribution of a mean (and sum) of random variables:
- \(f(x)=\frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}\)
- This is a normal random variable with mean \(\mu\) and standard deviation \(\sigma\)
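As a sanity check on the normal density in its standard form, the sketch below codes it by hand and compares it with `NormalDist.pdf` from Python's standard library. The function name `normal_pdf` and the parameter values are hypothetical choices for illustration:

```python
from math import exp, pi, sqrt
from statistics import NormalDist

def normal_pdf(x, mu, sigma):
    """Normal density with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return exp(-0.5 * z * z) / (sigma * sqrt(2 * pi))

mu, sigma = 1.0, 0.5  # hypothetical parameter values
points = (0.0, 1.0, 2.5)

for x in points:
    by_hand = normal_pdf(x, mu, sigma)
    library = NormalDist(mu, sigma).pdf(x)
    print(f"x={x}: formula={by_hand:.6f}, library={library:.6f}")
```

Note the negative sign in the exponent and the \(\sigma\) in the denominator: without them the function would not integrate to 1.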
Adding random variables: Normal Distribution
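- If X is normal we write this as \(X \sim N(\mu,\sigma)\), where the second argument is the standard deviation
- If \(X_1,...,X_n\) are independent \(N(\mu,\sigma)\), then \(\frac{X_1+X_2+...+X_n}{n} \sim N(\mu,\frac{\sigma}{\sqrt{n}})\)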
- Almost no other distributions have this property (“closed under convolutions”)
Normal Distribution Usage
- Given a normal distribution with mean 0 and standard deviation 1, what is the probability that \(-1<x<1\)?
- Calculate using the area under the curve from -1 to 1
- Can’t do by hand. Use either a table or the function pnorm in R (\(\Phi(z)\))
- 68% of data lies between -1 and +1 standard deviation (=pnorm(1)-pnorm(-1))
- 95% of data is within 2 standard deviations of the mean, and 99.7% within 3 standard deviations
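The same areas can be computed in Python, where `statistics.NormalDist().cdf` plays the role of R's `pnorm` (using Python here is an assumption of convenience; `prob_within` is a hypothetical helper name):

```python
from statistics import NormalDist

Z = NormalDist(mu=0, sigma=1)  # standard normal

def prob_within(k):
    """Area under the standard normal curve between -k and +k."""
    return Z.cdf(k) - Z.cdf(-k)

for k in (1, 2, 3):
    print(f"P(-{k} < Z < {k}) = {prob_within(k):.4f}")
```

The "2 standard deviations" rule is a rounding: the exact cutoff for 95% is about 1.96.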
Normal distribution plot
Interpreting Mean and SD
The following is a normal distribution with mean 1 and standard deviation 1. What will the graph look like if instead we have mean 2 and standard deviation 1/2?
Random variables: Population vs Sampling: graph
https://mcfortran.shinyapps.io/sampling/
Sample Statistics as Random variables
- Our sample statistic, e.g. \(\bar x\) will change every time we use a different sample
- Each individual observation, \(x_i\), is a random variable. Our sample statistic is then a transformation of this vector of \(x_i\)s: \(\bar x = (x_1+x_2 + ... + x_n)/n\)
- The sampling distribution described above will be different from the population distribution (each \(x_i\) is just a random draw from the population distribution)
Sampling Moments
- Sampling distributions have their own mean and variance (and other moments)
- \(E[\bar x]=\mu\)
- \(var(\bar X) = \frac{var(X)}{n}=\frac{\sigma^2}{n}\)
- \(sd(\bar x)=\frac{\sigma}{\sqrt n}\)
- How to calculate the probability distribution?
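For a small discrete case the sampling distribution can be computed exactly by enumeration. The sketch below assumes a fair die as the population and an arbitrary sample size of 3, builds the distribution of \(\bar x\) over all possible samples, and checks its mean and variance against \(\mu\) and \(\sigma^2/n\):

```python
from collections import Counter
from fractions import Fraction
from itertools import product

faces = range(1, 7)
n = 3  # sample size (an arbitrary small choice for illustration)

# Population mean and variance of one fair die roll.
mu = Fraction(sum(faces), 6)
sigma2 = sum((x - mu) ** 2 for x in faces) / 6

# Exact sampling distribution of the sample mean: enumerate all 6**n samples.
samples = list(product(faces, repeat=n))
dist_xbar = Counter()
for s in samples:
    dist_xbar[Fraction(sum(s), n)] += Fraction(1, len(samples))

mean_xbar = sum(v * p for v, p in dist_xbar.items())
var_xbar = sum((v - mean_xbar) ** 2 * p for v, p in dist_xbar.items())

print("E[xbar]   =", mean_xbar, " (mu =", mu, ")")
print("var(xbar) =", var_xbar, " (sigma^2 / n =", sigma2 / n, ")")
```

Enumeration stops being practical quickly as n grows, which is one motivation for an approximation that works for large samples.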
Sampling Moments
- As \(n\to\infty\), \(\bar x\) approaches a distribution with mean \(\mu\) and variance 0
- We can state this a bit more precisely using the central limit theorem
- Caveat: we have assumed our samples are drawn independently. If observations are related to each other this is violated
Administrative Miscellanea
- Homework 2 due next Wednesday before class
- Problem set 1 due Friday at midnight
- Quiz 1 next Wednesday in class
- Finish review, start with bivariate OLS today
- I’ll skip on programming lab 2 that was scheduled for Monday
- Just make sure you’re set up for the first problem set
- Please come to office hours (or schedule time) if having issues
Central Limit Theorem
- If we have independent, identically distributed (iid) random samples from a population with finite variance the Central Limit Theorem applies:
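- For large n, \(\bar x\) is approximately \(N(\mu,\frac{\sigma}{\sqrt{n}})\); formally, \(\frac{\bar x-\mu}{\sigma/\sqrt{n}}\to N(0,1)\) in distribution as \(n\to\infty\)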
- Once samples get large our sampling distribution becomes normally distributed with declining variance, regardless of the shape of our population distribution
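A small simulation can illustrate the theorem. The sketch below is not from the slides: it assumes an exponential population with rate 1 (strongly skewed, with mean 1 and standard deviation 1), and the sample size, number of repetitions, and seed are arbitrary choices. It standardizes each sample mean and checks how often it lands within 1 and 2 standard deviations:

```python
import random
from math import sqrt

random.seed(0)  # fixed seed so the sketch is reproducible

n = 50        # observations per sample (hypothetical choice)
reps = 5000   # number of repeated samples (hypothetical choice)
mu, sigma = 1.0, 1.0  # mean and sd of an exponential with rate 1

# Draw many samples from a skewed population and standardize each sample mean.
z_scores = []
for _ in range(reps):
    xbar = sum(random.expovariate(1.0) for _ in range(n)) / n
    z_scores.append((xbar - mu) / (sigma / sqrt(n)))

share_within_1 = sum(abs(z) < 1 for z in z_scores) / reps
share_within_2 = sum(abs(z) < 2 for z in z_scores) / reps

print(f"share within 1 sd: {share_within_1:.3f} (normal: about 0.683)")
print(f"share within 2 sd: {share_within_2:.3f} (normal: about 0.954)")
```

Even though individual draws are far from normal, the shares for the sample mean should land close to the normal benchmarks.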
- Many consider this to be the most beautiful theorem in mathematics
Central Limit Theorem - Illustration
https://mcfortran.shinyapps.io/sampling/
Central Limit Theorem - Intuition 1
- When we added dice, we had natural symmetry. If our distributions are identical this will always arise
- And if they are independent then we just need to multiply the probabilities piecewise
- So i.i.d. distributions should result in this symmetry
- Outliers become exponentially more unlikely and smoothed around the center
Central Limit Theorem - Intuition 2
- Moments (\(E[X^n]\)) had easy math. So calculate every moment and translate this back to probabilities
- This is done with a moment generating function: the Laplace transform of the probability function
- \(M_X(t)=E[e^{tX}] \implies E[X^n]=\left.\frac{\partial^n M_X(t)}{\partial t^n}\right|_{t=0}\)
- Far beyond this course, but a common technique in graduate level math
- This is how you formally prove the central limit theorem - The third moments and higher all decline very quickly, leaving only the mean and variance
Central Limit Theorem - Intuition 3
- The normal distribution stays normal when taking averages
- The normal distribution is the only finite variance distribution with this property
- So any non-normal distribution will end up gravitating towards a normal distribution
- Formula: the Gaussian \(e^{-x^2/2}\) is its own Fourier transform under the unitary convention (related to the Laplace transform)
- The Fourier transform makes convolutions easier to calculate. See convolution theorem.
Cross Moments: covariance and correlation
- For two random variables (or columns of data) the average product of the two, \(E[XY]\), captures important information
- Like variance, we subtract out the mean to make it more interpretable
Cross Moments: covariance and correlation
- We define \(cov(X,Y)=E[(x-\mu_x)(y-\mu_y)]\)
- If X and Y are both above their mean, the product is positive. If they’re both below the mean it’s also positive
- If one value is above the mean and the other is below their product is negative
- This then gives a measure of how closely associated X and Y are. If they move in the same direction it is positive; if they move in opposite directions it is negative. If they are independent it is zero
Cross Moments: covariance and correlation
- As long as \(cov(X,Y)\neq 0\), if we learn the value of X, we also learn something about Y
- If X and Y are independent, the covariance is 0. The reverse is not true though!
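A standard counterexample shows why the reverse fails. In the sketch below (an illustration, not from the slides), X takes the values -1, 0, 1 with equal probability and \(Y=X^2\), so Y is completely determined by X, yet the covariance is exactly zero:

```python
from fractions import Fraction

# X is -1, 0, or 1 with equal probability; Y = X**2 is fully determined by X.
outcomes = [(x, x * x) for x in (-1, 0, 1)]
p = Fraction(1, 3)

mean_x = sum(x * p for x, _ in outcomes)
mean_y = sum(y * p for _, y in outcomes)
cov_xy = sum((x - mean_x) * (y - mean_y) * p for x, y in outcomes)

# Independence would require P(X=1, Y=1) == P(X=1) * P(Y=1).
p_joint = sum(p for x, y in outcomes if x == 1 and y == 1)
p_x1 = sum(p for x, _ in outcomes if x == 1)
p_y1 = sum(p for _, y in outcomes if y == 1)

print("cov(X, Y)        =", cov_xy)
print("P(X=1, Y=1)      =", p_joint)
print("P(X=1) * P(Y=1)  =", p_x1 * p_y1)
```

Covariance only picks up linear association, so a symmetric, non-linear dependence like this one goes undetected.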
Cross moments: correlation
- Standardize further: standardize covariance to be in the range \([-1,1]\) by dividing by \(\sigma_x\sigma_y\). Called correlation
- \(\rho\equiv cov(X,Y)/(\sigma_x\sigma_y)\)
- The sample statistic for \(\rho\) is usually called \(r\) instead of \(\hat\rho\)
Correlation: visualization
https://mcfortran.shinyapps.io/correlation/
Correlation: Calculation
x y
1: 1 3
2: 2 5
3: 3 10
4: 4 8
5: 5 12
[Figure: scatterplot of x and y]
- Is \(r>0, r<0,\) or, \(r=0\)?
Correlation: Calculation
- Steps to calculate?
- \(cov(x,y)=E[(x-\bar x)(y-\bar y)]\):
- Calculate the mean of x and of y
- subtract the mean from each observation
- multiply the two results
- take the average
- Once we have covariance, standardize to get \(r\) (or \(\rho\))
- \(cov(x,y)/(\sigma_x\sigma_y)\)
- \(\sigma\) calculated as before: subtract out the mean from each observation, square it, take the average, and then take the square root
Correlation: Calculation
x y xbar ybar x-xbar y-ybar product
1: 1 3 3 7.6 -2 -4.6 9.2
2: 2 5 3 7.6 -1 -2.6 2.6
3: 3 10 3 7.6 0 2.4 0.0
4: 4 8 3 7.6 1 0.4 0.4
5: 5 12 3 7.6 2 4.4 8.8
x y xbar ybar x-xbar y-ybar product devx devy
1: 1 3 3 7.6 -2 -4.6 9.2 4 21.16
2: 2 5 3 7.6 -1 -2.6 2.6 1 6.76
3: 3 10 3 7.6 0 2.4 0.0 0 5.76
4: 4 8 3 7.6 1 0.4 0.4 1 0.16
5: 5 12 3 7.6 2 4.4 8.8 4 19.36
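The table can be finished off in a few lines of Python (a sketch using only the standard library; variable names mirror the table's columns). The product, devx, and devy columns are summed and combined into \(r\):

```python
from math import sqrt

x = [1, 2, 3, 4, 5]
y = [3, 5, 10, 8, 12]
n = len(x)

xbar = sum(x) / n
ybar = sum(y) / n

# Column sums from the table: product, devx, devy.
sum_products = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
sum_devx = sum((xi - xbar) ** 2 for xi in x)
sum_devy = sum((yi - ybar) ** 2 for yi in y)

# The 1/n (or 1/(n-1)) factors cancel between numerator and denominator,
# so r can be computed directly from the column sums.
r = sum_products / sqrt(sum_devx * sum_devy)

print("sum of products:", round(sum_products, 2))
print("sum of devx, devy:", round(sum_devx, 2), round(sum_devy, 2))
print("r =", round(r, 4))
```

Because the divisor cancels, \(r\) is the same whether the covariance and standard deviations are computed with n or with n - 1, as long as the choice is consistent.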
Correlation: Observation
- Once we subtract out the mean of x and y, the data are always centered at the origin of the x-y plane.
- We can now calculate the contribution to covariance by looking at x*y for each point.
- An observation at point (x,y) after subtracting \(\bar x, \bar y\) will always contribute the same amount to the covariance
- Points near y=x or y=-x, and points that are further from the center, have the largest impact
Correlation: Observation
- More extreme observations (outliers) are highly influential
- Note: points exactly on y=x may contribute little to the covariance if they are close to (0,0), but they also contribute little to both \(\sigma_x\) and \(\sigma_y\), so \(\rho\) may not be affected much