October 13, 2016

Topics

  • multicollinearity
  • homoskedasticity
  • omitted variable bias
  • confidence intervals and t-tests

Also: expect a new computer assignment next week.

Recap of OLS assumptions for Multiple Regression

  1. Linear in parameters
  2. Random sampling
  3. No perfect multicollinearity
  4. Zero conditional mean (\(E(u|x) = 0\))
  5. Homoskedasticity
  • Assumptions 1, 2, and 4 should look familiar already.

Multicollinearity

Two variables are collinear when they are highly correlated. If \(x_1\) and \(x_2\) are perfectly correlated, we are essentially trying to estimate:

\[\hat{y} = \hat{\beta_0} + \hat{\beta_1} x_1 + \hat{\beta_2} x_1 + ... + \hat{\beta_k} x_k\]

OLS can't separate \(\hat{\beta_1}\) from \(\hat{\beta_2}\), so this is impossible to estimate; R will automatically drop one of the collinear variables.

Multicollinearity

One way to (accidentally) get perfect multicollinearity is the dummy variable trap.

\[wage = \beta_0 + \beta_1 male + \beta_2 female + \beta_3 age + u\]

How would we interpret \(\beta_1\) and \(\beta_2\)?
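
A quick sketch of the trap with simulated data (the wage, male, female, and age values below are invented purely for illustration): lm() can't estimate both dummies alongside an intercept, so it reports NA for the redundant one.

set.seed(1)
n <- 100
male   <- rbinom(n, 1, 0.5)
female <- 1 - male                          # perfectly collinear with the intercept and male
age    <- sample(22:65, n, replace = TRUE)
wage   <- 15 + 5 * male + 0.3 * age + rnorm(n, sd = 3)
coef(lm(wage ~ male + female + age))        # the female coefficient comes back NA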

Multicollinearity

Another example could be the same variable measured in different units.

  • How much you complain about flying is certainly a function of how many inches tall you are, but
  • you can't figure out that effect if you try to hold constant your height in centimeters.
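
The same thing happens in R when one variable shows up in two different units. Here wt (recorded in 1,000 lbs in mtcars) sits alongside a kilogram version created just for this illustration:

mtcars$wt_kg <- mtcars$wt * 1000 * 0.4536   # the same weights, just converted to kilograms
coef(lm(mpg ~ wt + wt_kg, data = mtcars))   # wt_kg is perfectly collinear with wt, so it gets NA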

Multicollinearity in practice

In practice you'll often have two or more variables that are imperfectly collinear.

In this situation your regression will probably be fine overall, but the collinear coefficients can behave oddly: running the same model on a new sample might produce very different estimates for the collinear variables.

If \(x_2\) and \(x_3\) are collinear but you're interested in \(x_1\), you'll be fine.
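
A rough illustration with mtcars (the exact numbers aren't shown here; run it to see them): wt and disp are strongly but not perfectly correlated, and the standard error on wt grows noticeably once disp enters the model.

cor(mtcars$wt, mtcars$disp)    # roughly 0.89: highly, but not perfectly, correlated
summary(lm(mpg ~ wt, data = mtcars))$coefficients["wt", "Std. Error"]
summary(lm(mpg ~ wt + disp, data = mtcars))$coefficients["wt", "Std. Error"]
# the wt standard error roughly doubles once the collinear disp is included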

Homoskedasticity

We face heteroskedasticity when the variance of the error term isn't constant across different combinations of our \(x\) variables.

(Remember: hetero means different, homo means the same. Homogenized milk is consistent; heterogenized milk is cottage cheese.)

Homoskedasticity

For example, imagine regressing restaurant tips on the size of the bill: the variance of the errors increases with the size of the bill.
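
A minimal simulated sketch of that pattern (the bill and tip numbers are invented; only the fanning-out shape matters):

set.seed(1)
bill <- runif(200, 5, 100)
tip  <- 1 + 0.15 * bill + rnorm(200, sd = 0.05 * bill)  # the error sd grows with the bill
plot(bill, tip)                                         # the scatter fans out as bills get larger
abline(lm(tip ~ bill))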

Homoskedasticity

  • When we have homoskedasticity (same residual variance throughout) our estimates of standard error will be accurate.

  • When we have heteroskedasticity (residual variance that changes depending on the level of at least one of the \(x\) variables) our estimates of standard error will be inaccurate.
    • but our \(\beta\) estimates are still unbiased!

Later in the semester we will see ways to deal with heteroskedasticity.

Omitted variable bias

We're estimating the relationship between \(y\) and \(x\) while assuming that the Gauss-Markov conditions hold.

An important way they might not hold is if our error term isn't well behaved because we've omitted an important variable.

Omitted variable bias

You estimate that a school's average SAT score increases by 20 points when per-student spending increases by $100.

Can you conclude that richer schools are better schools?

Omitted variable bias

Richer schools probably also have richer parents. Richer parents are more likely to pay for test prep.

If you don't control for this, your estimate of the effects of school spending will be biased upward.

Overview of Chapter 3

Quick Review

  • Multiple linear regression lets us see the effect of \(x_i\) on \(y\) holding constant \(x_j\) (where \(x_j\) is any other variable you included in your model).

  • The model is linear in the parameters, but we can approximate non-linear effects with the right (in)dependent variables (e.g. \(\ln(y)\) or \(x^2\)).

Quick Review

  • Ordinary Least Squares (OLS) provides estimates of the slope parameters, which represent the partial effect of a change in each variable (i.e. holding constant all the other independent variables).

  • We can measure how well our model fits the data by calculating \(R^2\), the share of the total variation that our model explains. A higher \(R^2\) is not always a good thing.

Quick Review

Gauss-Markov Assumptions (1-4):

  1. Linear in parameters
  2. Random sampling
  3. No perfect multicollinearity
  4. Zero conditional mean (\(E(u|x) = 0\))

If these assumptions hold, our OLS estimators are unbiased: they will tend to reflect the true relationship between our \(x\) variables and \(y\).

Number 4 implies that we've correctly specified the model so that (for example) \(u\) isn't capturing the effect of a variable that matters but we left out of the model.

  • Including irrelevant variables won't bias our estimates, but the model is likely to be less useful.

Quick Review

The fifth Gauss-Markov Assumption: homoskedasticity.

Without homoskedasticity, our estimates of the standard error become unreliable and it becomes harder to do hypothesis testing.

  • When all five assumptions hold, OLS estimators are BLUE
    • Best Linear Unbiased Estimator

Hypothesis testing and confidence intervals

Normality assumption

We're adding an extra assumption: not only is \(u\) independent of our \(x\) variables, but it's normally distributed.

\[u \sim N(\text{mean} = 0,\ \text{variance} = \sigma^2)\]

This is a good starting point because it makes hypothesis testing relatively easy.

Recall: the factors we're not explicitly including in our model get rolled into the error term. If these sources of randomness are additive then we can appeal to the Central Limit Theorem to justify this assumption.

Two types of estimates

  • Usually we're trying to estimate a specific number
    • How many black bears are there in California? About 35000.
    • How many roads must a man walk down? 42
    • How tall is Mt. Everest? 29029 feet
  • These are point estimates.

Two types of estimates

  • But sometimes instead of a single best guess, we want a range of possibilities.
    • There are between 28000 and 42000 bears.
    • A man must walk down 30 to 54 roads.
    • Everest is between 27000 and 30000 feet.
  • These are interval estimates.

Confidence intervals
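
Presumably the model behind this output was fit something like this (it's the fit1 that confint() uses on the next slide):

fit1 <- lm(mpg ~ hp + wt + disp, data = mtcars)
summary(fit1)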

## 
## Call: lm(formula = mpg ~ hp + wt + disp, data = mtcars)
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 37.105505   2.110815  17.579  < 1e-04 ***
## hp          -0.031157   0.011436  -2.724  0.01097 *  
## wt          -3.800891   1.066191  -3.565  0.00133 ** 
## disp        -0.000937   0.010350  -0.091  0.92851    
## 
## Residual standard error: 2.639 on 28 degrees of freedom
## Multiple R-squared:  0.8268, Adjusted R-squared:  0.8083 
## F-statistic: 44.57 on 3 and 28 DF,  p-value: < 1e-04
  • When we estimate that an extra 1,000 lbs of weight reduces gas mileage by 3.800891 miles per gallon, we don't really believe that's the true effect, just our best guess… it's our point estimate

Confidence interval

  • \(\hat{\beta_i}\) is a random variable that we hope will be similar to the true value of \(\beta_i\).
  • It's very unlikely that we've found the exact value of \(\beta_i\), but we're probably pretty close.
  • The confidence interval shows a range of possible values for the \(\beta\) we're estimating that we feel confident covers the true value.
    • e.g. I'm 95% sure that Eli is between 5'6" and 5'11".

Confidence intervals

confint(fit1)
##                   2.5 %       97.5 %
## (Intercept) 32.78169625 41.429314293
## hp          -0.05458171 -0.007731388
## wt          -5.98488310 -1.616898063
## disp        -0.02213750  0.020263482
confint(fit1,2:4,0.99)
##            0.5 %        99.5 %
## hp   -0.06275665  0.0004435502
## wt   -6.74705515 -0.8547260192
## disp -0.02953607  0.0276620523

Side note on probability.

  • There is not a 0.95 probability that the true effect of weight on gas mileage is somewhere between -5.98 and -1.62 mpg per 1,000 lbs.
  • The true effect either is or isn't in that range. (p = 1 or 0, not 0.95)
  • If we repeated our "experiment" many times, 95% of our confidence intervals will include the true value of the parameter we're estimating.
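
A small simulation sketch of the idea (everything here is invented: a known slope of 3 and 2,000 repeated samples of 50 observations): about 95% of the 95% intervals should cover the true slope.

set.seed(42)
covers <- replicate(2000, {
  x <- rnorm(50)
  y <- 2 + 3 * x + rnorm(50)          # the true slope is 3
  ci <- confint(lm(y ~ x))["x", ]
  ci[1] <= 3 && 3 <= ci[2]            # does this sample's interval cover it?
})
mean(covers)                          # should come out close to 0.95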

The t-test

Recall from earlier:

\[t = \frac{\hat{\beta_i} - \beta_{NULL}}{se(\hat{\beta_i})}\]

Where the standard error is our estimate of the standard deviation of \(\hat{\beta_i}\).

The t distribution lets us figure out how likely an estimate as far from \(\beta_{NULL}\) as ours would be if the null hypothesis were true.

The t-test in reverse

We can also start with a specific probability (e.g. 95%) and find the values of \(t\) that correspond with that probability. Those values (e.g. roughly -2 and +2) tell us how many standard errors below and above our estimate the confidence interval needs to reach.

Critical t values

For 28 degrees of freedom, and 95% confidence, what are the critical values of \(t\)?

qt(.975,28)
## [1] 2.048407
qt(.025,28)
## [1] -2.048407

If we multiply those numbers by the standard error of the wt coefficient (1.066191) we find out how far above or below the coefficient estimate we go to make our confidence interval.

Critical t values

qt(.975,28) * 1.066191
## [1] 2.183993

We estimate that an extra 1,000 lbs of weight reduces gas mileage by 3.8 mpg (point estimate), but we're 95% "confident" the true effect is 3.8 mpg give or take 2.18 mpg (interval estimate).

##                   2.5 %       97.5 %
## (Intercept) 32.78169625 41.429314293
## hp          -0.05458171 -0.007731388
## wt          -5.98488310 -1.616898063
## disp        -0.02213750  0.020263482
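
Rebuilding the wt interval by hand, point estimate plus or minus the critical t value times the standard error, reproduces the confint() row above:

-3.800891 + c(-1, 1) * qt(.975, 28) * 1.066191
## [1] -5.984884 -1.616898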

The standard error

We know how the t-test works, but let's see how the computer is calculating the standard error…

Review question: what is the relationship between variance and standard deviation?

Standard error

\[ var(\hat{\beta_j}) = \frac{\sigma^2}{SST_j(1-R^2_j)}\]

\(\hat{\beta_j}\) is a random variable because it varies based on what sample we (randomly) pick. This equation calculates the variance of this random variable.

\(\sigma^2 = var(y|x)\)… the conditional variance of y.

Standard error

\[ var(\hat{\beta_j}) = \frac{\sigma^2}{SST_j(1-R^2_j)}\]

\(SST_j = \sum_{i=1}^n (x_{ij} - \bar{x_j})^2\)… divided by \(n-1\) would give us the variance of this \(x\) variable. In this case we can think of it as total variation in \(x_j\).

\(R^2_j\) is the \(R^2\) (ratio of explained variation) from a regression of this \(x\) variable (\(x_j\)) on the other \(x\) variables (\(x_i, i\neq j\)) in our model.

Standard error

\[ var(\hat{\beta_j}) = \frac{\sigma^2}{SST_j(1-R^2_j)}\]

Notice:

  • larger \(\sigma^2\) means noisier data and greater variance for all the \(\beta\) coefficients.
  • more variation in \(x_j\) reduces the variance of \(\beta_j\).
  • If \(x_j\) is highly correlated with the other \(x\) variables, we get a higher variance for \(\beta_j\). (Multicollinearity!)

Example: wt

This is what's going on under the hood when R calculates your standard errors…

fitwt <- lm(wt ~ hp + disp, mtcars) # To get R^2_wt
R2wt <- summary(fitwt)$r.squared
SSTwt <- with(mtcars,sum((wt - mean(wt))^2)) # SS deviations from mean
sigma2 <- summary(fit1)$sigma^2
sigma2/(SSTwt*(1-R2wt)) # variance of beta_wt
## [1] 1.136762
(sigma2/(SSTwt*(1-R2wt)))^0.5 # standard error
## [1] 1.066191

Example: wt

## 
## Call: lm(formula = mpg ~ hp + wt + disp, data = mtcars)
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 37.105505   2.110815  17.579  < 1e-04 ***
## hp          -0.031157   0.011436  -2.724  0.01097 *  
## wt          -3.800891   1.066191  -3.565  0.00133 ** 
## disp        -0.000937   0.010350  -0.091  0.92851    
## 
## Residual standard error: 2.639 on 28 degrees of freedom
## Multiple R-squared:  0.8268, Adjusted R-squared:  0.8083 
## F-statistic: 44.57 on 3 and 28 DF,  p-value: < 1e-04

Review: the t-test

\[t = \frac{\hat{\beta_i} - \beta_{NULL}}{se(\hat{\beta_i})}\]

Under the null hypothesis that \(x_i\) has no effect on \(y\) (\(\beta_{NULL} = 0\)), we test whether our observed \(\hat{\beta_i}\) is statistically distinguishable from 0.

We compare the t-statistic with the t distribution for the appropriate number of degrees of freedom (\(n-k-1\)).

That gives us the p-value: the probability of finding an estimate at least as extreme as \(\hat{\beta_i}\) if the null hypothesis is true.
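
For example, reconstructing the wt test from the summary table a few slides back (a sketch using the rounded numbers printed there):

t_wt <- (-3.800891 - 0) / 1.066191   # (estimate - null) / standard error
t_wt                                 # about -3.565, the t value reported for wt
2 * pt(-abs(t_wt), df = 28)          # about 0.00133, the reported Pr(>|t|)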

Troubles with p-values

Troubles with p-values

  • Low p-values imply high statistical significance.
  • Low p-values do not mean the effect is actually important.
  • Large samples are likely to yield low p-values even when the magnitude of the effect is tiny (economically insignificant); see the sketch after this list.
  • Example: A p-value might be 0.000001 for a coefficient that tells us a trillion dollar increase in toothpick expenditures would reduce unemployment by .1%
    • Statistically significant but economically insignificant.
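
A simulated sketch of that point (the million-observation sample and the 0.005 slope below are invented): with enough data, even a trivially small effect gets a tiny p-value.

set.seed(1)
n <- 1e6
x <- rnorm(n)
y <- 0.005 * x + rnorm(n)                # a tiny true effect
summary(lm(y ~ x))$coefficients["x", ]   # the p-value is tiny, but so is the estimated effect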

P-hacking in the news

Lesson:

Even people trained in statistics can be fooled by randomness.

Selection effects can lead to weird outcomes.

Robustness checks

Robustness checks

  • Any one estimate should be taken with a grain of salt.

  • When studying something you will try several different specifications, sometimes different datasets, etc.

  • But you'll ultimately report on just a handful of models.

Robustness checks

  • You should give your audience some indication of what the rest of the models looked like.
    • Do your results hold up when outliers are excluded?
    • What about under different model specifications?
  • It's better to say
    • "I find some evidence of X but the relationship doesn't always hold up," than
    • "I find evidence of X, please don't look behind the curtain."