- multicollinearity
- homoskedasticity
- omitted variable bias
- confidence intervals and t-tests
Also: expect a new computer assignment next week.
October 13, 2016
Two variables are collinear when they are highly correlated. If \(x_1\) and \(x_2\) are perfectly correlated we are essentially estimating:
\[\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_1 + \dots + \hat{\beta}_k x_k\]
OLS has no way to tell \(\hat{\beta}_1\) and \(\hat{\beta}_2\) apart; only their sum is pinned down.
This is impossible, and R will automatically drop one of the collinear variables.
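As a quick illustration (a sketch using the mtcars data that appears later in these notes), make a copy of wt that is perfectly collinear with it; R keeps one and reports NA for the redundant copy:

mtcars$wt2 <- mtcars$wt * 2             # perfectly collinear with wt
coef(lm(mpg ~ wt + wt2, data = mtcars))

## (Intercept)          wt         wt2 
##   37.285126   -5.344472          NA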
One way to (accidentally) get perfect multicollinearity is the dummy variable trap.
\[wage = \beta_0 + \beta_1 male + \beta_2 female + \beta_3 age + u\]
How would we interpret \(\beta_1\) and \(\beta_2\)?
Another example could be the same variable measured in different units.
In practice you'll often have two or more variables that are imperfectly collinear.
In this situation your regression will probably be fine overall, but the collinear coefficients can behave oddly: running the same model on a new sample might result in very different estimates for the collinear variables.
If \(x_2\) and \(x_3\) are collinear but you're interested in \(x_1\), you'll be fine.
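To see this instability, here is a small simulation sketch (the variables and numbers are made up for illustration): two nearly collinear regressors whose individual estimates wobble from sample to sample even though their sum is stable.

set.seed(42)
n  <- 50
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.05)   # almost perfectly correlated with x1
y  <- 1 + 2*x1 + 3*x2 + rnorm(n)
coef(lm(y ~ x1 + x2))            # rerun with a new seed: the x1 and x2
                                 # estimates swing, but their sum stays near 5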
We face heteroskedasticity when the variance of the error term isn't constant across different combinations of our \(x\) variables.
(Remember: hetero means different, homo means the same. Homogenized milk is consistent; "heterogenized" milk would be cottage cheese.)
For example, picture a regression of restaurant tips on the total bill: the variance of the errors increases with the size of the bill.
When we have homoskedasticity (same residual variance throughout) our estimates of standard error will be accurate.
Later in the semester we will see ways to deal with heteroskedasticity.
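In the meantime, we can fake some heteroskedastic data to see the pattern (a hypothetical tips-and-bills example, not real data): the residuals fan out as the bill grows.

set.seed(7)
bill <- runif(200, 10, 100)
tip  <- 2 + 0.15 * bill + rnorm(200, sd = 0.02 * bill)  # error sd grows with the bill
plot(bill, resid(lm(tip ~ bill)))                       # spread fans out left to right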
We're estimating the relationship between \(y\) and \(x\) while assuming that the Gauss-Markov conditions hold.
An important way they might not hold is if our error term isn't well behaved because we've omitted an important variable.
Suppose you estimate that a school's average SAT score increases by 20 points when per-student spending increases by $100.
Can you conclude that richer schools are better schools?
Richer schools probably also have richer parents. Richer parents are more likely to pay for test prep.
If you don't control for this, your estimate of the effects of school spending will be biased upward.
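Here is a simulation sketch of that upward bias (hypothetical numbers: spending truly adds 2 points per unit, parental wealth adds 3): omitting wealth inflates the spending coefficient.

set.seed(5)
n        <- 1000
wealth   <- rnorm(n)
spending <- 5 + wealth + rnorm(n)                   # richer districts spend more
sat      <- 100 + 2*spending + 3*wealth + rnorm(n)
coef(lm(sat ~ spending))                            # biased upward (around 3.5)
coef(lm(sat ~ spending + wealth))                   # close to the true 2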
Multiple linear regression lets us see the effect of \(x_i\) on \(y\) holding constant \(x_j\) (where \(x_j\) is any other variable you included in your model).
The model is linear in the parameters but we can approximate non-linear effects with the right (in)dependent variables (e.g. \(ln(y)\) or \(x^2\)).
Ordinary Least Squares (OLS) provides estimates of the slope parameters that represent the partial effect of a change in a variable (i.e. holding constant all the other independent variables).
We can measure how well our model fits the data by calculating the \(R^2\) which is a ratio showing how much of the total variation our model explains. A higher \(R^2\) is not always a good thing.
Gauss-Markov Assumptions (1-4):
- linear in parameters
- random sampling
- no perfect collinearity
- zero conditional mean: \(E(u|x_1,\dots,x_k) = 0\)
If these assumptions hold, our OLS estimators are unbiased: they will tend to reflect the true relationship between our \(x\) variables and \(y\).
Number 4 implies that we've correctly specified the model so that (for example) \(u\) isn't capturing the effect of a variable that matters but we left out of the model.
The fifth Gauss-Markov Assumption: homoskedasticity.
Without homoskedasticity, our estimates of the standard error become unreliable and it becomes harder to do hypothesis testing.
We're adding an extra assumption: not only is \(u\) independent of our \(x\) variables, but it's normally distributed.
\[u \sim N(mean = 0,variance = \sigma^2)\]
This is a good starting point because it makes hypothesis testing relatively easy.
Recall: the factors we're not explicitly including in our model get rolled into the error term. If these sources of randomness are additive then we can appeal to the Central Limit Theorem to justify this assumption.
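Here is a quick sketch of that CLT intuition: add up many small independent shocks and the total looks normal even though each individual shock isn't.

set.seed(1)
u <- replicate(10000, sum(runif(40, -1, 1)))  # each u is the sum of 40 small uniform shocks
hist(u, breaks = 50)                          # roughly bell-shaped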
fit1 <- lm(mpg ~ hp + wt + disp, data = mtcars)
summary(fit1)

## 
## Call: lm(formula = mpg ~ hp + wt + disp, data = mtcars)
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 37.105505   2.110815  17.579  < 1e-04 ***
## hp          -0.031157   0.011436  -2.724  0.01097 *  
## wt          -3.800891   1.066191  -3.565  0.00133 ** 
## disp        -0.000937   0.010350  -0.091  0.92851    
## 
## Residual standard error: 2.639 on 28 degrees of freedom
## Multiple R-squared: 0.8268, Adjusted R-squared: 0.8083 
## F-statistic: 44.57 on 3 and 28 DF, p-value: < 1e-04
confint(fit1)
##                    2.5 %       97.5 %
## (Intercept) 32.78169625 41.429314293
## hp          -0.05458171 -0.007731388
## wt          -5.98488310 -1.616898063
## disp        -0.02213750  0.020263482
confint(fit1, parm = 2:4, level = 0.99)
##            0.5 %        99.5 %
## hp   -0.06275665  0.0004435502
## wt   -6.74705515 -0.8547260192
## disp -0.02953607  0.0276620523
Recall from earlier:
\[t = \frac{\beta_i - \beta_{NULL}}{se(\beta_i)}\]
Where the standard error is our estimate of the standard deviation of \(\beta_i\).
The t distribution lets us figure out the probability of our estimate under the null hypothesis.
We can also start with a specific probability (e.g. 95%) and find the values of \(t\) that correspond with that probability. Those values (e.g. -2 and +2) tell us how many standard deviations wide our confidence interval needs to be.
For 28 degrees of freedom, and 95% confidence, what are the critical values of \(t\)?
qt(.975,28)
## [1] 2.048407
qt(.025,28)
## [1] -2.048407
If we multiply those numbers by the standard error of the wt coefficient (1.066191), we find how far above or below the coefficient estimate we go to make our confidence interval.
qt(.975,28) * 1.066191
## [1] 2.183993
We estimate that an extra ton of weight reduces gas mileage by 3.8 mpg (point estimate), but we're 95% "confident" the true effect is 3.8 mpg give or take 2.18 mpg (interval estimate).
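To check, we can build the wt interval by hand; it matches the confint output below.

-3.800891 + c(-1, 1) * qt(.975, 28) * 1.066191

## [1] -5.984884 -1.616898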
##                    2.5 %       97.5 %
## (Intercept) 32.78169625 41.429314293
## hp          -0.05458171 -0.007731388
## wt          -5.98488310 -1.616898063
## disp        -0.02213750  0.020263482
We know how the t-test works, but let's see how the computer is calculating the standard error…
Review question: what is the relationship between variance and standard deviation? (Answer: the standard deviation is the square root of the variance, which is why we'll take a square root at the end of this calculation.)
\[ var(\hat{\beta_j}) = \frac{\sigma^2}{SST_j(1-R^2_j)}\]
\(\hat{\beta_j}\) is a random variable because it varies based on what sample we (randomly) pick. This equation calculates the variance of this random variable.
\(\sigma^2 = var(y|x)\)… the conditional variance of y.
\[ var(\hat{\beta_j}) = \frac{\sigma^2}{SST_j(1-R^2_j)}\]
\(SST_j = \sum_{i=1}^n (x_{ij} - \bar{x_j})^2\)… divided by \(n-1\) would give us the variance of this \(x\) variable. In this case we can think of it as total variation in \(x_j\).
\(R^2_j\) is the \(R^2\) (ratio of explained variation) from a regression of this \(x\) variable (\(x_j\)) on the other \(x\) variables (\(x_i, i\neq j\)) in our model.
\[ var(\hat{\beta_j}) = \frac{\sigma^2}{SST_j(1-R^2_j)}\]
Notice: a bigger \(\sigma^2\) (noisier \(y\)) means a bigger \(var(\hat{\beta_j})\); a bigger \(SST_j\) (more variation in \(x_j\)) means a smaller one; and a bigger \(R^2_j\) (more collinearity) means a bigger one.

This is what's going on under the hood when R calculates the standard error of the wt coefficient…
fitwt  <- lm(wt ~ hp + disp, mtcars)             # to get R^2_wt
R2wt   <- summary(fitwt)$r.squared
SSTwt  <- with(mtcars, sum((wt - mean(wt))^2))   # SS deviations from mean
sigma2 <- summary(fit1)$sigma^2
sigma2 / (SSTwt * (1 - R2wt))                    # variance of beta_wt
## [1] 1.136762
(sigma2/(SSTwt*(1-R2wt)))^0.5 # standard error
## [1] 1.066191
That matches the standard error reported for wt in the regression output:

## 
## Call: lm(formula = mpg ~ hp + wt + disp, data = mtcars)
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 37.105505   2.110815  17.579  < 1e-04 ***
## hp          -0.031157   0.011436  -2.724  0.01097 *  
## wt          -3.800891   1.066191  -3.565  0.00133 ** 
## disp        -0.000937   0.010350  -0.091  0.92851    
## 
## Residual standard error: 2.639 on 28 degrees of freedom
## Multiple R-squared: 0.8268, Adjusted R-squared: 0.8083 
## F-statistic: 44.57 on 3 and 28 DF, p-value: < 1e-04
\[t = \frac{\beta_i - \beta_{NULL}}{se(\beta_i)}\]
Under the null hypothesis that \(x_i\) doesn't actually affect \(y\), we test whether our observed \(\hat{\beta_i}\) is significantly different from 0.
We compare the t-statistic with the t distribution for the appropriate number of degrees of freedom (\(n-k-1\)).
That gives us the p-value: the probability of finding a \(\hat{\beta_i}\) at least as extreme as ours if the null hypothesis is true.
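We can reproduce the wt line of the output by hand (using the estimate and standard error from above):

t_wt <- -3.800891 / 1.066191   # t value of -3.565, as in the summary
2 * pt(-abs(t_wt), df = 28)    # two-sided p-value: about 0.00133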
Commentators on social science (e.g. economics, psychology) have been worrying about "p-hacking."
Even people trained in statistics can be fooled by randomness.
Selection effects can lead to weird outcomes.
Any one estimate should be taken with a grain of salt.
When studying something you will try several different specifications, sometimes different datasets, etc.
But you'll ultimately report on just a handful of models.
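As a toy illustration of why that's a problem (pure noise, no real relationships): run enough regressions and a few will look "significant" by chance.

set.seed(99)
pvals <- replicate(100, summary(lm(rnorm(30) ~ rnorm(30)))$coefficients[2, 4])
sum(pvals < 0.05)   # expect roughly 5 false positives out of 100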