Hi everyone! Welcome to week 3 lab! Today we’re going to go over your problem set and get help. We’re going to be working with ggplot, lm, and some heteroskedasticity-corrected regressions.
A few reminders about your problem set: First, make sure to put all of your answers in the text document you submit (NOT THE R FILE).
Also, make sure to start this homework early! It’s a really long one with a lot of different parts. The only question that requires R is question 2, so not much coding is required, but we can help with all of the problems. Be sure to reach out.
Part 1: We need to load up our package toolkit. We’ll grab tidyverse and broom, along with pacman. We’re also grabbing lfe to get access to heteroskedastic-consistent standard errors.
library(pacman)
p_load(tidyverse, broom, lfe)
If you don’t have pacman yet, run install.packages('pacman') first.
If for some reason pacman doesn’t work for you, you can run the following lines:
library(tidyverse)
library(broom)
library(lfe)
Our second step, as per usual, is to get our dataset and take a peek at it.
#where 'yourpath' is the file path of your problem set data, surrounded by single or double quotes:
dataset <- read_csv(yourpath)
## Parsed with column specification:
## cols(
## prob_q5_q1 = col_double(),
## i_urban = col_double(),
## share_black = col_double(),
## share_middleclass = col_double(),
## share_divorced = col_double(),
## share_married = col_double()
## )
Ok, who can remember how to look at the first few rows of a df? Last few rows?
#answer:
(head(dataset, 3))
## # A tibble: 3 x 6
## prob_q5_q1 i_urban share_black share_middlecla… share_divorced
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.0621 1 0.0208 0.548 0.110
## 2 0.0537 1 0.0198 0.538 0.116
## 3 0.0731 0 0.0146 0.467 0.113
## # … with 1 more variable: share_married <dbl>
#answer:
tail(dataset, 3)
## # A tibble: 3 x 6
## prob_q5_q1 i_urban share_black share_middlecla… share_divorced
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.115 1 0.00638 0.564 0.0989
## 2 0.0866 0 0.00685 0.575 0.124
## 3 0.109 1 0.0432 0.514 0.116
## # … with 1 more variable: share_married <dbl>
Ok, who can remember how to get a collection of summary statistics for the df?
summary(dataset)
## prob_q5_q1 i_urban share_black share_middleclass
## Min. :0.02210 Min. :0.0000 Min. :0.0001596 Min. :0.2848
## 1st Qu.:0.06588 1st Qu.:0.0000 1st Qu.:0.0040385 1st Qu.:0.5001
## Median :0.08889 Median :0.0000 Median :0.0240551 Median :0.5523
## Mean :0.09761 Mean :0.4584 Mean :0.0808788 Mean :0.5499
## 3rd Qu.:0.11715 3rd Qu.:1.0000 3rd Qu.:0.0891569 3rd Qu.:0.6082
## Max. :0.35714 Max. :1.0000 Max. :0.6583261 Max. :0.7338
## share_divorced share_married
## Min. :0.03950 Min. :0.3729
## 1st Qu.:0.08513 1st Qu.:0.5444
## Median :0.09818 Median :0.5781
## Mean :0.09710 Mean :0.5726
## 3rd Qu.:0.10861 3rd Qu.:0.6047
## Max. :0.15618 Max. :0.6947
Nice.
Part 2: More about the summary.lm object
I want to take a quick second here to back up a step and poke around with the regression “summary” object, because it gives us more than just coefficients and stars. The technical name for this object is summary.lm, which is where you should look for documentation if you’re trying to check your code.
Let’s say we were running a regression, and we wanted to find out if “share_black” had a substantial impact on our mobility statistic, but we also want to control for whether or not our interested parties live in an urban area. How do we do this? First, we need to create a regression object.
#regression objects take a dataset, and then a formula.
our_first_reg <- lm(data = dataset, prob_q5_q1 ~ share_black + i_urban + share_divorced + share_married + share_middleclass:share_black + share_middleclass -1)
Now we can use summary() to get the coefficients.
summary(our_first_reg)
##
## Call:
## lm(formula = prob_q5_q1 ~ share_black + i_urban + share_divorced +
## share_married + share_middleclass:share_black + share_middleclass -
## 1, data = dataset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.075086 -0.019672 -0.004817 0.013552 0.202358
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## share_black 0.232741 0.063295 3.677 0.000254 ***
## i_urban -0.008592 0.002662 -3.228 0.001304 **
## share_divorced -0.823778 0.065113 -12.652 < 2e-16 ***
## share_married 0.182060 0.022511 8.088 2.66e-15 ***
## share_middleclass 0.160535 0.020026 8.016 4.55e-15 ***
## share_black:share_middleclass -0.779569 0.148158 -5.262 1.90e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.03241 on 703 degrees of freedom
## Multiple R-squared: 0.9119, Adjusted R-squared: 0.9112
## F-statistic: 1213 on 6 and 703 DF, p-value: < 2.2e-16
Awesome! But today we’re talking about how to check for heteroskedasticity. What do we need to look at to drill into heteroskedasticity? The errors! We can get these from our summary object with the following code:
sumobj <- summary(our_first_reg)
#to look at the first few residuals, use head. If you want all of them drop that part.
head(sumobj$residuals)
## 1 2 3 4 5 6
## -0.03186904 -0.03650299 -0.01395689 -0.01758164 -0.03039226 -0.01938582
Ok, nice. There are actually a ton of options here; you can see for yourself!
sumobj$...
We have ‘call’, which spits back our command; ‘terms’, which gives us a bunch of information about what R thinks about our variables; ‘residuals’, which returns our residuals; ‘coefficients’, which returns a matrix of our results; ‘aliased’, which we’ll skip; ‘sigma’, which is the square root of the estimated variance of the random error; some R-squareds; and a variance-covariance matrix.
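For instance, here’s a quick sketch of pulling a few of those pieces straight out of sumobj (these are all standard components of a summary.lm object):
sumobj$coefficients     #matrix of estimates, standard errors, t values, and p-values
sumobj$r.squared        #the R-squared reported near the bottom of the printout
sumobj$sigma            #the residual standard error
head(sumobj$residuals)  #the same residuals we looked at above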
We’ll use resid() to get our residuals for the problem set, but this just goes to show how much information is stored in that summary object. We will also use it to get R-squared in an efficient fashion.
Let’s practice reading one of these summary objects really quickly before we move on.
sumobj
##
## Call:
## lm(formula = prob_q5_q1 ~ share_black + i_urban + share_divorced +
## share_married + share_middleclass:share_black + share_middleclass -
## 1, data = dataset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.075086 -0.019672 -0.004817 0.013552 0.202358
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## share_black 0.232741 0.063295 3.677 0.000254 ***
## i_urban -0.008592 0.002662 -3.228 0.001304 **
## share_divorced -0.823778 0.065113 -12.652 < 2e-16 ***
## share_married 0.182060 0.022511 8.088 2.66e-15 ***
## share_middleclass 0.160535 0.020026 8.016 4.55e-15 ***
## share_black:share_middleclass -0.779569 0.148158 -5.262 1.90e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.03241 on 703 degrees of freedom
## Multiple R-squared: 0.9119, Adjusted R-squared: 0.9112
## F-statistic: 1213 on 6 and 703 DF, p-value: < 2.2e-16
Ok. Which variables are statistically significant?
How do we determine what percent of our total variation in y is explained by the x’s?
What column represents the test statistic?
Great, now that we’ve had a little review, let’s move on to analyzing heteroskedasticity, which should help with your homework.
Now we need to get a handle on how to check our data for heteroskedasticity. Let’s start with the Goldfeld-Quandt (GQ) test. First, let’s figure out what exactly we need to look for in a heteroskedasticity test. What are we testing for here? We want to be sure that the variance of our errors isn’t changing with our predictors. What’s one way of checking that? We can split the sample by the values of an explanatory variable and see whether the errors change across it.
That’s what the GQ test is trying to do! But we need a process to run this test. You can actually run this test with tools from R you’ve already seen so far. Let’s do that.
First, we need to pick a fraction of our sample to compare. Let’s look at the first and last 1/4 of our sample and then test if they have the same variances.
What we need to do: compare the variances of the first 1/4 and the last 1/4 of the sample, ranked by the explanatory variable of interest.
This looks at the ratio of those variances and sees how far it is from 1. Think about it: if the ratio is 1, then across different values of your variable of interest, your error variances are roughly the same.
Note that Goldfeld-Quandt only lets you look at one variable at a time. Let’s focus on share_black for the moment.
If we wanted to do this by hand, we’d need to follow six steps:
1.) Order your observations by your variable of interest - here, share_black.
2.) Take the first 1/4 and the last 1/4 of the ordered sample.
3.) Run your regression separately on each subsample.
4.) Compute the sum of squared errors (SSE) for each subsample.
5.) Form the ratio of the two SSEs.
6.) Compare that ratio to an F distribution to get a p-value.
To order the data by a variable, all we need to do is sort it using the arrange() command.
dataset <- arrange(dataset, share_black)
head(dataset$share_black, 4)
## [1] 0.0001596 0.0001783 0.0001800 0.0001951
We can see the share_black variable is now in increasing order.
Next we need nrow(), which counts the number of rows in a data frame. We’ll use it to figure out how many observations make up a quarter of the sample.
n_GQ <- as.integer(nrow(dataset) * 1 / 4)
n_GQ #this gives us the whole number closest to 1/4 of our full sample
## [1] 177
#run the regression on the last 1/4 (the largest share_black values) using tail()
lm_g1 <- lm(data = tail(dataset, n_GQ), prob_q5_q1 ~ share_black + i_urban + share_middleclass + share_divorced)
#run the same regression on the first 1/4 (the smallest share_black values) using head()
lm_g2 <- lm(data = head(dataset, n_GQ), prob_q5_q1 ~ share_black + i_urban + share_middleclass + share_divorced)
We need to recover the SSEs for our test, so let’s build those. We’ll use the sum() command, as well as a new resid() command that lets us recover the residuals from a regression. Pretty cool, right?
e_g1 <- resid(lm_g1) #the resid command gets us our models' residuals!
e_g2 <- resid(lm_g2) #g2 for group two
sse_g1 <- sum(e_g1 ^ 2) #now, to get SSE, we need to square the residuals, and then sum them.
sse_g2 <- sum(e_g2 ^ 2)
Let’s peek at our new objects:
(sse_g1)
## [1] 0.0516712
sse_g2
## [1] 0.3679401
The sum of squared errors for group two (the first 1/4, with the smallest share_black values) is MUCH bigger than for group one. What does that really mean, though? Our model is doing a much better job for observations with large values of share_black than for those with small values.
How do we do this? What kind of test is the GQ test? Look it up! You can often find answers through a variety of online tools, and those can be useful to find your way out of a bug as well.
stat_GQ <- (sse_g2 / sse_g1)
stat_GQ
## [1] 7.120796
Great. So did you look up the test type? An F-test!
We can use an F-test to produce a p-value for our calculated statistic.
Ok. So. In general, any distribution you can think of has a set of functions in R. They are prefixed with r, p, q, and d. These correspond to: random draws from that distribution (r), the cumulative probability of seeing a value at or below a given size (p), the quantile (q), and the density (d). Most often you’ll see r and p used, but the others are useful as well.
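For example, here’s a quick sketch of those four prefixes using the F distribution. The degrees of freedom (172 = n_GQ - 4 - 1) are the same ones we use below; the input values are only for illustration.
rf(3, df1 = 172, df2 = 172)     #three random draws from this F distribution
pf(2, df1 = 172, df2 = 172)     #probability of seeing a value of 2 or smaller
qf(0.95, df1 = 172, df2 = 172)  #the value that cuts off the top 5%
df(1, df1 = 172, df2 = 172)     #the density at 1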
Before we solve our question, let’s see what happens if we have IDENTICAL sse for both groups…
p_GQ_eq <- pf(q=100/100, df1 = n_GQ-4-1, df2 = n_GQ-4-1, lower.tail = FALSE)
p_GQ_eq
## [1] 0.5
.5. That is, it’s a (perfectly fair) coin flip, and we definitely can’t reject H0. What about 0? What does that mean?
p_GQ_eq <- pf(q=0, df1 = n_GQ-4-1, df2 = n_GQ-4-1)
p_GQ_eq
## [1] 0
p_GQ <- pf(q = stat_GQ, df1 = n_GQ-4-1, df2 = n_GQ-4-1, lower.tail = F) #pf gives probability from an f-dist.
p_GQ
## [1] 1.766369e-33
So what can we say? I’d challenge you to figure out EXACTLY what our null hypothesis is here. That’s our next step:
The null hypothesis of Goldfeld-Quandt is… H0: the variances of the residuals from the regressions using the first 1/4 and the last 1/4 of the dataset, ordered by share_black, are the same.
The alternative hypothesis of Goldfeld-Quandt is… HA: those variances are different.
If we define sigma_1 as the standard deviation of the errors in the first quartile and sigma_2 as the standard deviation of the errors in the last quartile, then H0: sigma_1^2 = sigma_2^2 and HA: sigma_1^2 != sigma_2^2.
Conclusion, can we reject H0?
Yes! At the 5% significance level. Let’s look at another test we could use: the Breusch-Pagan (BP) test.
Again, we need to follow a pile of steps to run this test - another six, in fact. How do we go about this?
Think about heteroskedasticity like an econometrician: how would you test for it? If you want to see whether your variables predict the size of your errors, why not just run a regression? What would we need to do this?
Well, our first goal is to get the residuals from our regression of interest. Let’s get those.
lm_frst <- lm(data = dataset, prob_q5_q1 ~ share_black + i_urban + share_middleclass + share_divorced)
Let’s use the resid() command again
e_BP <- resid(lm_frst)
lm_BrPa <- lm(data = dataset, I(e_BP ^ 2) ~ share_black + i_urban + share_middleclass + share_divorced)
Now we need to do something a little strange. We’re going to call the r.squared object from within the lm summary object. Why do we need or want R-squared? Well, it’s a great test statistic since it literally is how much of the variation in y is explained by our independent variables.
When have we called objects from within objects before? Yep! Dataframes when calling variables, using the ‘$’. So we use the same formatting.
r2_BP <- summary(lm_BrPa)$r.squared #this is letting us get r squared!
LM_BP <- nrow(dataset) * r2_BP
pchisq(q = LM_BP, df = 4, lower.tail = F) #pchisq gives the upper-tail probability from a chi-squared distribution, which is how we get our BP test p-value.
## [1] 1.098492e-09
If you’re wondering why we used 4 degrees of freedom (df in the function above is 4), I encourage you to look up Breusch-Pagan for more details. The general rule of thumb is that the BP test statistic is n*R^2 from the auxiliary regression, with k degrees of freedom, where k is the number of regressors (here, 4).
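If you want a sanity check on our by-hand calculation, the lmtest package has a canned bptest() function; its default (studentized) version is built around the same n*R^2 statistic, so it should essentially match what we computed. This is purely optional and assumes you have lmtest installed:
p_load(lmtest)  #or install.packages('lmtest') followed by library(lmtest)
bptest(lm_frst) #studentized Breusch-Pagan test on our original regression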
H0: b1 = b2 = b3 = b4 = 0, where b1, b2, b3, b4 are the slope coefficients of the auxiliary regression model:
(e_BP ^ 2 ~ b0 + b1 * share_black + b2 * i_urban + b3 * share_middleclass + b4 * share_divorced)
What do we conclude?
We reject the null hypothesis at the 5% significance level. We have one more test to walk through: the White test!
The White test is very similar to the Breusch-Pagan test, with a few modifications:
When regressing the squared errors in the B-P test, in addition to the original regressors, we add interaction terms and squared terms.
So the first two steps of the White test are identical to those in the Breusch-Pagan test. However, we are going to drop share_divorced to save you guys the time because you’ll get the gist of how to do this with just three variables.
White test steps!
lm_frst <- lm(data = dataset, prob_q5_q1 ~ share_black + i_urban + share_middleclass)
e_WhiteTest <- resid(lm_frst)
lm_White <- lm(data = dataset, I(e_WhiteTest ^ 2) ~ share_black + i_urban + share_middleclass + share_black:i_urban + share_black:share_middleclass +
  i_urban:share_middleclass + I(share_black ^ 2) + I(i_urban ^ 2) + I(share_middleclass ^ 2)) #note we square e_WhiteTest, the residuals we just created
Which of these did I needlessly include?
r2_White <- summary(lm_White)$r.squared
LM_White <- nrow(dataset) * r2_White
pchisq(q = LM_White, df = 9, lower.tail = F) #again, pchisq takes our statistic and returns the upper-tail chi-squared probability
## [1] 1.149353e-06
So what can we say? What is our H0? What is HA? We’ve done enough of these that I think you can answer part 6 on your own.
Now let’s see how to graph our residuals to eyeball heteroskedasticity. Let’s start with the basics: who can remember the structure of a ggplot call?
If you missed lab or you didn’t catch everything (there was a lot), you can find notes on ggplot here.
We build our plot in… LAYERS.
We need an aesthetic, a geom, and a dataset! Let’s do that:
ggplot(data = dataset, aes(x = prob_q5_q1)) +
geom_histogram(bins = 10, fill = 'orange') + #what do you think 'bins' is doing?
theme_minimal()
In terms of our raw distribution, what do we need?
Let’s look at heteroskedasticity, specifically let’s look at share_black
We need to use our old friend, the resid() command, to get this. First, let’s run a simple regression of our mobility measure on share_black.
reg_prac <- lm(prob_q5_q1 ~ share_black, data = dataset)
Now, to recover the residuals, we can use the resid() command.
errs <- resid(reg_prac)
Ok, now we need to plot them against our chosen variable, share_black. Let’s color the points by i_urban, just out of interest.
ggplot(aes(x = share_black, y = errs, colour = i_urban), data = dataset) +
geom_point() +
theme_minimal()
Huh… that’s interesting. R colored this graph as if we wanted a continuous scale from 0 to 1. That’s not ideal. We could fix this by casting i_urban to a logical, which makes a value TRUE if i_urban = 1 and FALSE if i_urban = 0, so ggplot uses a discrete color scale.
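Here’s a quick sketch of that fix - the same plot as above, just wrapping i_urban in as.logical() (a factor() would work too):
ggplot(aes(x = share_black, y = errs, colour = as.logical(i_urban)), data = dataset) +
  geom_point() +
  theme_minimal()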
However, we got the result we wanted - the spread of our errors looks quite different across values of share_black, which is exactly the kind of pattern that suggests heteroskedasticity.
We don’t necessarily KNOW we have heteroskedasticity, we just suspect it. Let’s compare a normal regression to one with heteroskedasticity consistent standard errors.
reg_standard <- lm(data = dataset, prob_q5_q1 ~ share_black - 1)
reg_consistent <- felm(data = dataset, prob_q5_q1 ~ share_black -1) #felm() from the lfe package lets us request heteroskedasticity-robust SEs from summary()
#we can pull individual pieces (like coefficients) out of these objects with the '$' operator if we need them.
res_standard <- summary(reg_standard)
res_consistent <- summary(reg_consistent, robust = TRUE) #add robust = TRUE to get robust SE.
(res_consistent)
##
## Call:
## felm(formula = prob_q5_q1 ~ share_black - 1, data = dataset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.12076 0.04936 0.08115 0.11365 0.35662
##
## Coefficients:
## Estimate Robust s.e t value Pr(>|t|)
## share_black 0.21700 0.01489 14.57 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.104 on 708 degrees of freedom
## Multiple R-squared(full model): -3.691 Adjusted R-squared: -3.698
## Multiple R-squared(proj model): -3.691 Adjusted R-squared: -3.698
## F-statistic(full model, *iid*):-557.1 on 1 and 708 DF, p-value: 1
## F-statistic(proj model): 212.4 on 1 and 708 DF, p-value: < 2.2e-16
res_standard
##
## Call:
## lm(formula = prob_q5_q1 ~ share_black - 1, data = dataset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.12076 0.04936 0.08115 0.11365 0.35662
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## share_black 0.21700 0.02639 8.223 9.48e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.104 on 708 degrees of freedom
## Multiple R-squared: 0.08718, Adjusted R-squared: 0.08589
## F-statistic: 67.62 on 1 and 708 DF, p-value: 9.481e-16
Ok. Who can play Where’s Waldo with these two outputs? What’s different?
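As an aside (not required for your problem set): if you’d rather stay with lm(), one common alternative route to robust standard errors is the sandwich + lmtest combination. This is just a sketch, assuming you have those packages installed; 'HC1' is one common choice of robust variance estimator.
p_load(sandwich, lmtest)
coeftest(reg_standard, vcov = vcovHC(reg_standard, type = 'HC1')) #same coefficients, heteroskedasticity-robust standard errors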
Great, you guys should be able to do anything R-related on your PS02 now, so good luck, and get started early. Don’t be afraid to reach out, but please don’t save all of your questions for Sunday at 11pm. Assume we need time to answer questions and that it might take us a moment to understand what you’re asking.
Have a great week, and good luck with your classes!