Lab 3: PS2 Help

Hi everyone! Welcome to week 3 lab! Today we’re going to go over your problem set and get help. We’re going to be working with ggplot, lm, and some heteroskedastic corrected regressions.

A few reminders about your problem set: First, make sure to put all of your answers in the text document you submit (NOT the R file).

Also, make sure to start this homework early! It’s a really long one with a lot of different parts. The only question that requires R is question 2, so not much coding is required, but we can help with all of the problems. Be sure to reach out.

Problem 2

Part 1: We need to load up our package toolkit. We’ll grab tidyverse and broom, along with pacman. We’re also grabbing lfe to get access to heteroskedastic-consistent standard errors.

library(pacman)

p_load(tidyverse, broom, lfe)

If you don’t have pacman yet, run install.packages('pacman') first.

If for some reason pacman doesn’t work for you, you can run the following lines:

library(tidyverse)
library(broom)
library(lfe)

Our second step, as per usual, is to load our dataset and take a peek at it.

# where 'yourpath' is the file path of your problem set data, surrounded by single or double quotes:
dataset <- read_csv(yourpath)
## Parsed with column specification:
## cols(
##   prob_q5_q1 = col_double(),
##   i_urban = col_double(),
##   share_black = col_double(),
##   share_middleclass = col_double(),
##   share_divorced = col_double(),
##   share_married = col_double()
## )

Ok, who can remember how to look at the first few rows of a df? Last few rows?

#answer:
(head(dataset, 3))
## # A tibble: 3 x 6
##   prob_q5_q1 i_urban share_black share_middlecla… share_divorced
##        <dbl>   <dbl>       <dbl>            <dbl>          <dbl>
## 1     0.0621       1      0.0208            0.548          0.110
## 2     0.0537       1      0.0198            0.538          0.116
## 3     0.0731       0      0.0146            0.467          0.113
## # … with 1 more variable: share_married <dbl>
#answer:
tail(dataset, 3)
## # A tibble: 3 x 6
##   prob_q5_q1 i_urban share_black share_middlecla… share_divorced
##        <dbl>   <dbl>       <dbl>            <dbl>          <dbl>
## 1     0.115        1     0.00638            0.564         0.0989
## 2     0.0866       0     0.00685            0.575         0.124 
## 3     0.109        1     0.0432             0.514         0.116 
## # … with 1 more variable: share_married <dbl>

Ok, who can remember how to get a collection of summary statistics for the df?

summary(dataset)
##    prob_q5_q1         i_urban        share_black        share_middleclass
##  Min.   :0.02210   Min.   :0.0000   Min.   :0.0001596   Min.   :0.2848   
##  1st Qu.:0.06588   1st Qu.:0.0000   1st Qu.:0.0040385   1st Qu.:0.5001   
##  Median :0.08889   Median :0.0000   Median :0.0240551   Median :0.5523   
##  Mean   :0.09761   Mean   :0.4584   Mean   :0.0808788   Mean   :0.5499   
##  3rd Qu.:0.11715   3rd Qu.:1.0000   3rd Qu.:0.0891569   3rd Qu.:0.6082   
##  Max.   :0.35714   Max.   :1.0000   Max.   :0.6583261   Max.   :0.7338   
##  share_divorced    share_married   
##  Min.   :0.03950   Min.   :0.3729  
##  1st Qu.:0.08513   1st Qu.:0.5444  
##  Median :0.09818   Median :0.5781  
##  Mean   :0.09710   Mean   :0.5726  
##  3rd Qu.:0.10861   3rd Qu.:0.6047  
##  Max.   :0.15618   Max.   :0.6947

Nice.

Part 2: More about the summary.lm object

I want to take a quick second here to back up a step and poke around with the regression “summary” object, because it has more useful functions than simply giving us coefficients and stars. The technical name for this object is summary.lm which is where you should look for documentation if you’re trying to check your code.
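If you want to pull that documentation up straight from the R console, you can run:

?summary.lm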

Let’s say we were running a regression, and we wanted to find out if “share_black” had a substantial impact on our mobility statistic, but we also want to control for whether or not our interested parties live in an urban area. How do we do this? First, we need to create a regression object.

# lm() takes a dataset and a formula. The ':' adds an interaction term, and '-1' drops the intercept.
our_first_reg <- lm(data = dataset, prob_q5_q1 ~ share_black + i_urban + share_divorced + share_married + share_middleclass:share_black + share_middleclass -1)

Now we can use summary() to get the coefficients.

summary(our_first_reg)
## 
## Call:
## lm(formula = prob_q5_q1 ~ share_black + i_urban + share_divorced + 
##     share_married + share_middleclass:share_black + share_middleclass - 
##     1, data = dataset)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.075086 -0.019672 -0.004817  0.013552  0.202358 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## share_black                    0.232741   0.063295   3.677 0.000254 ***
## i_urban                       -0.008592   0.002662  -3.228 0.001304 ** 
## share_divorced                -0.823778   0.065113 -12.652  < 2e-16 ***
## share_married                  0.182060   0.022511   8.088 2.66e-15 ***
## share_middleclass              0.160535   0.020026   8.016 4.55e-15 ***
## share_black:share_middleclass -0.779569   0.148158  -5.262 1.90e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.03241 on 703 degrees of freedom
## Multiple R-squared:  0.9119, Adjusted R-squared:  0.9112 
## F-statistic:  1213 on 6 and 703 DF,  p-value: < 2.2e-16

Awesome! But today we’re talking about how to check for heteroskedasticity. What do we need to look at to drill into that? The errors! We can get them from our summary object with the following code:

sumobj <- summary(our_first_reg)

#to look at the first few residuals, use head. If you want all of them drop that part.
head(sumobj$residuals)
##           1           2           3           4           5           6 
## -0.03186904 -0.03650299 -0.01395689 -0.01758164 -0.03039226 -0.01938582

Ok, nice. There are actually a ton of options here; you can see for yourself!

sumobj$...

We have:

  • call, which spits back our command
  • terms, which gives us a bunch of information about how R treats our variables
  • residuals, which returns our residuals
  • coefficients, which returns a matrix of our results
  • aliased, which we’ll skip
  • sigma, the square root of the estimated variance of the random error
  • a couple of R-squared values
  • and a variance-covariance matrix

We’ll use resid() to get our residuals for the problem set, but this just goes to show how much information is stored in that summary object. We will also use it to grab R-squared in an efficient fashion.
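For example, here are a few of those pieces pulled straight out of the sumobj we just made (output not shown; it will depend on your data):

sumobj$coefficients   # the full results matrix: estimates, SEs, t values, p values
sumobj$sigma          # the residual standard error
sumobj$r.squared      # R-squared, which we'll want again for the Breusch-Pagan test
sumobj$adj.r.squared  # the adjusted version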

Let’s practice reading one of these summary objects really quickly before we move on.

sumobj
## 
## Call:
## lm(formula = prob_q5_q1 ~ share_black + i_urban + share_divorced + 
##     share_married + share_middleclass:share_black + share_middleclass - 
##     1, data = dataset)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.075086 -0.019672 -0.004817  0.013552  0.202358 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## share_black                    0.232741   0.063295   3.677 0.000254 ***
## i_urban                       -0.008592   0.002662  -3.228 0.001304 ** 
## share_divorced                -0.823778   0.065113 -12.652  < 2e-16 ***
## share_married                  0.182060   0.022511   8.088 2.66e-15 ***
## share_middleclass              0.160535   0.020026   8.016 4.55e-15 ***
## share_black:share_middleclass -0.779569   0.148158  -5.262 1.90e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.03241 on 703 degrees of freedom
## Multiple R-squared:  0.9119, Adjusted R-squared:  0.9112 
## F-statistic:  1213 on 6 and 703 DF,  p-value: < 2.2e-16

Ok. Which variables are statistically significant?

How do we determine what percent of our total variation in y is explained by the x’s?

What column represents the test statistic?

Great, now that we’ve had a little review, let’s move on to analyzing heteroskedasticity which should help with your hw.

Heteroskedasticity

Now we need to get a handle on how to check our data for heteroskedasticity. Let’s start with the Goldfeld-Quandt test. What exactly are we testing for here? We want to be sure that our error variance isn’t changing with our predictors. Well, what’s one way of checking that? One way is to compare the errors at different values of an explanatory variable of interest and see how they change.

That’s what the GQ test does! But we need a process to run it, and you can actually do so with R tools you’ve already seen. Let’s do that.

First, we need to pick a fraction of our sample to compare. Let’s look at the first and last 1/4 of our sample and then test if they have the same variances.

What we need to do: compare the variances of the residuals in the first 1/4 and the last 1/4 of the data, ranked by the explanatory variable of interest.

The test looks at the ratio of those two variances and sees how far it is from 1. Think about it: if the ratio is 1, then across different values of your variable of interest, your error variance is roughly the same.

Note that Goldfeld-Quandt only allows you to look at one variable at a time. Let’s focus on share_black for the moment.

Goldfeld-Quandt

If we wanted to do this by hand, we’d need to follow some steps. 6 steps in total.

  • 1.) Order your observations by your variable of interest. In this case we’ll use share_black, the variable we said we’d focus on.

To order by a variable, all we need to do is sort the data using the arrange() command.

dataset <- arrange(dataset, share_black)
head(dataset$share_black, 4)
## [1] 0.0001596 0.0001783 0.0001800 0.0001951

We can see the share_black variable is now in increasing order.

  • 2.) Split the data into two groups of appropriate sizes. We’ve chosen one fourth of our dataset as each group’s size. We need to know what 1/4 of our dataset is, and we can find it with nrow(), which counts the number of rows in a df.
n_GQ <- as.integer(nrow(dataset) * 1/ 4)

n_GQ # the (whole-number) count of observations in 1/4 of our full sample
## [1] 177
  • 3.) We forgot something! What are we trying to explain with this variable ‘share_black’? Well, let’s run a regression with all of our variables this time, estimating prob_q5_q1, the likelihood of moving from the bottom income quintile to the top one. Because we ordered our data, we can use head and tail to grab the observations with the smallest and largest values of share_black.
#on the last 1/4
lm_g1 <- lm(data = tail(dataset, n_GQ), prob_q5_q1 ~ share_black + i_urban + share_middleclass + share_divorced) #using the tail command, we are running the reg
#on the first 1/4
lm_g2 <- lm(data = head(dataset, n_GQ), prob_q5_q1 ~ share_black + i_urban + share_middleclass + share_divorced) #using head, we are doing the above on the first ones
  • 4.) Record our sums of squared errors to test

We need to recover the SSEs for our test, so we need to build those. We’ll use the sum() command, along with resid(), which lets us recover the residuals from a regression. Pretty cool, right?

e_g1 <- resid(lm_g1) #the resid command gets us our models' residuals!
e_g2 <- resid(lm_g2) #g2 for group two
sse_g1 <- sum(e_g1 ^ 2) #now, to get SSE, we need to square the residuals, and then sum them.
sse_g2 <- sum(e_g2 ^ 2)

Let’s peek at our new objects:

(sse_g1)
## [1] 0.0516712
sse_g2
## [1] 0.3679401

The sum of squared errors for group two is MUCH bigger than for group one. What does that really mean though? Our model fits the last quarter (the largest values of share_black) much better than the first quarter (the smallest values).

  • 5.) Calculate the G-Q test statistic and compute the p-value.

How do we do this? What kind of test is the GQ test? Look it up! You can often find answers through a variety of online tools, and those can be useful to find your way out of a bug as well.

stat_GQ <- (sse_g2 / sse_g1)
stat_GQ
## [1] 7.120796

Great. So did you look up the test type? An F-test!

We can use an F-test to produce a p-value for our calculated statistic.

Ok. So. In general, any distribution you can think of has a set of functions in R. They are prefixed with r, p, q, and d, which correspond to: random draws from that distribution (r), the cumulative probability of seeing a value at or below a given size (p), the quantile function (q), and the density (d). Most often you’ll see r and p used, but the others are useful as well.
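For example, here is the F-distribution family with some arbitrary degrees of freedom (these numbers are purely illustrative, not part of the problem set):

rf(n = 3, df1 = 10, df2 = 10)                       # r: three random draws from an F(10, 10)
pf(q = 2, df1 = 10, df2 = 10)                       # p: P(F <= 2), the cumulative probability
pf(q = 2, df1 = 10, df2 = 10, lower.tail = FALSE)   # the upper tail instead: P(F > 2)
qf(p = 0.95, df1 = 10, df2 = 10)                    # q: the value with 95% of the mass below it
df(x = 2, df1 = 10, df2 = 10)                       # d: the density at 2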

Before we solve our question, let’s see what happens if we have IDENTICAL sse for both groups…

p_GQ_eq <- pf(q=100/100, df1 = n_GQ-4-1, df2 = n_GQ-4-1, lower.tail = FALSE)
p_GQ_eq
## [1] 0.5

.5. That is, it’s a (perfectly fair) coin flip, and we definitely can’t reject H0. What about 0? What does that mean?

p_GQ_eq <- pf(q=0, df1 = n_GQ-4-1, df2 = n_GQ-4-1)
p_GQ_eq
## [1] 0
Zero. That means there’s no chance these error variances are the same. Think a bit about why that might be. Now let’s compute the p-value for our actual statistic:
p_GQ <- pf(q = stat_GQ, df1 = n_GQ-4-1, df2 = n_GQ-4-1, lower.tail = F) #pf gives probability from an f-dist.
p_GQ
## [1] 1.766369e-33

So what can we say? I’d challenge you to figure out EXACTLY what our null hypothesis is here. That’s our next step:

  • 6.) State the null hypothesis and draw a conclusion.

The Null Hypothesis of Goldfeld-Quandt is….. H0: The variances of the residuals from regressions using the first 1/4 and the last 1/4 of the dataset ordered by share_black are the same.

The Alternative Hypothesis of Goldfeld-Quandt is……….HA: The variances of the residuals from regressions using the first 1/4 and the last 1/4 of the dataset ordered by share_black are different.

If we define sigma_1 as the standard deviation of the errors in the first quarter, and sigma_2 as the standard deviation of the errors in the last quarter…

  • More formally: H0: sigma_1^2 = sigma_2^2, and HA: sigma_1^2 != sigma_2^2

Conclusion, can we reject H0?

Yes! At the 5% significance level. Let’s look at another test we could use.

Breusch-Pagan test:

Again, we need to follow a pile of steps to run this test. In fact, we need another 6. How do we go about this?

Think about heteroskedasticity like an econometrician: how would you test for it? If you want to see whether your variables are related to the size of your errors, why not just run a regression? What would we need to do this?

Well, our first goal is to get the residuals from our regression of interest. Let’s get those.

  • 1.) Regress y on x1 through x4 so we have something to work with
lm_frst <- lm(data = dataset, prob_q5_q1 ~ share_black + i_urban + share_middleclass + share_divorced)
  • 2.) Recover residuals from our original regression

Let’s use the resid() command again

e_BP <- resid(lm_frst) 
  • 3.) Regress our squared errors (e^2) on an intercept and our explanatory variables x1, x2, … xk.
lm_BrPa <- lm(data = dataset, I(e_BP ^ 2) ~ share_black + i_urban + share_middleclass + share_divorced)
  • 4.) Record R^2. This one is a little trickier…

Now we need to do something a little strange. We’re going to call the r.squared object from within the lm summary object. Why do we need or want R-squared? Well, it’s a great test statistic, since it measures exactly how much of the variation in the dependent variable (here, the squared residuals) is explained by our explanatory variables.

When have we called objects from within objects before? Yep! Dataframes when calling variables, using the ‘$’. So we use the same formatting.

r2_BP <- summary(lm_BrPa)$r.squared #this is letting us get r squared!
  • 5.) Compute the Breusch-Pagan statistic (called the Lagrange Multiplier) and the p-value
LM_BP <- nrow(dataset) * r2_BP
pchisq(q = LM_BP, df = 4, lower.tail = F) #this function calculates a chi-squared distribution, which is how we get our bp test probability.
## [1] 1.098492e-09

If you’re wondering why we used 4 degrees of freedom (df in the function above equals 4), I encourage you to look up the Breusch-Pagan test for more details. The general rule of thumb is that BP is a test with statistic n*R^2 and k degrees of freedom, where k is the number of explanatory variables in the auxiliary regression.
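As an aside (not something you need for the problem set): the lmtest package has a canned version of this test you can use to sanity-check the by-hand numbers. This assumes you have lmtest installed; its default, studentized statistic is the same n*R^2 form we just computed, so the results should line up closely.

p_load(lmtest) # not loaded earlier, so this is an extra install/load
bptest(lm_frst) # canned Breusch-Pagan test; compare with our LM_BP statistic and p-value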

  • 6.) State the null hypothesis and draw conclusion. What is H0?

H0: b1 = b2 = b3 = b4 = 0, where b1, b2, b3, b4 are the coefficients of the auxiliary regression model:

(e_BP ^ 2 ~ b0 + b1 * share_black + b2 * i_urban + b3 * share_middleclass + b4 * share_divorced)

What do we conclude?

We reject the null hypothesis at the 5% significance level. We have one more test to walk through: the White test!

White test

The White test is very similar to the Breusch-Pagan test, with a few modifications:

When regressing the squared errors, as in the B-P test, we add interaction terms and squared terms in addition to the original regressors.

So the first two steps of the White test are identical to those in the Breusch-Pagan test. However, we’re going to drop share_divorced to save you some time; you’ll get the gist of how to do this with just three variables.

White test steps!

  • 1.) run our reg
lm_frst <- lm(data = dataset, prob_q5_q1 ~ share_black + i_urban + share_middleclass)
  • 2.) Record residuals from our reg. Also, same as BP
e_WhiteTest <- resid(lm_frst)
  • 3.) Here is where we change things ever so slightly. Rather than regressing e^2 on an intercept, x1, x2, … xk alone, we need to add the interaction and square terms.
lm_White <- lm(data = dataset, I(e_WhiteTest ^ 2) ~ share_black + i_urban + share_middleclass + share_black:i_urban + share_black:share_middleclass + 
                  i_urban:share_middleclass + I(share_black ^ 2) + I(i_urban ^ 2) + I(share_middleclass ^ 2))

Which of these did I needlessly include?

  • 4.) Record R^2. Back to the BP script, but with a new regression model
r2_White <- summary(lm_White)$r.squared
  • 5.) Compute the Breusch-Pagan statistic (the Lagrange Multiplier) using this new R-squared value
LM_White <- nrow(dataset) * r2_White 
pchisq(q = LM_White, df = 9, lower.tail = F) #this is a chi-squared function. Takes value, outputs probability.
## [1] 1.149353e-06

So what can we say? What is our H0? What is HA? We’ve done enough of these that I think you can answer part 6 on your own.

Lesson 3: Graphing for PS02 and consistent SEs

Let’s see how to make the graphs you’ll need for spotting heteroskedasticity, and then how to get heteroskedasticity-consistent standard errors. Let’s start with the basics: who can remember the structure of a ggplot() call?

If you missed lab or didn’t catch everything (there was a lot), you can find notes on it here.

We build our plot in… LAYERS.

We need an aesthetic, a geom, and a dataset! Let’s do that.

ggplot(data = dataset, aes(x = prob_q5_q1)) + 
  geom_histogram(bins = 10, fill = 'orange') + #what do you think 'bins' is doing?
  theme_minimal()

In terms of our raw distribution, what do we need?

Let’s look at heteroskedasticity; specifically, let’s look at share_black.

We need to use our old friend, the resid() command, to get the residuals. Let’s do that. (We’ll drop the intercept later, when we compare standard and robust errors.)

reg_prac <- lm(prob_q5_q1 ~ share_black, data = dataset)

Now, to recover residuals we can use the resid() command.

errs <- resid(reg_prac)

Ok, now we need to plot them against our chosen variable, share_black. Let’s color these by i_urban, just out of interest.

ggplot(aes(x = share_black, y = errs, colour = i_urban), data = dataset) + 
  geom_point() + 
  theme_minimal()

Huh… that’s interesting. R colored this graph as if we wanted a continuous scale from 0-1. That’s not ideal. We could fix this by casting i_urban to a logical. By default, this will make all values of i_urban true if i_urban = 1, and false if i_urban = 0.
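For example, here’s one way to make that cast (just a sketch of the fix; output not shown):

# same plot, but with i_urban cast to a logical so ggplot uses a discrete colour scale
ggplot(aes(x = share_black, y = errs, colour = as.logical(i_urban)), data = dataset) + 
  geom_point() + 
  theme_minimal()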

However, we got the result we wanted: there appears to be a negative relationship between the spread of our errors and share_black, which is exactly the kind of pattern heteroskedasticity produces.

We don’t necessarily KNOW we have heteroskedasticity, we just suspect it. Let’s compare a normal regression to one with heteroskedasticity consistent standard errors.

reg_standard <- lm(data = dataset, prob_q5_q1 ~ share_black - 1)

reg_consistent <- felm(data = dataset, prob_q5_q1 ~ share_black -1) # felm() from the lfe package lets us ask summary() for heteroskedasticity-consistent SEs

#we can call the coefficients only from these objects by using the '$' operator.

res_standard <- summary(reg_standard)

res_consistent <- summary(reg_consistent, robust = TRUE) #add robust = TRUE to get robust SE.
(res_consistent)
## 
## Call:
##    felm(formula = prob_q5_q1 ~ share_black - 1, data = dataset) 
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.12076  0.04936  0.08115  0.11365  0.35662 
## 
## Coefficients:
##             Estimate Robust s.e t value Pr(>|t|)    
## share_black  0.21700    0.01489   14.57   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.104 on 708 degrees of freedom
## Multiple R-squared(full model): -3.691   Adjusted R-squared: -3.698 
## Multiple R-squared(proj model): -3.691   Adjusted R-squared: -3.698 
## F-statistic(full model, *iid*):-557.1 on 1 and 708 DF, p-value: 1 
## F-statistic(proj model): 212.4 on 1 and 708 DF, p-value: < 2.2e-16
res_standard
## 
## Call:
## lm(formula = prob_q5_q1 ~ share_black - 1, data = dataset)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.12076  0.04936  0.08115  0.11365  0.35662 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## share_black  0.21700    0.02639   8.223 9.48e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.104 on 708 degrees of freedom
## Multiple R-squared:  0.08718,    Adjusted R-squared:  0.08589 
## F-statistic: 67.62 on 1 and 708 DF,  p-value: 9.481e-16

Ok. Who can play where’s-Waldo with these two outputs? What’s different?
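One more aside, in case you prefer to stick with lm(): the sandwich and lmtest packages (not loaded above, so treat this as optional) can produce heteroskedasticity-consistent standard errors for a plain lm() fit. The numbers should be in the same ballpark as the felm() output, though the exact small-sample correction can differ.

p_load(sandwich, lmtest) # extra packages, only if you want to try this route
coeftest(reg_standard, vcov. = vcovHC(reg_standard, type = "HC1")) # robust (HC1) SEs for the lm() fit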

Great, you guys should be able to do anything R-related on your PS02 now, so good luck, and get started early. Don’t be afraid to reach out, but please don’t save all of your questions for Sunday at 11pm. Assume we need time to answer questions and that it might take us a moment to understand what you’re asking.

Have a great week, and good luck with your classes!

return to main notes page