We use R, knitr and Markdown.
This document can be read from any device as an HTML-version on http://rpubs.com/lindberg/115410
Note that, thanks to knitr magic, the document is written in the following way. First we talk a bit, as in this paragraph. Then, to answer a question, for example "How do you add 1 and 2?", we write some code and print the output underneath it. So here come the code and the output
1+2
## [1] 3
Note that the [1] indicates that this line starts with the first element of the output. If the output spans more lines, each new line starts with the index of its first element.
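For example, printing a longer vector spreads the output over several lines, and each output line begins with the index of its first element (exactly where the wrap happens depends on the console width):
1:25 #a sketch: the first output line starts with [1]; a later line starts with the index of its first element, e.g. [16]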
Here we load the data into Stata using
use "/Users/jacoblindberg/Dropbox/pathtofile/gpa2.dta"
describe
Running describe in Stata gives us an output where we learn the definitions of the variables.
sat combined SAT score
tothrs total hours through fall semest
colgpa GPA after fall semester
athlete =1 if athlete
verbmath verbal/math SAT score
hsize size grad. class, 100s
hsrank rank in grad. class
hsperc high school percentile, from top
female =1 if female
white =1 if white
black =1 if black
hsizesq hsize^2
We export from Stata to csv
outsheet sat tothrs colgpa athlete verbmath hsize hsrank hsperc female white black hsizesq using ddata2.csv , comma nolabel
Now we have everything we need from Stata, and can say goodbye to it.
Then we import it to R
We use the same dataset as before.
#import ddata2 as gpa2
gpa2 <- read.csv("~/Dropbox/aktuellaKurser/JLAM/econDelad/Assignments/A3/ddata2.csv")
#library() used are
library(MASS) #for robust linear models rlm()
library(gmodels) #for CrossTable()
library(ggplot2) #for ggplot()
library(lmtest) #for bptest()
## Loading required package: zoo
##
## Attaching package: 'zoo'
##
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
m1 <- lm(colgpa ~ tothrs, gpa2)
summary(m1)
##
## Call:
## lm(formula = colgpa ~ tothrs, data = gpa2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.55524 -0.46371 -0.01622 0.45131 1.45229
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.5201069 0.0182575 138.031 <2e-16 ***
## tothrs 0.0025094 0.0002873 8.735 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6527 on 4135 degrees of freedom
## Multiple R-squared: 0.01812, Adjusted R-squared: 0.01788
## F-statistic: 76.31 on 1 and 4135 DF, p-value: < 2.2e-16
For each one-unit increase in tothrs there is a 0.0025094 increase in colgpa (p-value < 1 %). The intercept of 2.5201069 pins down our line of best fit (p-value < 1 %).
m1.res <- residuals(m1)
plot(gpa2$tothrs, m1.res, xlab="tothrs")
abline(0,0)
For tothrs > 30 the spread looks homoskedastic, but for tothrs < 30 we see greater variation, especially some large negative residuals.
This makes sense: the economic interpretation is that for students who have fewer "total hours through fall semest"1, our model underpredicts more often than for students who have more.
This is a common form of heteroskedasticity.
Do note that people seem to report whole numbers like 20 or 50 but rarely 22.
We test for heteroskedasticity using the Breusch-Pagan test.
#H0: errors are homosked
n <- nrow(gpa2)
m2 <- lm(colgpa ~ tothrs + sat, gpa2)
k.bptest <- 2 #number of variables
m2.res <- residuals(m2)
m2.res.sq <- m2.res^2 #step 1
Rsq_m2.res.sq <- summary(lm(m2.res.sq ~ tothrs + sat, gpa2))$r.sq #step2
numerator3 <- Rsq_m2.res.sq / k.bptest
denominator3 <- (1 - Rsq_m2.res.sq) / (n - k.bptest - 1)
Fobs.bptest <- numerator3 / denominator3
Fcrit.bptest <- qf(1-0.05, k.bptest, n - k.bptest - 1) #alpha = 5%
Fobs.bptest > Fcrit.bptest # reject H0 if TRUE
## [1] TRUE
c(Fobs.bptest, Fcrit.bptest) #comparing the two
## [1] 86.873652 2.997904
Our test rejects H0 (errors are homoskedastic), so we conclude that the errors are heteroskedastic.
Let's compare this with the premade function bptest() to check our calculations.
bptest(m2) #a Breusch-Pagan test.
##
## studentized Breusch-Pagan test
##
## data: m2
## BP = 166.86, df = 2, p-value < 2.2e-16
It also rejects H0 since p-value is low.
Since the data are heteroskedastic we want to run a robust regression. We have a large sample (over 4000 observations), so with regard to sample size we should be fine.
Robust regression is done by iterated re-weighted least squares (IRLS).
(Info about R: The command for running robust regression is rlm in the MASS package. There are several weighting functions that can be used for IRLS. We are going to use the Huber weights. We will then look at the final weights created by the IRLS process. See more at http://www.ats.ucla.edu/stat/r/dae/rreg.htm)
#we use rlm() instead of "reg colgpa tothrs, robust" in stata
m1.robust <- rlm(colgpa ~ tothrs, gpa2)
summary(m1.robust)
##
## Call: rlm(formula = colgpa ~ tothrs, data = gpa2)
## Residuals:
## Min 1Q Median 3Q Max
## -2.57006 -0.46256 -0.01603 0.45676 1.43647
##
## Coefficients:
## Value Std. Error t value
## (Intercept) 2.5396 0.0188 135.4417
## tothrs 0.0022 0.0003 7.3842
##
## Residual standard error: 0.6839 on 4135 degrees of freedom
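As mentioned above, we can also look at the final weights created by the IRLS process. A short sketch (the fitted rlm object stores them in its w component):
m1.robust.w <- data.frame(resid = resid(m1.robust), weight = m1.robust$w)
head(m1.robust.w[order(m1.robust.w$weight), ]) #observations with the largest residuals get the smallest Huber weights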
A short comparison between the two can be made either by looking back and forth between summary(m1) and summary(m1.robust), or below.
summary(m1)$coef[2,1] - summary(m1.robust)$coef[2,1] # higher coef for m1
## [1] 0.0003308867
summary(m1)$coef[2,2] - summary(m1.robust)$coef[2,2] # lower std err for m1
## [1] -7.753246e-06
summary(m1)$coef[2,3] - summary(m1.robust)$coef[2,3] # higher t for m1
## [1] 1.351103
Above we have answered "How do your standard errors and test statistics change?". Now to the question "Why?". Well, we use another calculation method and thus we get different results - but in what direction should the results shift? According to lecture 9 slide 13, "Robust standard errors are typically somewhat larger than non-robust standard errors, implying smaller t-statistics. But robust standard errors can be smaller. Usually not a big difference". So it is not clear that applying robust estimation should always change the output in one particular direction, but for our data, m1 compared to m1.robust had a higher coefficient, a lower standard error and a higher t-value.
We see that for our data m1 has the lower standard error and the higher coefficient, which makes its t-statistic higher than the robust version's, since t = coef / std.error.
Stata estimates from its robust regression are pasted below; the estimates are slightly different. (For example, m1.robust differs from Stata on the coefficient - in fact the Stata output is closer to m1 than to m1.robust.)
. reg colgpa tothrs, robust
Linear regression Number of obs = 4137
F( 1, 4135) = 77.08
Prob > F = 0.0000
R-squared = 0.0181
Root MSE = .65272
------------------------------------------------------------------------------
| Robust
colgpa | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
tothrs | .0025094 .0002858 8.78 0.000 .001949 .0030698
_cons | 2.520107 .0207105 121.68 0.000 2.479503 2.560711
------------------------------------------------------------------------------
We regress colgpa on female and black:
mfb <- lm(colgpa ~ female + black, gpa2)
summary(mfb)
##
## Call:
## lm(formula = colgpa ~ female + black, data = gpa2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.61122 -0.43122 -0.00757 0.44842 1.82842
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.61122 0.01375 189.938 < 2e-16 ***
## female 0.14635 0.02023 7.233 5.58e-13 ***
## black -0.43964 0.04402 -9.988 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6472 on 4134 degrees of freedom
## Multiple R-squared: 0.03481, Adjusted R-squared: 0.03434
## F-statistic: 74.54 on 2 and 4134 DF, p-value: < 2.2e-16
Our model is \(colgpa = m + f \cdot female + b \cdot black\)
Possible buckets for the population with respect to the two variables are
| gender color | black | non-black |
|---|---|---|
| male | bm | wm |
| female | bf | wf |
There is a difference by gender and a difference by color.
Our baseline (\(\beta_0\) in the regression) is a non-black male.
| coef activated | black | female | interpretation | number |
|---|---|---|---|---|
| m | 0 | 0 | non-black male | one |
| m+b | 1 | 0 | black male | two |
| m+f | 0 | 1 | non-black female | three |
| m+f+b | 1 | 1 | black female | four |
We construct a table so that we can compare black vs. non-black and female vs. male. Notice that since \(b = (m+b) - m = (m+f+b) - (m+f)\), i.e. b equals the black/non-black difference both among males and among females, the regression makes b the effect of black = 1 while keeping gender constant (as we framed it: ", given male" and ", given female"). In other words, b measures the difference by color, regardless of gender. By the exact same argument, f is the difference by gender, regardless of color.
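As a quick sketch, the four predicted group means implied by the coefficients of mfb (estimated above) are:
cf <- unname(coef(mfb)) #order: intercept m, female coefficient f, black coefficient b
c(nonblack_male = cf[1], black_male = cf[1] + cf[3],
  nonblack_female = cf[1] + cf[2], black_female = cf[1] + cf[2] + cf[3])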
The regression output showed that both female and black are statistically significant, with \(f \approx 0.146\) and \(b \approx -0.440\).
Does the predicted GPA for black females correspond to the mean GPA for black females? Discuss why or why not.
Black females have both female=1 & black=1.
#Predicted value for black females
#get the values from our regression colgpa ~ female + black
mean_colgpa_black_females <- 2.61122 + 1*(0.14635) + 1*(-0.43964)
mean_colgpa_black_females
## [1] 2.31793
#Mean for black females
black_females <- subset(gpa2, gpa2$female==1 & gpa2$black==1)
mean(black_females$colgpa)
## [1] 2.325575
#actual mean - predicted mean
mean(black_females$colgpa) - mean_colgpa_black_females
## [1] 0.007645221
The actual mean is slightly higher, so the model underpredicts. We can understand this by looking at the residual plot, where we see that some residuals have values less than -2 whereas no residual is greater than 2. (See also the residual quantiles in summary(mfb): min(resid) = -2.6 and max(resid) = 1.8.)
plot(resid(mfb)); abline(0,0); abline(-2,0)
The economic reason why the model underpredicts could be that there is an interaction effect between being female and being black. We investigate this below.
We add an interaction term and regress:
m3 <- lm(colgpa ~ female + black + female*black, gpa2)
summary(m3)
##
## Call:
## lm(formula = colgpa ~ female + black + female * black, data = gpa2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.61162 -0.43162 -0.00708 0.44442 1.83586
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.61162 0.01392 187.555 < 2e-16 ***
## female 0.14546 0.02083 6.985 3.32e-12 ***
## black -0.44748 0.06169 -7.253 4.83e-13 ***
## female:black 0.01598 0.08806 0.181 0.856
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6473 on 4133 degrees of freedom
## Multiple R-squared: 0.03481, Adjusted R-squared: 0.03411
## F-statistic: 49.69 on 3 and 4133 DF, p-value: < 2.2e-16
Our model is \(colgpa = m + f \cdot female + b \cdot black + i \cdot female \cdot black\)
Again, the possible buckets are:
| gender color | black | non-black |
|---|---|---|
| male | bm | wm |
| female | bf | wf |
There is a difference by gender and a difference by color.
| coef activated | black | female | interpretation | number |
|---|---|---|---|---|
| m | 0 | 0 | non-black male | one |
| m+b | 1 | 0 | black male | two |
| m+f | 0 | 1 | non-black female | three’ |
| m+f+b+i | 1 | 1 | black female | four’ |
We construct a table so that we can compare black vs. non-black and female vs. male. Notice that this follows the same example as lecture 8 slide 7, where we also have four different categories.
(We could also see that \(i = (b+i) - b\), i.e. the color effect among females minus the color effect among males, and \(i = (f+i) - f\), i.e. the gender effect among blacks minus the gender effect among non-blacks, but we do not use this since we have already interpreted the four different categories.)
Does the predicted GPA for black females correspond to the mean GPA for black females? Discuss why or why not.
#Predicted value for black females
#get the values from our regression m3
mean_colgpa_black_females_2 <- 2.61122 + 1*(0.14546) + 1*(-0.44748) +
1*1*(0.01598)
mean_colgpa_black_females_2
## [1] 2.32518
#Mean for black females
mean(black_females$colgpa)
## [1] 2.325575
#actual mean - predicted mean
mean(black_females$colgpa) - mean_colgpa_black_females_2
## [1] 0.0003952212
c(mean_colgpa_black_females, mean_colgpa_black_females_2) #higher in 2 2 i.e. m3
## [1] 2.31793 2.32518
So the model with interactions (m3, i.e. colgpa ~ female + black + female*black) has a higher predicted mean colgpa for black females than the model without the interaction (colgpa ~ female + black).
Please note that the interaction effect is not statistically significant! We should thus not put much faith in the m3 model.
The actual mean is slightly higher, so the model underpredicts again. Again, we can understand this by looking at the residual plot, where we see that some residuals have values less than -2 whereas no residual is greater than 2. (See also the residual quantiles in summary(m3): min(resid) = -2.6 and max(resid) = 1.8.) The residuals are in fact nearly identical to those of mfb.
plot(resid(m3)); abline(0,0); abline(-2,0)
According to lecture 8 slides 13 and 14 we could test this in two ways. We choose a Chow test.
We want to see if the relationship between colgpa and sat is different for men and women.
Erik's SSR is the sum of squared residuals (Residual SS, i.e. RSS in our terms, which is often also denoted SSE, as in sum of squared errors).
We will now perform a Chow test. Suppose that we model our data as \(y_t=a+bx_{1t} + cx_{2t} + \varepsilon\,\) If we split our data into two groups, then we have \(y_t=a_1+b_1x_{1t} + c_1x_{2t} + \varepsilon. \,\) and \(y_t=a_2+b_2x_{1t} + c_2x_{2t} + \varepsilon. \,\) The null hypothesis of the Chow test is \(H_0: \; a_1 = a_2, b_1=b_2, c_1=c_2,\) and there is the assumption that the model errors \(\varepsilon\) are independent and identically distributed from a normal distribution with unknown variance.
Let \(S_C\) be the sum of squared residuals from the combined data, \(S_1\) be the sum of squared residuals from the first group, and \(S_2\) be the sum of squared residuals from the second group. \(N_1\) and \(N_2\) are the number of observations in each group and k is the total number of parameters (in this case 3 since we have a b and c). Then the Chow test statistic is \[F = \frac{(S_C -(S_1+S_2))/(k)}{(S_1+S_2)/(N_1+N_2-2k)}.\] The test statistic follows the F distribution with \(k\) and \(N_1\) + \(N_2-2k\) degrees of freedom. (These facts can for example be found on https://en.wikipedia.org/wiki/Chow_test)
(We use different notation than lecture 8 slide 14 but our k is his k+1 so he’s got another definition of k.)
k <- 2 #we have two parameters, beta0 and beta1, in colgpa ~ sat.
n_f <- sum(gpa2$female) # number of females
n_m <- nrow(gpa2) - sum(gpa2$female) # number of males
m3 <- rlm(colgpa ~ sat, gpa2)
m3.male <- rlm(colgpa ~ sat, subset(gpa2, gpa2$female==0))
m3.female <- rlm(colgpa ~ sat, subset(gpa2, gpa2$female==1))
m3.res <- m3$residuals
m3.rss <- sum(m3.res^2)
m3.male.res <- m3.male$residuals
m3.male.rss <- sum(m3.male.res^2)
m3.female.res <- m3.female$residuals
m3.female.rss <- sum(m3.female.res^2)
numerator <- (m3.rss - (m3.male.rss + m3.female.rss) ) / k
denominator <- (m3.male.rss + m3.female.rss ) / (n_m + n_f - 2*k)
F_obs <- numerator / denominator
F_crit <- qf(1-0.01, k, n_f + n_m - 2*k) #alpha = 1 %
F_obs > F_crit #reject H0 if true
## [1] TRUE
c(F_obs, F_crit) #Fobs is much larger than Fcrit
## [1] 76.703294 4.610305
Since we reject the null, we believe there is strong evidence that the relationship between sat and colgpa is different for males and females.
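(For reference, a common alternative to the Chow test is to add the female dummy and its interaction with sat and test them jointly; a sketch using ordinary lm() is below. Whether this matches the lecture's second approach is our assumption.)
m.pooled <- lm(colgpa ~ sat, gpa2)
m.interact <- lm(colgpa ~ sat * female, gpa2) #expands to sat + female + sat:female
anova(m.pooled, m.interact) #F-test of the joint significance of female and sat:female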
We want to investigate the relationship between the variables \(colgpa\) and \(sat\). Above we have seen the following facts from each question:
- colgpa ~ tothrs is significant (but the Breusch-Pagan test indicates the errors are heteroskedastic).
- In mfb (colgpa ~ female + black), both female and black are significant.
- In m3, the interaction female:black is not significant.
- The Chow test indicates that the relationship between sat and colgpa differs between males and females.

Regular transformations:
ffs <- lm(colgpa ~ sat, gpa2)
ffss <- lm(colgpa ~ sat^2, gpa2) #note: inside a formula, sat^2 is treated as sat; I(sat^2) would be needed to actually square (this is why ffss has the same adjusted R-squared as ffs below)
ffsl <- lm(colgpa ~ log(sat), gpa2) #be wary of log close to zero
#colgpa ~ sqrt(sat) #oftentimes a bit similar to log
We will use square and log transformations.
Adding dummies, and their interactions with sat, can be done by:
ff4s <- lm(colgpa ~ sat + female, gpa2)
ff5s <- lm(colgpa ~ sat + black, gpa2)
ff6s <- lm(colgpa ~ sat + female + sat*female, gpa2)
ff7s <- lm(colgpa ~ sat + black + sat*black, gpa2)
ff8s <- lm(colgpa ~ sat + black + sat*black + female + sat*female, gpa2)
Note that we do not use, for example, colgpa ~ sat + black + sat*black + female + sat*female + sat*black*female, because we established above that the black*female interaction is not significant.
Adding tothrs
colgpa ~ sat + tothrs can be used.
ff4st <- lm(colgpa ~ sat + tothrs + female, gpa2)
ff5st <- lm(colgpa ~ sat + tothrs + black, gpa2)
ff6st <- lm(colgpa ~ sat + tothrs + female + sat*female, gpa2)
ff7st <- lm(colgpa ~ sat + tothrs + black + sat*black, gpa2)
ff8st <- lm(colgpa ~ sat + tothrs + black + sat*black + female + sat*female, gpa2)
Taking logs and squares
Besides ffs, ffss and ffsl we now also have the models ff4s ff5s ff6s ff7s ff8s and ff4st ff5st ff6st ff7st ff8st.
Now we take the models above and apply log or square.
ff4ss <- lm(colgpa ~ sat^2 + female, gpa2)
ff5ss <- lm(colgpa ~ sat^2 + black, gpa2)
ff6ss <- lm(colgpa ~ sat^2 + female + sat*female, gpa2)
ff7ss <- lm(colgpa ~ sat^2 + black + sat*black, gpa2)
ff8ss <- lm(colgpa ~ sat^2 + black + sat*black + female + sat*female, gpa2)
ff4stl <- lm(colgpa ~ log(sat) + tothrs + female, gpa2)
ff5stl <- lm(colgpa ~ log(sat) + tothrs + black, gpa2)
ff6stl <- lm(colgpa ~ log(sat) + tothrs + female + sat*female, gpa2)
ff7stl <- lm(colgpa ~ log(sat) + tothrs + black + sat*black, gpa2)
ff8stl <- lm(colgpa ~ log(sat) + tothrs + black + sat*black + female + sat*female, gpa2)
Thus we end up with 10 more models.
Note that when interpreting, for example, sat*female, we want both sat and female to be statistically significant - otherwise we do not interpret sat*female.
We have built: ffs ffss ffsl ff4s ff5s ff6s ff7s ff8s ff4st ff5st ff6st ff7st ff8st ff4ss ff5ss ff6ss ff7ss ff8ss ff4stl ff5stl ff6stl ff7stl ff8stl
rsquares <- c(
summary(ffs)$adj.r.sq ,
summary(ffss)$adj.r.sq ,
summary(ffsl)$adj.r.sq ,
summary(ff4s)$adj.r.sq ,
summary(ff5s)$adj.r.sq ,
summary(ff6s)$adj.r.sq ,
summary(ff7s)$adj.r.sq ,
summary(ff8s)$adj.r.sq ,
summary(ff4st)$adj.r.sq , #9
summary(ff5st)$adj.r.sq ,
summary(ff6st)$adj.r.sq , #11
summary(ff7st)$adj.r.sq ,
summary(ff8st)$adj.r.sq , #13
summary(ff4ss)$adj.r.sq ,
summary(ff5ss)$adj.r.sq ,
summary(ff6ss)$adj.r.sq ,
summary(ff7ss)$adj.r.sq ,
summary(ff8ss)$adj.r.sq ,
summary(ff4stl)$adj.r.sq ,
summary(ff5stl)$adj.r.sq ,
summary(ff6stl)$adj.r.sq , #21
summary(ff7stl)$adj.r.sq ,
summary(ff8stl)$adj.r.sq) #23
rsquares
## [1] 0.1668443 0.1668443 0.1588596 0.1962999 0.1698046 0.1962051 0.1699332
## [8] 0.1994494 0.2123381 0.1879000 0.2122810 0.1879559 0.2154367 0.1962999
## [15] 0.1698046 0.1962051 0.1699332 0.1994494 0.2028373 0.1796000 0.2183375
## [22] 0.1942733 0.2233195
rsquares[c(9, 11, 13, 21, 23)]
## [1] 0.2123381 0.2122810 0.2154367 0.2183375 0.2233195
These models are ff4st ff6st ff8st ff6stl ff8stl. We prefer simple models over complex ones, so we drop ff8st and ff8stl because they use more variables. We also prefer residual quantiles that are "symmetric". The three remaining models are compared below.
library(stargazer)
##
## Please cite as:
##
## Hlavac, Marek (2015). stargazer: Well-Formatted Regression and Summary Statistics Tables.
## R package version 5.2. http://CRAN.R-project.org/package=stargazer
stargazer(ff4st, ff6st, ff6stl, type = "html")
| Dependent variable: | |||
| colgpa | |||
| (1) | (2) | (3) | |
| log(sat) | -3.519*** | ||
| (0.612) | |||
| sat | 0.002*** | 0.002*** | 0.006*** |
| (0.0001) | (0.0001) | (0.001) | |
| female:sat | -0.00004 | ||
| (0.0001) | |||
| tothrs | 0.002*** | 0.002*** | 0.002*** |
| (0.0003) | (0.0003) | (0.0003) | |
| female | 0.223*** | 0.339** | 0.272* |
| (0.019) | (0.140) | (0.140) | |
| sat:female | -0.0001 | ||
| (0.0001) | |||
| Constant | 0.311*** | 0.264*** | 21.092*** |
| (0.072) | (0.091) | (3.626) | |
| Observations | 4,137 | 4,137 | 4,137 |
| R2 | 0.213 | 0.213 | 0.219 |
| Adjusted R2 | 0.212 | 0.212 | 0.218 |
| Residual Std. Error | 0.585 (df = 4133) | 0.585 (df = 4132) | 0.582 (df = 4131) |
| F Statistic | 372.661*** (df = 3; 4133) | 279.651*** (df = 4; 4132) | 232.057*** (df = 5; 4131) |
| Note: | *p<0.1; **p<0.05; ***p<0.01 | | |
A quick look at their residual quantiles using summary() shows no real difference there.
The model with the highest F-statistic and the lowest number of variables is ff4st, so we pick this one.
We choose model ff4st; it is summarized below.
summary(ff4st)
##
## Call:
## lm(formula = colgpa ~ sat + tothrs + female, data = gpa2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.88317 -0.38386 0.01403 0.40784 1.79410
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.105e-01 7.150e-02 4.343 1.44e-05 ***
## sat 2.054e-03 6.599e-05 31.129 < 2e-16 ***
## tothrs 2.377e-03 2.575e-04 9.229 < 2e-16 ***
## female 2.228e-01 1.851e-02 12.039 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5845 on 4133 degrees of freedom
## Multiple R-squared: 0.2129, Adjusted R-squared: 0.2123
## F-statistic: 372.7 on 3 and 4133 DF, p-value: < 2.2e-16
For simplicity of notation we let \(\varepsilon = e\).
We know that \(Var(ability) = 0.8\), \(Var(e) = 0.2\) and \(CoV(ability, e) = 0\).
Our model 1 is: \[colgpa = \beta_0 + \beta_1 ability + u \quad (1) \\ CoV(ability, u) = 0\] where \(u\) is the random error term.
This is our first model. But then we make some changes to it. Start by introducing \(ability = sat - e \quad (\gamma)\), i.e. the measured sat score equals true ability plus some random noise \(e\). So our model can be written \(colgpa = \beta_0 + \beta_1 (sat - e) + u \Leftrightarrow colgpa = \beta_0 + \beta_1 sat + v\) where \(v = u - \beta_1 e\).
Let’s further make the trivial assumption that \(E[e]=0\). We then make the crucial assumption - the classical errors in variables assumption (CEV) - that \(CoV(ability, e)=0.\) So to conclude, we have our model 2: \[colgpa = \beta_0 + \beta_1 sat + v \quad (2) \\ CoV(ability, e)=0.\]
From theory we know that this leads to two things: the OLS estimate of \(\beta_1\) is attenuated towards zero (by the reliability ratio), and the standard errors are typically larger.
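For reference, the standard attenuation result, written with the variances given above (\(Var(ability) = 0.8\), \(Var(e) = 0.2\)), is \[\operatorname{plim} \hat\beta_1^{OLS} = \beta_1 \cdot \frac{Var(ability)}{Var(ability) + Var(e)} = \beta_1 \cdot \frac{0.8}{0.8 + 0.2} = 0.8\,\beta_1,\] so the OLS slope is attenuated by the reliability ratio \(r = 0.8\).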
As the last bit in this theory part, let's prove that \(CoV(sat, e) = Var(e)\), which matters because \(Var(e) > 0\), i.e. non-zero. We have that \(CoV(sat, e) = E[sat \cdot e] - E[sat]E[e] = \text{/ } E[e]=0 \text{ /} = E[sat \cdot e] = \\ = E[(ability + e)e] = E[ability \cdot e + e^2] = E[ability \cdot e] + E[e^2] = \\ = \text{/ we assume } CoV(ability, e) = 0 \text{, and together with } E[e]=0 \text{ this implies } E[ability \cdot e] = 0 \text{ /} \\ = E[e^2] = \text{/ subtract } (E[e])^2 = 0 \text{ /} = E[e^2] - (E[e])^2 = Var(e).\)
We run eivreg colgpa sat, r(sat .8) in Stata and compare it to reg colgpa sat. The output is below.
. eivreg colgpa sat, r(sat .8)
assumed Errors-in-variables regression
variable reliability
---------------------------- Number of obs = 4137
sat 0.8000 F( 1, 4135) = 873.03
* 1.0000 Prob > F = 0.0000
R-squared = 0.2088
Root MSE = .58592
------------------------------------------------------------------------------
colgpa | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
sat | .0024138 .0000817 29.55 0.000 .0022537 .002574
_cons | .1656496 .0846635 1.96 0.050 -.0003365 .3316356
------------------------------------------------------------------------------
. reg colgpa sat
Source | SS df MS Number of obs = 4137
-------------+------------------------------ F( 1, 4135) = 829.26
Model | 299.712725 1 299.712725 Prob > F = 0.0000
Residual | 1494.48295 4135 .36142272 R-squared = 0.1670
-------------+------------------------------ Adj R-squared = 0.1668
Total | 1794.19567 4136 .433799728 Root MSE = .60118
------------------------------------------------------------------------------
colgpa | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
sat | .0019311 .0000671 28.80 0.000 .0017996 .0020625
_cons | .6630568 .0697213 9.51 0.000 .5263656 .799748
The results for the eivreg: the coefficient on sat is .0024138 with standard error .0000817.
Compared to the usual reg: the coefficient on sat is .0019311 with standard error .0000671.
Let us compare the theory to the output we have, with regards to the two claimed results.
The standard error for the sat coefficient went from .0000671 in reg to .0000817 in eivreg; just as theory predicted, the standard error is higher.
In our case the reliability ratio is \(r=0.8\). So (the coef from model 2) * (0.8) = (coef from model 1), or stated differently: (the coef from eivreg) * (0.8) = (coef from reg). This is observed in the Stata output, where the coef from eivreg is .0024138 and the coef from reg is .0019311, and indeed \(.0024138 \times 0.8 \approx .0019310\).
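A one-line arithmetic check of this:
.0024138 * 0.8 #approximately .0019310, which matches the reg coefficient .0019311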
(As a final check, if we did not know the \(r\) value we could still do a reality check: we have \(\beta_{CEV}^{sat} = .0024138\) and \(\beta_{OLS}^{sat} = .0019311\), and given that \(r \leq 1\), if the coefficient is positive we should observe \(\beta_{OLS}^{sat} < \beta_{CEV}^{sat}\); for us this is indeed the case since \(.0019311 < .0024138\).)
Do note that we compared the eivreg output from Stata with the reg output from Stata just to make it easy on the eyes - we could have chosen to compare eivreg to the R output as well. A quick look at anova(lm(colgpa ~ sat, gpa2)) shows that it has the same F-value, and from summary(lm(colgpa ~ sat, gpa2)) we see the same coefficients and t-statistics, only printed with a different number of digits.
summary(lm(colgpa ~ sat, gpa2))$coef #coef in R regression
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.663056784 6.972127e-02 9.510108 3.134626e-21
## sat 0.001931058 6.705796e-05 28.796842 2.322657e-166
# std error of sat in stata is 0.0000671
0.0000671 - summary(lm(colgpa ~ sat, gpa2))$coef[2,2]
## [1] 4.20383e-08
The difference is tiny: 6.705796e-05 ≈ 6.71e-05 = 0.0000671, so R simply prints more digits.
We have a measurement error in the dependent variable colgpa.
People want to impress, so they are more likely to round up their estimates. A true 3.75 might be reported as a 4.0 GPA and a true 4.3 as a 4.5 GPA. Let’s call this bragging.
Denote observed gpa as \(gpa\). Denote true gpa as \(gpa^*\). If people brag about their GPA, then we expect that \(gpa^* \leq gpa\) for almost all students and that on average \(gpa^* < gpa\). If we define \(e_0 = gpa - gpa^*\) then \(E[e_0] > 0\).
How would this affect regression estimates? For simplicity of notation, we keep the discussion to the regression colgpa ~ sat.
We know from lecture 10 slide 5 that if we assume \(CoV(sat, e_0) = 0\) then OLS is consistent. But since \(CoV(sat, e_0) = E[sat \cdot e_0] - E[sat]E[e_0]\), this assumption is really two assumptions in one, and we split it into two parts: (i) \(E[e_0] = 0\) and (ii) \(sat\) and \(e_0\) are uncorrelated.
The first assumption is the trivial one: even if it fails, we would only need to re-interpret our estimates in a very quick way, as shown in lecture 2 slides 10 to 14 (especially slide 12).
The second assumption is the important one, since it is what is used in the derivation of the OLS estimators.
In our regression colgpa ~ sat we still assume that sat and \(e_0\) are uncorrelated (so the important assumption holds), but we allow the trivial assumption to be violated; in fact this violation is exactly what the question describes, since bragging is going on. The fact that the important assumption holds keeps the OLS estimates consistent (just as lecture 10 slide 5 says), but the violation of the trivial assumption makes us re-interpret the intercept \(\beta_0\) (just as lecture 2 slide 12 tells us). The regression of the usual form is \(gpa^* = \beta_0 + \beta_1 sat + u\) where \(E[u]=0\). With bragging we instead observe \(gpa = gpa^* + e_0 = \beta_0 + \beta_1 sat + u + e_0\). Since \(E[e_0] > 0\), the intercept estimated from the observed (bragged) data absorbs \(E[e_0]\) and is therefore larger than the true \(\beta_0\): the regression colgpa ~ sat overstates the intercept. But we rarely interpret the intercept coefficient anyway.
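A small hypothetical simulation sketch (not part of the original analysis; all numbers below are made up) illustrates this: the slope stays consistent while the intercept absorbs the average bragging.
set.seed(1)
n.sim <- 5000
sat.sim <- rnorm(n.sim, mean = 1000, sd = 100)
gpa.true <- 1 + 0.002 * sat.sim + rnorm(n.sim, 0, 0.5) #hypothetical true model
e0 <- runif(n.sim, 0, 0.2) #bragging: positive reporting error with E[e0] = 0.1
gpa.obs <- gpa.true + e0
coef(lm(gpa.true ~ sat.sim)) #intercept near 1, slope near 0.002
coef(lm(gpa.obs ~ sat.sim)) #slope still near 0.002, intercept shifted up by roughly E[e0]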
What could be the determinants of a student becoming an athlete in college?
A logistic regression would be a natural fit for this setting, but we will use the linear probability model since that is what is asked for.
Let x \(= x_1, ..., x_k\). Athlete is a dummy variable that only takes on 0 or 1. Since E[athlete | x] = P[athlete = 1 | x] we can use the linear probability model.
What should we put in x? Note that most students who are athletes in college were probably athletes before they went to college. This is important background information for our analysis. Furthermore, many colleges have lower high school GPA requirements for athletes (as reflected by the variables hsperc and hsrank), since being an athlete can put you in a different bucket where you compete with fewer students, so a lower grade is required.
What is a criterion for a "good" model? There are many criteria, for example adjusted \(R^2\), out-of-sample prediction performance, and model simplicity.
How do we do this in practice? We might put 30 % of the observations into a "training set", set up a stepwise selection process on it, and then validate the chosen model against the remaining 70 % of the data (the "testing set").
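A minimal sketch of what that could look like (hypothetical; we do not actually run this selection, and the variable set below is just an illustration):
set.seed(42)
train.idx <- sample(nrow(gpa2), size = round(0.3 * nrow(gpa2))) #30 % training set
train <- gpa2[train.idx, ]
test <- gpa2[-train.idx, ]
full <- lm(athlete ~ female + black + white + colgpa + sat + tothrs + hsperc + hsrank + hsize, train)
chosen <- step(full, direction = "backward", trace = 0) #stepwise (backward) selection on the training data
mean((predict(chosen, newdata = test) - test$athlete)^2) #out-of-sample mean squared error on the testing set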
We will, however, first reason through which variables our model should contain - based on "economist intuition" and a few checks on the data - and then create a model and evaluate it.
Formulate a linear probability model to answer this question using the variables available in the dataset, as well as suitable transformations of those variables.
The possible variables are
sat combined SAT score
tothrs total hours through fall semest
colgpa GPA after fall semester
athlete =1 if athlete
verbmath verbal/math SAT score
hsize size grad. class, 100s
hsrank rank in grad. class
hsperc high school percentile, from top
female =1 if female
white =1 if white
black =1 if black
hsizesq hsize^2
Let’s see how hsize, hsrank, hsperc look.
range(gpa2$hsize)
## [1] 0.03 9.40
range(gpa2$hsrank)
## [1] 1 634
range(gpa2$hsperc)
## [1] 0.1666667 92.0000000
gpa2[1:9, c(6,7,8)]
## hsize hsrank hsperc
## 1 0.10 4 40.00000
## 2 9.40 191 20.31915
## 3 1.19 42 35.29412
## 4 5.71 252 44.13310
## 5 2.14 86 40.18692
## 6 2.68 41 15.29851
## 7 3.11 161 51.76849
## 8 2.68 101 37.68657
## 9 3.67 161 43.86921
Now we understand these variables better, which is important for model selection. It seems they measure class size (hsize) and students' performance relative to the class (hsrank) or relative to the school (hsperc).
Let us get to the model by thinking about (and guessing) which variables should be important.
The number of athletes in our dataset is important because if this number is too small, say 9 athletes, then it is very hard for us to make any generalizations or predictions.
sum(gpa2$athlete==1)
## [1] 194
Our guess would be to include gender (female), race (black or white), academic performance (colgpa or sat), hsperc, and possibly tothrs.
Let us investigate these guesses with some tables, which are facts from the sample we have (and we try to use this sample to say something about the general population).
CrossTable(gpa2$athlete, gpa2$female)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 4137
##
##
## | gpa2$female
## gpa2$athlete | 0 | 1 | Row Total |
## -------------|-----------|-----------|-----------|
## 0 | 2128 | 1815 | 3943 |
## | 0.821 | 1.006 | |
## | 0.540 | 0.460 | 0.953 |
## | 0.935 | 0.976 | |
## | 0.514 | 0.439 | |
## -------------|-----------|-----------|-----------|
## 1 | 149 | 45 | 194 |
## | 16.696 | 20.439 | |
## | 0.768 | 0.232 | 0.047 |
## | 0.065 | 0.024 | |
## | 0.036 | 0.011 | |
## -------------|-----------|-----------|-----------|
## Column Total | 2277 | 1860 | 4137 |
## | 0.550 | 0.450 | |
## -------------|-----------|-----------|-----------|
##
##
149/2277 #fraction of male that are athletes
## [1] 0.06543698
45/1860 #fraction of females that are athletes
## [1] 0.02419355
149/194 #of those who are athletes, approx 3/4 are male.
## [1] 0.7680412
Conclusion: An athlete is more likely to be male, since about 3/4 of the athletes are male.
Conclusion: Among males, being an athlete is more common than among females.
So we include the variable female.
#ftable(xtabs(~athlete + black, gpa2))
CrossTable(gpa2$athlete, gpa2$black)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 4137
##
##
## | gpa2$black
## gpa2$athlete | 0 | 1 | Row Total |
## -------------|-----------|-----------|-----------|
## 0 | 3758 | 185 | 3943 |
## | 0.297 | 5.069 | |
## | 0.953 | 0.047 | 0.953 |
## | 0.962 | 0.808 | |
## | 0.908 | 0.045 | |
## -------------|-----------|-----------|-----------|
## 1 | 150 | 44 | 194 |
## | 6.037 | 103.021 | |
## | 0.773 | 0.227 | 0.047 |
## | 0.038 | 0.192 | |
## | 0.036 | 0.011 | |
## -------------|-----------|-----------|-----------|
## Column Total | 3908 | 229 | 4137 |
## | 0.945 | 0.055 | |
## -------------|-----------|-----------|-----------|
##
##
44/229 #fraction of athletes among blacks
## [1] 0.1921397
150/3908 #fraction of athletes among non-blacks
## [1] 0.0383828
Conclusion: Among blacks, being an athlete is more common than among non-blacks.
#ftable(xtabs(~athlete + white, gpa2))
CrossTable(gpa2$athlete, gpa2$white)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 4137
##
##
## | gpa2$white
## gpa2$athlete | 0 | 1 | Row Total |
## -------------|-----------|-----------|-----------|
## 0 | 255 | 3688 | 3943 |
## | 5.064 | 0.407 | |
## | 0.065 | 0.935 | 0.953 |
## | 0.828 | 0.963 | |
## | 0.062 | 0.891 | |
## -------------|-----------|-----------|-----------|
## 1 | 53 | 141 | 194 |
## | 102.928 | 8.279 | |
## | 0.273 | 0.727 | 0.047 |
## | 0.172 | 0.037 | |
## | 0.013 | 0.034 | |
## -------------|-----------|-----------|-----------|
## Column Total | 308 | 3829 | 4137 |
## | 0.074 | 0.926 | |
## -------------|-----------|-----------|-----------|
##
##
141/3829 #fraction of athletes among whites
## [1] 0.03682424
53/308 #fraction of athletes among non-whites
## [1] 0.1720779
Conclusion: Among whites, being an athlete is less common than among non-whites.
#but is black and white measuring the same thing?
#turns out - it almost does.
nrow(subset(gpa2, gpa2$black==0 & gpa2$white==0)) #neither black nor white
## [1] 79
# this group is very small in our dataset of 4000 students
nrow(subset(gpa2, gpa2$black==0 & gpa2$white==0)) / nrow(gpa2)
## [1] 0.01909596
Conclusion: Here we see that black and white are almost measuring the same thing. For practical purposes, we will therefore not include both black and white in our model, but instead stick to only black. (We could have chosen only white, of course.)
In summary, we include the variable black but not the variable white.
So far, our view on tothrs is based purely on beliefs, not on tabulated data. If tothrs is important for becoming an athlete, the distribution of tothrs should not look the same for athletes and non-athletes.
ggplot(gpa2, aes(tothrs)) +
geom_histogram(fill = "white", color = "black") +
facet_wrap(~athlete, scales = "free_y", ncol=1) +
theme_bw() +
ggtitle('Similarities:
Distribution for Non-athletes (top) vs Distribution for athletes (bottom)' )
Indeed, our eyes cannot distinguish a difference between the distributions; they look pretty much the same.
For this reason, we do not include the variable tothrs.
We might want to include a measure of how good the student is at school. Should we include only colgpa, only sat, or both? They might measure the same thing, namely academic ability. Let's plot.
ggplot(gpa2, aes(colgpa)) +
geom_histogram(fill = "white", color = "black") +
facet_wrap(~athlete, scales = "free_y", ncol=1) +
theme_bw() +
ggtitle("colgpa distribution")
Clearly, there is a difference. And by the same argument as before, this might be reason to include colgpa. We also see that it looks as if athletes get lower grades.
ggplot(gpa2, aes(sat)) +
geom_histogram(fill = "white", color = "black") +
facet_wrap(~athlete, scales = "free_y", ncol=1) +
theme_bw() +
ggtitle("sat distribution")
cor(gpa2$sat, gpa2$colgpa) # correlation
## [1] 0.4087123
The difference between athlete=0 and athlete=1 is not as clear for sat as for colgpa. Also, looking at the counts on the y-axis (by eyeballing the plots), the counts are around 20 in the colgpa histogram but only around 10 in the sat histogram, so we do not believe there is enough data to claim that the sat distributions look different, whereas we do believe there is enough data to say that the colgpa distributions look different.
Furthermore, the correlation of \(0.41\) indicates that including both might introduce some multicollinearity (although not a huge problem, since cut-off points are usually set higher; around 0.70 is one rule of thumb according to Inger Persson at Uppsala Statistiska Institution). Because they are correlated, and because the second plot did not show much of a difference, we exclude sat from our model.
To summarize, we include colgpa but not sat.
As we argued earlier, the application and selection process makes us believe hsperc is an important variable to include. Athletes compete on slightly different high school merits than non-athletes do.
We see no reason to include more variables. We already have the ones we think are important. Furthermore, adding more variables would make the model more complex which is not in line with our goals of creating a simple model.
We did not use sat for the reasons given above, and we do not use verbmath since we think sat is a better variable than verbmath - and sat was not good enough to include.
We did include hsperc. But hsize and hsrank are also related to high school, which we argued earlier is important for finding athletes among college students due to the selection process. So why not include hsize and hsrank? We do not include hsize because we see no economic reason to. We do not include hsrank since it almost measures the same thing as hsperc (they have a fairly large correlation, cor(gpa2$hsrank, gpa2$hsperc) = 0.61), so we do not want both of them; and since we think hsperc has an easier interpretation, we choose hsperc.
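(The correlation quoted above can be checked with one line:)
cor(gpa2$hsrank, gpa2$hsperc) #reported above as approximately 0.61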
The variables we have chosen are female, black, colgpa and hsperc. They measure gender, race, grades in college and grades in high school. We think they make sense for distinguishing athletes from non-athletes.
am1 <- lm(athlete ~ female + black + colgpa + hsperc, gpa2)
summary(am1)
##
## Call:
## lm(formula = athlete ~ female + black + colgpa + hsperc, data = gpa2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.34850 -0.06539 -0.02596 -0.00155 1.03138
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.0347003 0.0171984 -2.018 0.0437 *
## female -0.0309739 0.0064241 -4.821 1.48e-06 ***
## black 0.1699999 0.0140753 12.078 < 2e-16 ***
## colgpa 0.0121914 0.0054093 2.254 0.0243 *
## hsperc 0.0027951 0.0002138 13.075 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.203 on 4132 degrees of freedom
## Multiple R-squared: 0.07904, Adjusted R-squared: 0.07815
## F-statistic: 88.66 on 4 and 4132 DF, p-value: < 2.2e-16
Comment on the expected signs of the coefficients.
Interpretation of the coefficient signs (positive or negative): female is negative and black is positive, in line with the cross tables above; hsperc is positive, in line with the selection argument; colgpa is positive, which is perhaps surprising given that the histograms suggested athletes get somewhat lower grades.
Ordering variables by their coefficient magnitude (largest first): black (0.170), female (0.031 in absolute value), colgpa (0.012), hsperc (0.003).
This comparison shows us which variables are the most relevant in practical terms (there is a difference between statistical significance and practical relevance - if a coefficient is 0.0000001 the effect is negligible).
Ordering variables by their statistical significance (largest absolute t-value first): hsperc (t = 13.1), black (12.1), female (-4.8), colgpa (2.3).
This comparison shows us which variables we are the most sure of: we are really sure that hsperc and black affect the probability of athlete=1, we are not as sure about female, and even less sure about colgpa. However, they are all statistically significant (the first three at alpha = 0.001 and the fourth at alpha = 0.05), so this comparison is not super relevant.
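A short sketch of how these orderings can be read off the fitted model:
cf <- summary(am1)$coef[-1, ] #coefficient matrix without the intercept row
cf[order(abs(cf[, "Estimate"]), decreasing = TRUE), "Estimate"] #ordered by coefficient magnitude
cf[order(abs(cf[, "t value"]), decreasing = TRUE), "t value"] #ordered by absolute t-statistic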
estimate the model
Our estimated model is
\[athlete = - 0.0347003 - 0.0309739 female + 0.1699999 black + 0.0121914 colgpa + 0.0027951 hsperc.\]
and interpret the results.
If someone is female, the predicted probability of being an athlete is 0.0309739 lower. If someone is black, it is 0.1699999 higher. For each one-point increase in colgpa (remember it runs on roughly a 0-4 scale), it is 0.0121914 higher. For each one-unit increase in hsperc, it is 0.0027951 higher.
Remember that we are modelling a probability.
By definition a probability lies in the interval [0,1]. Let us plug in the variable values that maximize the prediction and see if the model predicts a probability larger than 1, and the values that minimize it to see if it predicts a probability smaller than 0.
am1.coef <- coefficients(am1)
maximize <-
am1.coef[1] + #intercept
am1.coef[2]*0 + #female
am1.coef[3]*1 + #black
am1.coef[4]*max(gpa2$colgpa) +
am1.coef[5]*max(gpa2$hsperc)
minimize <-
am1.coef[1] + #intercept
am1.coef[2]*1 + #female
am1.coef[3]*0 + #black
am1.coef[4]*min(gpa2$colgpa) +
am1.coef[5]*min(gpa2$hsperc)
c(minimize > 0, maximize < 1) # want to return true true
## (Intercept) (Intercept)
## FALSE TRUE
c(minimize, maximize)
## (Intercept) (Intercept)
## -0.06520837 0.44121610
Since \(minimize = -0.065\), the model has a small problem that should be fixed; it can be fixed with a transformation, for example logistic regression, which forces \(0 < \hat y < 1\). We skip this.
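For reference, a minimal sketch of the logistic alternative mentioned above (we do not pursue it further):
am1.logit <- glm(athlete ~ female + black + colgpa + hsperc, family = binomial, data = gpa2)
range(fitted(am1.logit)) #the fitted probabilities are forced into (0, 1)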
When we have heteroskedasticity, \(\hat \beta_j\) is still unbiased but the usual estimator of \(Var(\hat \beta_j)\) is biased, and this leads to problems in hypothesis testing. One solution is to use robust standard errors.
If the errors are not homoskedastic we will use robust standard errors. (In this case WLS could also be used, but robust is simpler.)
Are the errors homoskedastic or heteroskedastic?
model5.2 <- lm(colgpa ~ female + black + hsperc, gpa2)
bptest( model5.2 ) #test H0: data i homosked
##
## studentized Breusch-Pagan test
##
## data: model5.2
## BP = 12.125, df = 3, p-value = 0.006967
We reject H0 and thus believe the errors are heteroskedastic.
Since we used lm() vs rlm() in a previous question we here use the Stata commands reg athlete colgpa female black hsperc, robust and reg athlete colgpa female black hsperc to get the output below.
. reg athlete colgpa female black hsperc, robust
Linear regression Number of obs = 4137
F( 4, 4132) = 31.00
Prob > F = 0.0000
R-squared = 0.0790
Root MSE = .20301
-----------------------------------------------------------------------------
| Robust
athlete | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+---------------------------------------------------------------
colgpa | .0121914 .0049482 2.46 0.014 .0024902 .0218926
female | -.0309739 .0058932 -5.26 0.000 -.0425278 -.01942
black | .1699999 .0248847 6.83 0.000 .1212125 .2187872
hsperc | .0027951 .0003168 8.82 0.000 .002174 .0034162
_cons | -.0347003 .0171553 -2.02 0.043 -.0683339 -.0010667
-----------------------------------------------------------------------------
. reg athlete colgpa female black hsperc
Source | SS df MS Number of obs = 4137
-------------+------------------------------ F( 4, 4132) = 88.66
Model | 14.6154553 4 3.65386382 Prob > F = 0.0000
Residual | 170.287131 4132 .041211794 R-squared = 0.0790
-------------+------------------------------ Adj R-squared = 0.0782
Total | 184.902586 4136 .044705654 Root MSE = .20301
----------------------------------------------------------------------------
athlete | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+--------------------------------------------------------------
colgpa | .0121914 .0054093 2.25 0.024 .0015863 .0227966
female | -.0309739 .0064241 -4.82 0.000 -.0435687 -.0183791
black | .1699999 .0140753 12.08 0.000 .1424047 .197595
hsperc | .0027951 .0002138 13.07 0.000 .002376 .0032142
_cons | -.0347003 .0171984 -2.02 0.044 -.0684184 -.0009822
-----------------------------------------------------------------------------
Notably, the F statistic is 31 (robust) vs 88 (non-robust), so we are not as sure that our model is a good fit to the data when using robust standard errors. We also note a big difference in the t-statistics for black and hsperc between the robust and non-robust versions.
By using robust standard errors we are more cautious about our data, and therefore we usually find that "Robust standard errors are typically somewhat larger than non-robust standard errors, implying smaller t-statistics. But robust standard errors can be smaller. Usually not a big difference". Compare the two outputs to see how they differ; a quick comparison is done below.
| Variable | Robust Std. Err. | | Std. Err. |
|---|---|---|---|
| colgpa | .0049482 | < | .0054093 |
| female | .0058932 | < | .0064241 |
| black | .0248847 | > | .0140753 |
| hsperc | .0003168 | > | .0002138 |
| _cons | .0171553 | < | .0171984 |
So it is not clear that applying robust estimation should always change the output in one particular direction; for our data, the result is summarized above.
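(For completeness, heteroskedasticity-robust standard errors can also be computed directly in R; a sketch assuming the sandwich package is installed, which we have not loaded above:)
library(sandwich) #assumed installed; provides vcovHC()
coeftest(am1, vcov = vcovHC(am1, type = "HC1")) #HC1 matches Stata's default robust option; compare with summary(am1)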
Even though the difference is not huge, robust standard errors should be used because they are better justified in theory.
The variable tothrs measures "total hours through fall semest".↩
If we use the quicker notation \(x_1 = x_1^* + e\), the PV case can be described as assuming \(Cov(x_1, e)=0\) (PV) rather than assuming \(Cov(x_1^*, e)=0\) (CEV), as we do in this assignment. Lindqvist explicitly derives the result of higher standard errors only in the PV case, so let us reason through why this must also hold in the CEV case. The formulas tell us that in a SLR \(Var( \hat \beta_1 ) = Var(v) / SST_x\) and \(Var(v) = Var(u - \beta_1 e) = Var(u) + \beta_1^2 Var(e) - 2 \beta_1 CoV(u,e)\); since on lecture 10 slide 5 the assumption that \(e\) and \(u\) are independent was made, we make the same assumption, which gives \(Var(v) = Var(u) + \beta_1^2 Var(e) > Var(u)\), and hence \(Var( \hat \beta_1 )\) is larger. (Do note that we make the assumption \(Cov(u,e) = 0\), although we could arrive at the same conclusion under the slightly weaker assumption that \(\beta_1^2 Var(e) - 2 \beta_1 CoV(u,e) > 0\).)↩
This follows if we look at the equation above containing \(\operatorname{plim}\) and remind ourselves that the model 1 coefficient is unbiased.↩