Note on Stata and R

We use R, knitr and Markdown.

This document can be read from any device as an HTML version at http://rpubs.com/lindberg/115410

Note that the document, thanks to knitr magic, is written in the following way. First we talk a bit, like in this paragraph. Then, to answer a question, for example “How do you add 1 and 2?”, we write some code and print the output underneath it. So here come the code and the output:

1+2
## [1] 3

Note that the [1] indicates the index of the first element on that output line. If the output spans more lines, each additional line starts with the index of its first element.
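
For example, printing a vector long enough to wrap over several lines shows this indexing (a small illustration; where exactly the line wraps depends on your console width):

seq(2, 120, by = 2) #each printed line starts with the index of its first element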

Stata to R

Here we

  1. Load the data into Stata.
  2. Get the variable descriptions.
  3. Export to csv.
  4. Import the csv into R.

We load the data into Stata using

use "/Users/jacoblindberg/Dropbox/pathtofile/gpa2.dta"
describe

Running describe in Stata gives us an output where we learn the definitions of the variables.

sat       combined SAT score
tothrs    total hours through fall semest
colgpa    GPA after fall semester
athlete   =1 if athlete
verbmath  verbal/math SAT score
hsize     size grad. class, 100s
hsrank    rank in grad. class
hsperc    high school percentile, from top
female    =1 if female
white     =1 if white
black     =1 if black
hsizesq   hsize^2

We export from Stata to csv

outsheet sat tothrs colgpa athlete verbmath hsize hsrank hsperc female white black hsizesq using ddata2.csv , comma nolabel

Now we have everything we need from Stata, and can say goodbye to it.
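
As an aside, the csv round trip is optional. If the haven package is installed, the .dta file can be read directly into R; a minimal sketch using the same path as above:

#optional alternative to the csv export: read the Stata file directly
library(haven)
gpa2_direct <- read_dta("/Users/jacoblindberg/Dropbox/pathtofile/gpa2.dta")

We keep the csv route below, since it matches the Stata workflow of the assignment.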

Then we import it to R

We use the same dataset as before.

#import ddata2 as gpa2
gpa2 <- read.csv("~/Dropbox/aktuellaKurser/JLAM/econDelad/Assignments/A3/ddata2.csv")
#library() used are
library(MASS)   #for robust linear models rlm()
library(gmodels) #for CrossTable()
library(ggplot2) #for ggplot()
library(lmtest) #for bptest()
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## 
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

1 Heteroskedasticity

1.1)

m1 <- lm(colgpa ~ tothrs, gpa2)
summary(m1)
## 
## Call:
## lm(formula = colgpa ~ tothrs, data = gpa2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.55524 -0.46371 -0.01622  0.45131  1.45229 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2.5201069  0.0182575 138.031   <2e-16 ***
## tothrs      0.0025094  0.0002873   8.735   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6527 on 4135 degrees of freedom
## Multiple R-squared:  0.01812,    Adjusted R-squared:  0.01788 
## F-statistic: 76.31 on 1 and 4135 DF,  p-value: < 2.2e-16

For each one-unit increase in tothrs there is a 0.0025094 increase in colgpa (p-value < 1 %). The intercept of 2.5201069 (p-value < 1 %) pins down the line of best fit.

1.2)

m1.res <- residuals(m1)
plot(gpa2$tothrs, m1.res, xlab="tothrs")
abline(0,0)

For tothrs > 30 the residuals look homoskedastic, but for tothrs < 30 we see greater variation, especially some large negative values.

This makes sense: the economic interpretation is that for students with fewer “total hours through fall semest”,1 our model underpredicts more often than for students with more hours.

This is a common form of heteroskedasticity.

Do note that people seem to report whole numbers like 20 or 50 but rarely 22.

1.3)

We test for heteroskedasticity using the Breusch-Pagan test.

#H0: errors are homosked
n <- nrow(gpa2)
m2 <- lm(colgpa ~ tothrs + sat, gpa2)
k.bptest <- 2 #number of variables
m2.res <- residuals(m2)
m2.res.sq <- m2.res^2 #step 1
Rsq_m2.res.sq <- summary(lm(m2.res.sq ~ tothrs + sat, gpa2))$r.sq #step2
numerator3 <- Rsq_m2.res.sq / k.bptest
denominator3 <- (1 - Rsq_m2.res.sq) / (n - k.bptest - 1)
Fobs.bptest <- numerator3 / denominator3
Fcrit.bptest <- qf(1-0.05, k.bptest, n - k.bptest - 1) #alpha = 5%
Fobs.bptest > Fcrit.bptest # reject H0 if TRUE
## [1] TRUE
c(Fobs.bptest, Fcrit.bptest) #comparing the two
## [1] 86.873652  2.997904

Our test rejects H0 (errors are homoskedastic), so we conclude that the errors are heteroskedastic.

For fun: Built-in function bptest()

Let's compare this with the premade function bptest() to check our calculations.

bptest(m2) #a Breusch-Pagan test.
## 
##  studentized Breusch-Pagan test
## 
## data:  m2
## BP = 166.86, df = 2, p-value < 2.2e-16

It also rejects H0 since p-value is low.
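
As a further cross-check, the chi-square (LM) form of the test, n times the R-squared of the auxiliary regression, should come out close to the BP statistic that bptest() reports above; a small sketch reusing the objects we already created:

LM.bptest  <- n * Rsq_m2.res.sq            #LM form of the Breusch-Pagan test
chisq.crit <- qchisq(1 - 0.05, k.bptest)   #5 % chi-square critical value
c(LM.bptest, chisq.crit)                   #reject H0 if the first number is larger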

1.4)

Since the errors are heteroskedastic we want to do a robust regression. We have a large sample (over 4000 observations), so with regard to sample size we should be fine.

Robust regression is done by iterated re-weighted least squares (IRLS).

(Info about R: The command for running robust regression is rlm in the MASS package. There are several weighting functions that can be used for IRLS. We are going to use the Huber weights. We will then look at the final weights created by the IRLS process. See more at http://www.ats.ucla.edu/stat/r/dae/rreg.htm)

#we use rlm() instead of "reg colgpa tothrs, robust" in stata
m1.robust <- rlm(colgpa ~ tothrs, gpa2)
summary(m1.robust)
## 
## Call: rlm(formula = colgpa ~ tothrs, data = gpa2)
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.57006 -0.46256 -0.01603  0.45676  1.43647 
## 
## Coefficients:
##             Value    Std. Error t value 
## (Intercept)   2.5396   0.0188   135.4417
## tothrs        0.0022   0.0003     7.3842
## 
## Residual standard error: 0.6839 on 4135 degrees of freedom

A short comparison between the two can be made either by looking back and forth between summary(m1) and summary(m1.robust), or in the output below.

summary(m1)$coef[2,1] - summary(m1.robust)$coef[2,1] # higher coef for m1
## [1] 0.0003308867
summary(m1)$coef[2,2] - summary(m1.robust)$coef[2,2] # lower std err for m1
## [1] -7.753246e-06
summary(m1)$coef[2,3] - summary(m1.robust)$coef[2,3] # higher t for m1
## [1] 1.351103

Above we have answered “How do your standard errors and test statistics change?”. Now to the question “Why?”. We use another estimation method and thus get different results - but in which direction should the results shift? According to lecture 9 slide 13, “Robust standard errors are typically somewhat larger than non-robust standard errors, implying smaller t-statistics. But robust standard errors can be smaller. Usually not a big difference.” So it is not clear that a robust method should always change the output in one particular direction, but for our data, m1 compared to m1.robust has a higher coefficient, a lower standard error, and a higher t-value.

In other words, for our data m1 has the lower standard error, which gives it the higher t-statistic, since t = coef / std.error.

Stata vs R

Stata estimates from its robust regression are pasted below; the estimates are slightly different. (For example, m1.robust differs from Stata on the coefficient - in fact the Stata output is closer to m1 than to m1.robust. This is expected: Stata's reg, robust keeps the OLS point estimates and only changes the standard errors, whereas rlm() re-estimates the coefficients by IRLS.)


. reg colgpa tothrs, robust

Linear regression                                      Number of obs =    4137
                                                       F(  1,  4135) =   77.08
                                                       Prob > F      =  0.0000
                                                       R-squared     =  0.0181
                                                       Root MSE      =  .65272

------------------------------------------------------------------------------
             |               Robust
      colgpa |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      tothrs |   .0025094   .0002858     8.78   0.000      .001949    .0030698
       _cons |   2.520107   .0207105   121.68   0.000     2.479503    2.560711
------------------------------------------------------------------------------
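
If we want to mimic reg colgpa tothrs, robust more closely in R, we can keep the OLS point estimates and only replace the standard errors with heteroskedasticity-robust (HC1, Stata-style) ones. A minimal sketch, assuming the sandwich package is installed:

library(sandwich)                               #for vcovHC()
coeftest(m1, vcov = vcovHC(m1, type = "HC1"))   #OLS coefficients, robust std. errors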

2 Interaction Terms

2.1)

We regress

mfb <- lm(colgpa ~ female + black, gpa2)
summary(mfb)
## 
## Call:
## lm(formula = colgpa ~ female + black, data = gpa2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.61122 -0.43122 -0.00757  0.44842  1.82842 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.61122    0.01375 189.938  < 2e-16 ***
## female       0.14635    0.02023   7.233 5.58e-13 ***
## black       -0.43964    0.04402  -9.988  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6472 on 4134 degrees of freedom
## Multiple R-squared:  0.03481,    Adjusted R-squared:  0.03434 
## F-statistic: 74.54 on 2 and 4134 DF,  p-value: < 2.2e-16

Our model is

Colgpa = m + f*female + b*black

Possible buckets for the population wrt the two variables are

gender \ color    black    non-black
male              bm       wm
female            bf       wf

There is difference in gender, and difference in color.

Our baseline (\(\beta_0\) in the regression) is a non-black male.

coef activated   black   female   interpretation     number
m                0       0        non-black male     one
m+b              1       0        black male         two
m+f              0       1        non-black female   three
m+f+b            1       1        black female       four

We construct the table so that we can compare black vs non-black and female vs male. Notice the following.

Since b is the difference between black and non-black both for males (row two minus row one) and for females (row four minus row three), the regression makes b the effect of black = 1 while keeping gender constant. In other words, b measures the difference by color, regardless of gender. With the exact same argument, f is the difference by gender, regardless of color.

The regression output gave m = 2.611, f = 0.146 and b = -0.440.

Does the predicted GPA for black females correspond to the mean GPA for black females? Discuss why or why not.

Black females have both female=1 & black=1.

#Predicted value for black females
#get the values from our regression colgpa ~ female + black
mean_colgpa_black_females <- 2.61122 + 1*(0.14635) + 1*(-0.43964)
mean_colgpa_black_females
## [1] 2.31793
#Mean for black females
black_females <- subset(gpa2, gpa2$female==1 & gpa2$black==1)
mean(black_females$colgpa)
## [1] 2.325575
#actual mean - predicted mean
mean(black_females$colgpa) - mean_colgpa_black_females
## [1] 0.007645221

The actual mean is slightly higher, so the model underpredicts. We can understand this from the residual plot, where some residuals are below -2 whereas no residual is greater than 2. (See also the residual quantiles in summary(mfb): min(resid) = -2.6 and max(resid) = 1.8.)

plot(resid(mfb)); abline(0,0); abline(-2,0)

The economic reason why the model underpredicts could be an interaction effect between being female and being black. We investigate this below.

2.2)

We regress

m3 <- lm(colgpa ~ female + black + female*black, gpa2)
summary(m3)
## 
## Call:
## lm(formula = colgpa ~ female + black + female * black, data = gpa2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.61162 -0.43162 -0.00708  0.44442  1.83586 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.61162    0.01392 187.555  < 2e-16 ***
## female        0.14546    0.02083   6.985 3.32e-12 ***
## black        -0.44748    0.06169  -7.253 4.83e-13 ***
## female:black  0.01598    0.08806   0.181    0.856    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6473 on 4133 degrees of freedom
## Multiple R-squared:  0.03481,    Adjusted R-squared:  0.03411 
## F-statistic: 49.69 on 3 and 4133 DF,  p-value: < 2.2e-16

Our model is

Colgpa = m + f*female + b*black + i*female*black

Again, the possible buckets are:

gender \ color    black    non-black
male              bm       wm
female            bf       wf

There is difference in gender, and difference in color.

coef activated   black   female   interpretation     number
m                0       0        non-black male     one
m+b              1       0        black male         two
m+f              0       1        non-black female   three'
m+f+b+i          1       1        black female       four'

We construct the table so that we can compare black vs non-black and female vs male.

This follows the same example as lecture 8 slide 7, where we also have four different categories.

(We could also note that i = (b+i) - b, the difference in the color effect between females and males, and i = (f+i) - f, the difference in the gender effect between blacks and non-blacks, but we do not use this since we have already interpreted the four categories.)

Does the predicted GPA for black females correspond to the mean GPA for black females? Discuss why or why not.

#Predicted value for black females
#get the values from our regression m3
#note: 2.61122 below is mfb's intercept; m3's own intercept is 2.61162,
#which explains the tiny remaining gap to the group mean further down
mean_colgpa_black_females_2 <- 2.61122 + 1*(0.14546) + 1*(-0.44748) +
                                1*1*(0.01598)
mean_colgpa_black_females_2
## [1] 2.32518
#Mean for black females
mean(black_females$colgpa)
## [1] 2.325575
#actual mean - predicted mean
mean(black_females$colgpa) - mean_colgpa_black_females_2
## [1] 0.0003952212
c(mean_colgpa_black_females, mean_colgpa_black_females_2) #higher in 2 2 i.e. m3
## [1] 2.31793 2.32518

So the model with interactions (m3, i.e. colgpa ~ female + black + female * black) has a higher prediction for the mean colgpa of black females than the model without the interaction (colgpa ~ female + black).

Please note that the interaction effect is not statistically significant! We should thus not place much faith in the m3 model.

The actual mean is still very slightly higher, but the remaining gap is essentially a rounding artefact: the hand calculation above used mfb's intercept (2.61122) instead of m3's (2.61162). Using m3's own coefficients, the interaction model reproduces the black-female mean almost exactly, as expected for a model with a separate parameter for every group. The residuals of m3 are practically the same as those of mfb; compare the residual plot and the quantiles in summary(m3), with min(resid) = -2.6 and max(resid) = 1.8.

plot(resid(m3)); abline(0,0); abline(-2,0)
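
A way to avoid re-typing rounded coefficients is to let predict() do the arithmetic; for the interaction model the fitted value for black females should then match their group mean up to floating point error (a small sketch using the objects defined above):

predict(m3, newdata = data.frame(female = 1, black = 1)) #fitted value for black females
mean(black_females$colgpa)                               #group mean, for comparison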

2.3)

According to lecture 8 slides 13 and 14 we could test this in two ways. We choose a Chow test.

We want to see if the relationship between \(colgpa\) and \(sat\) is different for men and women.

Erik's SSR is the SS Residuals = Residual SS = RSS in our terms (which is also sometimes denoted SSE, as in SS Errors).

Theory of Chow tests

We will now perform a Chow test. Suppose that we model our data as \(y_t=a+bx_{1t} + cx_{2t} + \varepsilon\,\) If we split our data into two groups, then we have \(y_t=a_1+b_1x_{1t} + c_1x_{2t} + \varepsilon. \,\) and \(y_t=a_2+b_2x_{1t} + c_2x_{2t} + \varepsilon. \,\) The null hypothesis of the Chow test is \(H_0: \; a_1 = a_2, b_1=b_2, c_1=c_2,\) and there is the assumption that the model errors \(\varepsilon\) are independent and identically distributed from a normal distribution with unknown variance.

Let \(S_C\) be the sum of squared residuals from the combined data, \(S_1\) be the sum of squared residuals from the first group, and \(S_2\) be the sum of squared residuals from the second group. \(N_1\) and \(N_2\) are the number of observations in each group and k is the total number of parameters (in this case 3 since we have a b and c). Then the Chow test statistic is \[F = \frac{(S_C -(S_1+S_2))/(k)}{(S_1+S_2)/(N_1+N_2-2k)}.\] The test statistic follows the F distribution with \(k\) and \(N_1\) + \(N_2-2k\) degrees of freedom. (These facts can for example be found on https://en.wikipedia.org/wiki/Chow_test)

(We use different notation than lecture 8 slide 14 but our k is his k+1 so he’s got another definition of k.)

Performing the test

k <- 2 #we have two parameters, beta0 and beta1, in colgpa ~ sat.
n_f <- sum(gpa2$female) # number of females
n_m <- nrow(gpa2) - sum(gpa2$female) # number of males
m3 <- rlm(colgpa ~ sat, gpa2) #note: this overwrites the m3 from section 2.2
m3.male <- rlm(colgpa ~ sat, subset(gpa2, gpa2$female==0))
m3.female <- rlm(colgpa ~ sat, subset(gpa2, gpa2$female==1))
m3.res <- m3$residuals
m3.rss <- sum(m3.res^2)
m3.male.res <- m3.male$residuals
m3.male.rss <- sum(m3.male.res^2)
m3.female.res <- m3.female$residuals
m3.female.rss <- sum(m3.female.res^2)
numerator <- (m3.rss - (m3.male.rss + m3.female.rss) ) / k
denominator <- (m3.male.rss + m3.female.rss ) / (n_m + n_f - 2*k)
F_obs <- numerator / denominator
F_crit <- qf(1-0.01, k, n_f + n_m - 2*k) #alpha = 1 %
F_obs > F_crit  #reject H0 if true
## [1] TRUE
c(F_obs, F_crit) #Fobs is much larger than Fcrit
## [1] 76.703294  4.610305

Since we reject the null, we believe there is strong evidence that the relationship between sat and colgpa is different for males and females.
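
As a cross-check using plain OLS instead of rlm(), the same null hypothesis can be tested with an F-test on a fully interacted model; a sketch (the model names here are only for illustration):

restricted   <- lm(colgpa ~ sat, gpa2)          #same intercept and slope for both sexes
unrestricted <- lm(colgpa ~ sat * female, gpa2) #separate intercept and slope by sex
anova(restricted, unrestricted)                 #F-test of the two extra parameters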

3) Functional Form

We want to investigate the relationship between the variables \(colgpa\) and \(sat\), building on what we have seen in the questions above.

Models we can build

Regular transformations:

ffs   <- lm(colgpa ~ sat, gpa2)
ffss  <- lm(colgpa ~ sat^2, gpa2)    #see the note below: this needs I(sat^2) to be a true quadratic
ffsl  <- lm(colgpa ~ log(sat), gpa2) #be wary of log close to zero
#colgpa ~ sqrt(sat) #oftentimes a bit similar to log

We will use square and log transformations.
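
One detail worth flagging: inside an R formula the ^ operator means interaction crossing, so colgpa ~ sat^2 fits exactly the same model as colgpa ~ sat (which is why some of the adjusted R-squared values further down come out identical to their non-squared counterparts). A genuine quadratic needs I() or poly(); a sketch:

ffss_quad <- lm(colgpa ~ sat + I(sat^2), gpa2) #protect ^ with I() for a real quadratic
ffss_poly <- lm(colgpa ~ poly(sat, 2), gpa2)   #or use an orthogonal polynomial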

Placing dummies on sat can be done by:

ff4s  <- lm(colgpa ~ sat                      + female, gpa2)
ff5s  <- lm(colgpa ~ sat + black, gpa2)
ff6s  <- lm(colgpa ~ sat                      + female + sat*female, gpa2)
ff7s  <- lm(colgpa ~ sat + black  + sat*black, gpa2)
ff8s  <- lm(colgpa ~ sat + black  + sat*black + female + sat*female, gpa2)

Note that we do not use, for example, colgpa ~ sat + black + sat*black + female + sat*female + sat*black*female, because we do not want the black*female interaction, since we established above that it is not significant.

Adding tothrs

colgpa ~ sat + tothrs can be used.

ff4st  <- lm(colgpa ~ sat + tothrs                      + female, gpa2)
ff5st  <- lm(colgpa ~ sat + tothrs + black, gpa2)
ff6st  <- lm(colgpa ~ sat + tothrs                      + female + sat*female, gpa2)
ff7st  <- lm(colgpa ~ sat + tothrs + black  + sat*black, gpa2)
ff8st  <- lm(colgpa ~ sat + tothrs + black  + sat*black + female + sat*female, gpa2)

Taking logs and squares

Besides ffs, ffss and ffsl we now also have the models ff4s, ff5s, ff6s, ff7s, ff8s and ff4st, ff5st, ff6st, ff7st, ff8st.

Now we take the models above and apply log or square.

ff4ss  <-    lm(colgpa ~ sat^2                      + female, gpa2)
ff5ss  <-    lm(colgpa ~ sat^2 + black, gpa2)
ff6ss  <-    lm(colgpa ~ sat^2                      + female + sat*female, gpa2)
ff7ss  <-    lm(colgpa ~ sat^2 + black  + sat*black, gpa2)
ff8ss  <-    lm(colgpa ~ sat^2 + black  + sat*black + female + sat*female, gpa2)
ff4stl <-    lm(colgpa ~ log(sat) + tothrs                      + female, gpa2)
ff5stl <-    lm(colgpa ~ log(sat) + tothrs + black, gpa2)
ff6stl <-    lm(colgpa ~ log(sat) + tothrs                      + female + sat*female, gpa2)
ff7stl <-    lm(colgpa ~ log(sat) + tothrs + black  + sat*black, gpa2)
ff8stl <-    lm(colgpa ~ log(sat) + tothrs + black  + sat*black + female + sat*female, gpa2)

Thus we end up with 10 more models.

Note that to interpret, for example, sat*female we need both sat and female to be statistically significant - otherwise we do not allow ourselves to interpret sat*female.

The models

We have built: ffs ffss ffsl ff4s ff5s ff6s ff7s ff8s ff4st ff5st ff6st ff7st ff8st ff4ss ff5ss ff6ss ff7ss ff8ss ff4stl ff5stl ff6stl ff7stl ff8stl

Comparing the models

rsquares <- c(
summary(ffs)$adj.r.sq ,
summary(ffss)$adj.r.sq ,
summary(ffsl)$adj.r.sq ,
summary(ff4s)$adj.r.sq ,
summary(ff5s)$adj.r.sq ,
summary(ff6s)$adj.r.sq ,
summary(ff7s)$adj.r.sq ,
summary(ff8s)$adj.r.sq ,  
summary(ff4st)$adj.r.sq , #9
summary(ff5st)$adj.r.sq , 
summary(ff6st)$adj.r.sq , #11
summary(ff7st)$adj.r.sq ,
summary(ff8st)$adj.r.sq , #13
summary(ff4ss)$adj.r.sq , 
summary(ff5ss)$adj.r.sq ,
summary(ff6ss)$adj.r.sq ,  
summary(ff7ss)$adj.r.sq ,
summary(ff8ss)$adj.r.sq ,  
summary(ff4stl)$adj.r.sq ,
summary(ff5stl)$adj.r.sq ,
summary(ff6stl)$adj.r.sq , #21
summary(ff7stl)$adj.r.sq ,
summary(ff8stl)$adj.r.sq) #23

rsquares
##  [1] 0.1668443 0.1668443 0.1588596 0.1962999 0.1698046 0.1962051 0.1699332
##  [8] 0.1994494 0.2123381 0.1879000 0.2122810 0.1879559 0.2154367 0.1962999
## [15] 0.1698046 0.1962051 0.1699332 0.1994494 0.2028373 0.1796000 0.2183375
## [22] 0.1942733 0.2233195
rsquares[c(9, 11, 13, 21, 23)]
## [1] 0.2123381 0.2122810 0.2154367 0.2183375 0.2233195
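
The same comparison can be written more compactly by keeping the models in a named list; a sketch with a subset of the models:

models <- list(ffs = ffs, ff4st = ff4st, ff6st = ff6st, ff8st = ff8st,
               ff6stl = ff6stl, ff8stl = ff8stl)
sort(sapply(models, function(m) summary(m)$adj.r.squared), decreasing = TRUE)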

These models are ff4st, ff6st, ff8st, ff6stl and ff8stl. We prefer simple models over complex ones, so we drop ff8st and ff8stl because they use more variables. We also prefer residual quantiles that are roughly "symmetric". The three remaining models are compared below.

library(stargazer)
## 
## Please cite as: 
## 
##  Hlavac, Marek (2015). stargazer: Well-Formatted Regression and Summary Statistics Tables.
##  R package version 5.2. http://CRAN.R-project.org/package=stargazer
stargazer(ff4st, ff6st, ff6stl, type = "html")
                          Dependent variable: colgpa
                     -----------------------------------------------
                      (1) ff4st       (2) ff6st       (3) ff6stl
 log(sat)                                             -3.519***
                                                       (0.612)
 sat                   0.002***        0.002***        0.006***
                      (0.0001)        (0.0001)         (0.001)
 female:sat                                           -0.00004
                                                      (0.0001)
 tothrs                0.002***        0.002***        0.002***
                      (0.0003)        (0.0003)        (0.0003)
 female                0.223***        0.339**         0.272*
                      (0.019)         (0.140)         (0.140)
 sat:female                           -0.0001
                                      (0.0001)
 Constant              0.311***        0.264***       21.092***
                      (0.072)         (0.091)         (3.626)
 Observations          4,137           4,137           4,137
 R2                    0.213           0.213           0.219
 Adjusted R2           0.212           0.212           0.218
 Residual Std. Error   0.585           0.585           0.582
                       (df = 4133)     (df = 4132)     (df = 4131)
 F Statistic           372.661***      279.651***      232.057***
                       (df = 3; 4133)  (df = 4; 4132)  (df = 5; 4131)
 Note: *p<0.1; **p<0.05; ***p<0.01

A quick look at their residual quantiles using summary() shows no real difference there.

The model with the highest F-statistic and the fewest variables is ff4st, so we pick this one.

Conclusion

We chose model ff4st and it is summarized below

summary(ff4st)
## 
## Call:
## lm(formula = colgpa ~ sat + tothrs + female, data = gpa2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.88317 -0.38386  0.01403  0.40784  1.79410 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 3.105e-01  7.150e-02   4.343 1.44e-05 ***
## sat         2.054e-03  6.599e-05  31.129  < 2e-16 ***
## tothrs      2.377e-03  2.575e-04   9.229  < 2e-16 ***
## female      2.228e-01  1.851e-02  12.039  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5845 on 4133 degrees of freedom
## Multiple R-squared:  0.2129, Adjusted R-squared:  0.2123 
## F-statistic: 372.7 on 3 and 4133 DF,  p-value: < 2.2e-16

4 Errors in Variables

For simplicity of notation we let \(\varepsilon = e\).

We know that \(Var(ability) = 0.8\), \(Var(e) = 0.2\) and \(CoV(ability, e) = 0\).

4.1)

Theory

Our model 1 is: \[colgpa = \beta_0 + \beta_1 ability + u \quad (1) \\ CoV(ability, u) = 0\] where \(u\) is the random error term.

This is our first model. But then we make some changes to it. Start by introducing \(ability = sat - e \quad (\gamma)\), so the measured score sat equals true ability plus some random noise \(e\). Our model can then be written \(colgpa = \beta_0 + \beta_1 (sat - e) + u \Leftrightarrow colgpa = \beta_0 + \beta_1 sat + v\) where \(v = u - \beta_1 e\).

Let’s further make the trivial assumption that \(E[e]=0\). We then make the crucial assumption - the classical errors in variables assumption (CEV) - that \(CoV(ability, e)=0.\) So to conclude, we have our model 2: \[colgpa = \beta_0 + \beta_1 sat + v \quad (2) \\ CoV(ability, e)=0.\]

From theory we know that this leads to two things.

  1. (Std.Err up). Firstly, the standard errors go up when going from (1) to (2). Since this is true in the PV case it should be true in the CEV case also. 2
  2. (Biased coef) Secondly, the estimated coefficient on the sat score is biased, so that \[\operatorname{plim} \hat \beta_1 = \beta_1 + \frac{CoV(sat, v)}{Var(sat)} = \dots = \beta_1 \left( \frac{Var(ability)}{Var(ability)+Var(e)} \right).\] The scalar in front of \(\beta_1\) is less than 1, so the naive estimate from (2) is biased toward zero, and the error-corrected estimate is correspondingly higher. Define \(r = \text{reliability ratio} = \frac{Var(ability)}{Var(ability)+Var(e)}\). Then3 the error-corrected (CEV) estimate and the naive OLS estimate on sat are related by \[\beta_{CEV}^{sat} \cdot r = \beta_{OLS}^{sat}.\]

As the last bit in this theory part, let us show that \(CoV(sat, e) = Var(e)\), which matters because \(Var(e) > 0\), i.e. non-zero. We have \[CoV(sat, e) = E[sat \cdot e] - E[sat]E[e] = E[sat \cdot e] \quad (\text{since } E[e]=0)\] \[= E[(ability + e)e] = E[ability \cdot e] + E[e^2] = 0 + E[e^2] = E[e^2] - (E[e])^2 = Var(e),\] where \(E[ability \cdot e] = 0\) follows from \(CoV(ability, e) = 0\) together with \(E[e]=0\), and in the last step we subtract \((E[e])^2 = 0\).
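
As a purely illustrative simulation (made-up numbers, not the gpa2 data), we can see the attenuation factor \(r = 0.8\) at work:

set.seed(1)                             #illustrative simulation only
n.sim   <- 100000
ability <- rnorm(n.sim, sd = sqrt(0.8)) #Var(ability) = 0.8
e       <- rnorm(n.sim, sd = sqrt(0.2)) #Var(e) = 0.2, CoV(ability, e) = 0
sat.sim <- ability + e                  #observed, error-ridden regressor
y       <- 1 + 2*ability + rnorm(n.sim) #true beta1 = 2
coef(lm(y ~ sat.sim))[2]                #close to 2 * 0.8 = 1.6 (attenuated)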

Output

We run eivreg colgpa sat, r(sat .8) in Stata and compare it to reg colgpa sat. The output is below.


. eivreg colgpa sat, r(sat .8)

                   assumed                      Errors-in-variables regression
    variable     reliability
----------------------------                           Number of obs =    4137
         sat       0.8000                              F(  1,  4135) =  873.03
           *       1.0000                              Prob > F      =  0.0000
                                                       R-squared     =  0.2088
                                                       Root MSE      =  .58592

------------------------------------------------------------------------------
      colgpa |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         sat |   .0024138   .0000817    29.55   0.000     .0022537     .002574
       _cons |   .1656496   .0846635     1.96   0.050    -.0003365    .3316356
------------------------------------------------------------------------------

. reg colgpa sat

      Source |       SS       df       MS              Number of obs =    4137
-------------+------------------------------           F(  1,  4135) =  829.26
       Model |  299.712725     1  299.712725           Prob > F      =  0.0000
    Residual |  1494.48295  4135   .36142272           R-squared     =  0.1670
-------------+------------------------------           Adj R-squared =  0.1668
       Total |  1794.19567  4136  .433799728           Root MSE      =  .60118

------------------------------------------------------------------------------
      colgpa |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         sat |   .0019311   .0000671    28.80   0.000     .0017996    .0020625
       _cons |   .6630568   .0697213     9.51   0.000     .5263656     .799748

The results are (for the eivreg)

  • \(coef = .0024138 = \beta_{CEV}^{sat}\)
  • \(std.err = .0000817\)

Compared to (for the usual reg)

  • \(coef = .0019311 = \beta_{OLS}^{sat}\)
  • \(std.err = .0000671\)

Theory vs output

Let us compare the theory to the output we have, with regards to the two claimed results.

Standard errors

The standard error for the sat coefficient went from .0000671 in reg to .0000817 in eivreg, just as theory predicted: the standard error is higher.

Biased coef

In our case the reliability ratio is \(r=0.8\), so the error-corrected coefficient times \(0.8\) should equal the naive one. Stated in Stata terms: (the coef from eivreg) * (.8) = (the coef from reg), and this is observed in the output, where the coef from eivreg is .0024138, the coef from reg is .0019311, and indeed \(.0024138 \cdot 0.8 \approx .0019310\).

(As a final check, if we did not know the \(r\) value we could still do a sanity check: we know \(\beta_{CEV}^{sat} = .0024138\) and \(\beta_{OLS}^{sat} = .0019311\), and given that \(r \leq 1\) we can conclude that if the coefficient is positive we should observe \(\beta_{OLS}^{sat} < \beta_{CEV}^{sat}\); for us this is indeed the case since \(.0019311 < .0024138\).)

PS a note on Stata vs R

Do note that we compared the eivreg coefficients from Stata with the reg output from Stata just to make it easy on the eyes - we could have chosen to compare eivreg to the R output as well. A quick look at anova(lm(colgpa ~ sat, gpa2)) shows that it has the same F-value, and from summary(lm(colgpa ~ sat, gpa2)) we see the same coefficients and t-statistics, only reported with a different number of digits.

summary(lm(colgpa ~ sat, gpa2))$coef #coef in R regression
##                Estimate   Std. Error   t value      Pr(>|t|)
## (Intercept) 0.663056784 6.972127e-02  9.510108  3.134626e-21
## sat         0.001931058 6.705796e-05 28.796842 2.322657e-166
# std error of sat in stata is 0.0000671
0.0000671 - summary(lm(colgpa ~ sat, gpa2))$coef[2,2]
## [1] 4.20383e-08

The difference is tiny since 6.705796e-05 ≈ 6.71e-05 = 0.0000671, so R just reports more digits.

4.2)

We have a measurement error in the dependent variable colgpa.

People want to impress, so they are more likely to round up their estimates. A true 3.75 might be reported as a 4.0 GPA and a true 4.3 as a 4.5 GPA. Let’s call this bragging.

Denote observed gpa as \(gpa\). Denote true gpa as \(gpa^*\). If people brag about their GPA, then we expect that \(gpa^* \leq gpa\) for almost all students and that on average \(gpa^* < gpa\). If we define \(e_0 = gpa - gpa^*\) then \(E[e_0] > 0\).

How would this affect regression estimates? For simplicity of notation, we keep the discussion to the regression colgpa ~ sat.

We know from lecture 10 slide 5 that if we assume \(CoV(sat, e_0) = 0\) and \(E[e_0] = 0\), then OLS is consistent and nothing changes. This is really two assumptions in one, and we split them into two parts:

  • \(E[e_0] = 0\) (the trivial assumption)
  • \(CoV(sat, e_0) = 0\), i.e. the bragging error is unrelated to sat (the important assumption)

The first assumption is the trivial one: if it fails, we only need to re-interpret the intercept, as shown in lecture 2 slides 10 to 14 (especially slide 12).

The second assumption is the important one, since it is what keeps the OLS slope estimator consistent.

In our regression colgpa ~ sat we still assume that the important assumption holds (the bragging error is unrelated to sat), but we allow the trivial assumption to be violated; in fact this violation is exactly what the question describes, since bragging is going on. Because the important assumption holds, the OLS slope estimate is still consistent (just as lecture 10 slide 5 says), but the violation of the trivial assumption forces us to re-interpret the intercept \(\beta_0\) (just as lecture 2 slide 12 tells us).

The regression with the true gpa is \(gpa^* = \beta_0 + \beta_1 sat + u\) with \(E[u]=0\). With bragging we instead observe \(gpa = gpa^* + e_0 = \beta_0 + \beta_1 sat + (u + e_0)\), where the composite error \(u + e_0\) has mean \(E[e_0] > 0\). Writing \(e_0 = E[e_0] + \tilde e_0\) with \(E[\tilde e_0]=0\), this becomes \(gpa = (\beta_0 + E[e_0]) + \beta_1 sat + (u + \tilde e_0)\), a regression that satisfies the usual assumptions but with a new intercept. So the slope is unaffected, while the estimated intercept converges to \(\beta_0 + E[e_0]\): it is biased upward by the average amount of bragging. But we rarely interpret the intercept coefficient anyway.
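
A purely illustrative simulation (made-up numbers) of this point: the slope is unchanged while the intercept absorbs the average bragging \(E[e_0]\):

set.seed(2)                                      #illustrative simulation only
n.sim    <- 100000
x        <- rnorm(n.sim, mean = 1000, sd = 100)  #stand-in for sat
gpa.true <- 1 + 0.002*x + rnorm(n.sim, sd = 0.5) #true model
e0       <- abs(rnorm(n.sim, sd = 0.2))          #bragging: E[e0] > 0, unrelated to x
gpa.obs  <- gpa.true + e0                        #reported (inflated) gpa
coef(lm(gpa.true ~ x))                           #intercept about 1, slope about 0.002
coef(lm(gpa.obs ~ x))                            #same slope, intercept higher by about E[e0]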

5 Linear Probability Model

What could be the determinants of a student becoming an athlete in college?

A logistic regression would be a natural choice in this setting, but we will use the linear probability model since that is what is asked for.

Let x \(= (x_1, ..., x_k)\). athlete is a dummy variable that only takes the values 0 and 1. Since E[athlete | x] = P[athlete = 1 | x] we can use the linear probability model.

What should we put in x? Note that most students who are athletes in college were probably athletes before they went to college. This is important background information for our analysis. Furthermore, many colleges require a lower high school GPA of athletes (as reflected by the variables hsperc and hsrank), since being an athlete can put you in a different admissions bucket where you compete with fewer students and hence a lower grade is required.

What is a criterion for a "good" model? There are many possible criteria.

How do we do this in practice? We might put 30 % of the observations into a "training set", set up a stepwise selection process, and then validate the chosen model against the remaining 70 % of the data (the "test set"). A sketch of such a procedure is shown after the next paragraph.

We will, however, first reason through which variables our model should contain - based on economist intuition and a few checks of the data - and then create a model and evaluate it.
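
For reference, the split-and-stepwise procedure mentioned above could look roughly like this (a hypothetical sketch; we do not pursue it further):

set.seed(3)
train.idx  <- sample(nrow(gpa2), size = round(0.3 * nrow(gpa2)))  #30 % training set
train      <- gpa2[train.idx, ]
test       <- gpa2[-train.idx, ]
full.model <- lm(athlete ~ colgpa + sat + tothrs + female + black + hsperc, train)
step.model <- step(full.model, direction = "both", trace = 0)     #stepwise selection by AIC
mean((test$athlete - predict(step.model, newdata = test))^2)      #validation: mean squared error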

5.1)

Formulate a linear probability model to answer this question using the variables available in the dataset, as well as suitable transformations of those variables.

The possible variables are

sat       combined SAT score
tothrs    total hours through fall semest
colgpa    GPA after fall semester
athlete   =1 if athlete
verbmath  verbal/math SAT score
hsize     size grad. class, 100s
hsrank    rank in grad. class
hsperc    high school percentile, from top
female    =1 if female
white     =1 if white
black     =1 if black
hsizesq   hsize^2

Let’s see how hsize, hsrank, hsperc look.

range(gpa2$hsize)
## [1] 0.03 9.40
range(gpa2$hsrank)
## [1]   1 634
range(gpa2$hsperc)
## [1]  0.1666667 92.0000000
gpa2[1:9, c(6,7,8)]
##   hsize hsrank   hsperc
## 1  0.10      4 40.00000
## 2  9.40    191 20.31915
## 3  1.19     42 35.29412
## 4  5.71    252 44.13310
## 5  2.14     86 40.18692
## 6  2.68     41 15.29851
## 7  3.11    161 51.76849
## 8  2.68    101 37.68657
## 9  3.67    161 43.86921

Now we understand these variables better, which is important for model selection. It seems like they measure class size (hsize) and students' performance relative to their graduating class, either as a rank (hsrank) or as a percentile (hsperc).

Let us get to the model by thinking about (and guessing) which variables should be important.

The number of athletes in our dataset is important because if this number is too small, say 9 athletes, then it is very hard for us to make any generalizations or predictions.

sum(gpa2$athlete==1)
## [1] 194

First round of guessing

Our guess would be to include

  • female. We think fewer females are athletes.
  • black & white. Different races might be over- or under-represented.
  • tothrs. This variable is the number of class hours a student has taken, and we suspect athletes may have a different schedule that does not allow the same number of classes.

Let us investigate these guesses with some tables, which are facts from the sample we have (and we try to use this sample to say something about the general population).

Female

CrossTable(gpa2$athlete, gpa2$female)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  4137 
## 
##  
##              | gpa2$female 
## gpa2$athlete |         0 |         1 | Row Total | 
## -------------|-----------|-----------|-----------|
##            0 |      2128 |      1815 |      3943 | 
##              |     0.821 |     1.006 |           | 
##              |     0.540 |     0.460 |     0.953 | 
##              |     0.935 |     0.976 |           | 
##              |     0.514 |     0.439 |           | 
## -------------|-----------|-----------|-----------|
##            1 |       149 |        45 |       194 | 
##              |    16.696 |    20.439 |           | 
##              |     0.768 |     0.232 |     0.047 | 
##              |     0.065 |     0.024 |           | 
##              |     0.036 |     0.011 |           | 
## -------------|-----------|-----------|-----------|
## Column Total |      2277 |      1860 |      4137 | 
##              |     0.550 |     0.450 |           | 
## -------------|-----------|-----------|-----------|
## 
## 
149/2277 #fraction of male that are athletes
## [1] 0.06543698
45/1860 #fraction of females that are athletes
## [1] 0.02419355
149/194 #of those who are athletes, approx 3/4 are male.
## [1] 0.7680412

Conclusion: An athlete is more likely to be male, since 3/4 of athletes are male.

Conclusion: Among males, being an athlete is more common than among females.

So we include the variable female.

Black & white

#ftable(xtabs(~athlete + black, gpa2))
CrossTable(gpa2$athlete, gpa2$black)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  4137 
## 
##  
##              | gpa2$black 
## gpa2$athlete |         0 |         1 | Row Total | 
## -------------|-----------|-----------|-----------|
##            0 |      3758 |       185 |      3943 | 
##              |     0.297 |     5.069 |           | 
##              |     0.953 |     0.047 |     0.953 | 
##              |     0.962 |     0.808 |           | 
##              |     0.908 |     0.045 |           | 
## -------------|-----------|-----------|-----------|
##            1 |       150 |        44 |       194 | 
##              |     6.037 |   103.021 |           | 
##              |     0.773 |     0.227 |     0.047 | 
##              |     0.038 |     0.192 |           | 
##              |     0.036 |     0.011 |           | 
## -------------|-----------|-----------|-----------|
## Column Total |      3908 |       229 |      4137 | 
##              |     0.945 |     0.055 |           | 
## -------------|-----------|-----------|-----------|
## 
## 
44/229 #fraction of athletes among blacks
## [1] 0.1921397
150/3908 #fraction of athletes among non-blacks
## [1] 0.0383828

Conclusion: Among blacks, being an athlete is more common than among non-blacks.

#ftable(xtabs(~athlete + white, gpa2))
CrossTable(gpa2$athlete, gpa2$white)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  4137 
## 
##  
##              | gpa2$white 
## gpa2$athlete |         0 |         1 | Row Total | 
## -------------|-----------|-----------|-----------|
##            0 |       255 |      3688 |      3943 | 
##              |     5.064 |     0.407 |           | 
##              |     0.065 |     0.935 |     0.953 | 
##              |     0.828 |     0.963 |           | 
##              |     0.062 |     0.891 |           | 
## -------------|-----------|-----------|-----------|
##            1 |        53 |       141 |       194 | 
##              |   102.928 |     8.279 |           | 
##              |     0.273 |     0.727 |     0.047 | 
##              |     0.172 |     0.037 |           | 
##              |     0.013 |     0.034 |           | 
## -------------|-----------|-----------|-----------|
## Column Total |       308 |      3829 |      4137 | 
##              |     0.074 |     0.926 |           | 
## -------------|-----------|-----------|-----------|
## 
## 
141/3829 #fraction of athletes among whites
## [1] 0.03682424
53/308 #fraction of athletes among non-whites
## [1] 0.1720779

Conclusion: Among whites, being an athlete is less common than among non-whites.

#but is black and white measuring the same thing?
#turns out - it almost does.
nrow(subset(gpa2, gpa2$black==0 & gpa2$white==0)) #neither black nor white
## [1] 79
# this group is very small in our dataset of 4000 students
nrow(subset(gpa2, gpa2$black==0 & gpa2$white==0)) / nrow(gpa2)
## [1] 0.01909596

Conclusion: Here we see that black and white are almost measuring the same thing. For practical purposes, we will therefore not include both black and white in our model, but instead stick to only black. (We could have chosen only white, of course.)

In summary, we include the variable black but not the variable white.

Tothrs

Our reasoning about tothrs so far is purely based on beliefs and not on tabulated data. If tothrs matters for becoming an athlete, the distribution of tothrs should not look the same for athletes and non-athletes.

ggplot(gpa2, aes(tothrs)) +
    geom_histogram(fill = "white", color = "black") +
    facet_wrap(~athlete, scales = "free_y", ncol=1) +
    theme_bw() +
    ggtitle('Similarities:
            Distribution for non-athletes (top) vs distribution for athletes (bottom)')

Indeed, our eyes cannot distinguish a difference in the distributions. They look pretty much the same.

For these reasons, we do not include the variable tothrs.

Second round of guessing

We might also want to use

  • colgpa. There should be a correlation between gpa and athlete, because athletes generally face lower admission standards, and this probably leads to them having worse study habits than their college peers - with whom they are competing for a high colgpa. We know, however, from assignment 2 that athlete could not explain the variability in colgpa.
  • sat. For the same reason as colgpa, although we believe the effect is weaker for sat: the sat is taken on a single day, so the extra time and study habits that we speculate non-athletes have might matter less for sat than for colgpa.
  • hsperc. Since athletes have a different admission route, we think they differ on hsperc (a worse standing in the graduating class means a higher value of hsperc, since it is measured from the top).

Colgpa and sat

Should we include only colgpa or only sat or both? They might measure the same thing namely how good you are in school. Let’s plot.

ggplot(gpa2, aes(colgpa)) +
    geom_histogram(fill = "white", color = "black") +
    facet_wrap(~athlete, scales = "free_y", ncol=1) +
    theme_bw() +
    ggtitle("colgpa distribution")

Clearly, there is a difference. And by the same argument as before, this might be reason to include colgpa. We also see that it looks as if athletes get lower grades.

ggplot(gpa2, aes(sat)) +
    geom_histogram(fill = "white", color = "black") +
    facet_wrap(~athlete, scales = "free_y", ncol=1) +
    theme_bw() +
    ggtitle("sat distribution")

cor(gpa2$sat, gpa2$colgpa) # correlation
## [1] 0.4087123

The difference between athlete=0 and athlete=1 is not as clear for sat as for colgpa. Looking at the counts on the y-axis (by eyeballing), the bins hold roughly 20 observations for colgpa but only around 10 for sat, so we do not believe there is enough data to claim that the sat distribution looks different, whereas we do believe there is enough data to say that the colgpa distribution looks different.

Furthermore, the correlation of \(0.41\) indicates that including both might introduce some multicollinearity (although not a huge problem, since cut-off points are usually set higher; around 0.70 is one rule of thumb according to Inger Persson at the Statistics Department, Uppsala University). Because they are correlated and because the second plot did not show much of a difference, we exclude sat from our model.

To summarize, we include colgpa but not sat.

hsperc

As we argued earlier, the application and selection process makes us believe hsperc is an important variable to include. Athletes compete on slightly different high school merits than non-athletes do.

The other variables

We see no reason to include more variables. We already have the ones we think are important, and adding more would make the model more complex, which is not in line with our goal of a simple model.

We did not use sat, for the reasons given above. And we do not use verbmath, since we think sat is a better variable than verbmath, and sat was not good enough to include.

We did include hsperc. But hsize and hsrank are also high school variables, and we argued earlier that high school background matters for finding athletes among college students due to the selection process. So why not include hsize and hsrank? We do not include hsize because we see no economic reason for it. We do not include hsrank because it measures almost the same thing as hsperc (they have a fairly large correlation, cor(gpa2$hsrank, gpa2$hsperc) = 0.61), so we do not want both of them, and since hsperc has the easier interpretation we choose hsperc.

Our model

The variables we have chosen are female, black, colgpa and hsperc. They measure gender, race, grades in college and standing in high school. We think they make sense for distinguishing athletes from non-athletes.

am1 <- lm(athlete ~ female + black + colgpa + hsperc, gpa2)
summary(am1)
## 
## Call:
## lm(formula = athlete ~ female + black + colgpa + hsperc, data = gpa2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.34850 -0.06539 -0.02596 -0.00155  1.03138 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.0347003  0.0171984  -2.018   0.0437 *  
## female      -0.0309739  0.0064241  -4.821 1.48e-06 ***
## black        0.1699999  0.0140753  12.078  < 2e-16 ***
## colgpa       0.0121914  0.0054093   2.254   0.0243 *  
## hsperc       0.0027951  0.0002138  13.075  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.203 on 4132 degrees of freedom
## Multiple R-squared:  0.07904,    Adjusted R-squared:  0.07815 
## F-statistic: 88.66 on 4 and 4132 DF,  p-value: < 2.2e-16

comment on expected signs of the coefficients,

Interpretation of coef sign (positive or negative)

  • female < 0: if female, Pr(athlete=1) goes down
  • black > 0: if black, Pr(athlete=1) goes up
  • colgpa > 0: if colgpa goes up, Pr(athlete=1) goes up
  • hsperc > 0: if hsperc goes up, Pr(athlete=1) goes up

Ordering variables by their coef magnitude (largest first)

  1. black 0.1699999
  2. female 0.0309739 (negative)
  3. colgpa 0.0121914
  4. hsperc 0.0027951

This comparison shows us which variables are the most practically relevant (there is a difference between statistical significance and practical relevance - if a coefficient is 0.0000001 the effect is negligible).

Ordering variables by their coef statistical significance (largest t-value first)

  1. hsperc 13.08
  2. black 12.08
  3. female 4.82
  4. colgpa 2.25

This comparison shows us which variables we are most sure of: we are quite sure that hsperc and black affect Pr(athlete=1), whereas we are less sure about female, and even less sure about colgpa. However, they are all statistically significant (the first three at alpha = 0.001 and the fourth at alpha = 0.05), so this comparison is not super relevant.

estimate the model

Our estimated model is

\[athlete = - 0.0347003 - 0.0309739 female + 0.1699999 black + 0.0121914 colgpa + 0.0027951 hsperc.\]

and interpret the results.

If someone is female, the predicted probability of being an athlete is 0.0309739 lower. If someone is black, it is 0.1699999 higher. For each one-point increase in colgpa (on the roughly 0 to 4 GPA scale) the predicted probability increases by 0.0121914, and for each one-unit increase in hsperc it increases by 0.0027951.

Probability check

Remember that we are modelling a probability.

By definition a probability lies in the interval [0,1]. Let us plug in the variable values that push the prediction as high as possible and see if the model predicts a probability larger than 1, and likewise the values that push it as low as possible to see if it predicts a probability smaller than 0.

am1.coef <- coefficients(am1)

maximize <-
    am1.coef[1] +    #intercept
    am1.coef[2]*0 +  #female
    am1.coef[3]*1 +  #black
    am1.coef[4]*max(gpa2$colgpa) +
    am1.coef[5]*max(gpa2$hsperc)

minimize <-
    am1.coef[1] +    #intercept
    am1.coef[2]*1 +  #female
    am1.coef[3]*0 +  #black
    am1.coef[4]*min(gpa2$colgpa) +
    am1.coef[5]*min(gpa2$hsperc)

c(minimize > 0, maximize < 1) # want to return true true
## (Intercept) (Intercept) 
##       FALSE        TRUE
c(minimize, maximize)
## (Intercept) (Intercept) 
## -0.06520837  0.44121610

Since \(minimize = -0.065\) we have a small problem: the model can predict negative "probabilities". This could be fixed with a transformation, for example a logistic regression, which forces \(0<y<1\). We skip this here.
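
A related check on the actual observations (rather than on extreme covariate combinations) is to look at the fitted values of am1 directly; a small sketch:

range(fitted(am1))                      #smallest and largest fitted probability
mean(fitted(am1) < 0 | fitted(am1) > 1) #share of fitted values outside [0, 1]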

5.2)

When we have heteroskedasticity, \(\hat \beta_j\) is still unbiased but the usual estimate of \(Var(\hat \beta_j)\) is biased, and this causes problems in hypothesis testing. One solution is to use robust standard errors.

If the errors are not homoskedastic we will use robust standard errors. (WLS could also be used in this case, but robust standard errors are simpler.)

Are the errors homoskedastic or heteroskedastic?

model5.2 <- lm(colgpa ~ female + black + hsperc, gpa2)
bptest( model5.2 ) #test H0: errors are homoskedastic
## 
##  studentized Breusch-Pagan test
## 
## data:  model5.2
## BP = 12.125, df = 3, p-value = 0.006967

We reject H0 and thus believe the errors are heteroskedastic.
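
Note that model5.2 has colgpa as the dependent variable, while the linear probability model we actually estimated is am1 (athlete ~ colgpa + female + black + hsperc). An LPM is heteroskedastic by construction, since Var(athlete | x) = p(x)(1 - p(x)), so testing that model directly would be the natural complement:

bptest(am1) #Breusch-Pagan test on the linear probability model itself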

Since we already compared lm() and rlm() in a previous question, we here use the Stata commands reg athlete colgpa female black hsperc, robust and reg athlete colgpa female black hsperc to get the output below.


. reg athlete colgpa female black hsperc, robust

Linear regression                          Number of obs =    4137
                                           F(  4,  4132) =   31.00
                                           Prob > F      =  0.0000
                                           R-squared     =  0.0790
                                           Root MSE      =  .20301

-----------------------------------------------------------------------------
             |               Robust
     athlete |      Coef.   Std. Err.      t    P>|t|    [95% Conf. Interval]
-------------+---------------------------------------------------------------
      colgpa |   .0121914   .0049482     2.46   0.014    .0024902    .0218926
      female |  -.0309739   .0058932    -5.26   0.000   -.0425278     -.01942
       black |   .1699999   .0248847     6.83   0.000    .1212125    .2187872
      hsperc |   .0027951   .0003168     8.82   0.000     .002174    .0034162
       _cons |  -.0347003   .0171553    -2.02   0.043   -.0683339   -.0010667
-----------------------------------------------------------------------------


. reg athlete colgpa female black hsperc

      Source |       SS       df       MS            Number of obs =    4137
-------------+------------------------------         F(  4,  4132) =   88.66
       Model |  14.6154553     4  3.65386382         Prob > F      =  0.0000
    Residual |  170.287131  4132  .041211794         R-squared     =  0.0790
-------------+------------------------------         Adj R-squared =  0.0782
       Total |  184.902586  4136  .044705654         Root MSE      =  .20301

----------------------------------------------------------------------------
     athlete |      Coef.   Std. Err.      t    P>|t|    [95% Conf. Interval]
-------------+--------------------------------------------------------------
      colgpa |   .0121914   .0054093     2.25   0.024     .0015863   .0227966
      female |  -.0309739   .0064241    -4.82   0.000    -.0435687  -.0183791
       black |   .1699999   .0140753    12.08   0.000     .1424047    .197595
      hsperc |   .0027951   .0002138    13.07   0.000      .002376   .0032142
       _cons |  -.0347003   .0171984    -2.02   0.044    -.0684184  -.0009822
-----------------------------------------------------------------------------

Notably, the F-statistic is 31 (robust) vs 88 (non-robust), so we are less sure that our model is a good fit of the data when using robust standard errors. We also note a big difference between the robust and non-robust t-statistics for black and hsperc.

By using robust standard errors we are more cautious about our data, and we usually find that "Robust standard errors are typically somewhat larger than non-robust standard errors, implying smaller t-statistics. But robust standard errors can be smaller. Usually not a big difference." Compare the two outputs to see how they differ; a quick comparison is done below.


  Robust         
 Std. Err.       Std. Err.
----------       ---------
 .0049482    <    .0054093 
 .0058932    <    .0064241 
 .0248847    >    .0140753 
 .0003168    >    .0002138 
 .0171553    <    .0171984  

So it is not clear that applying robust standard errors should always change the output in one particular direction; for our data, the result is summarized above.

Even though the difference is not huge, robust standard errors should be used here because they are better justified in theory.


  1. The variable tothrs measure “total hours through fall semest”.

  2. If we use the quicker notation \(x_1 = x_1^* + e\), the PV case can be described as assuming \(Cov(x_1, e)=0\) (PV) rather than assuming \(Cov(x_1^*, e)=0\) (CEV) as we do in this assignment. Lindqvist explicitly derives the result of higher standard errors only in the PV case, so let us reason through why this must also hold in the CEV case. The formulas tell us that in an SLR \(Var( \hat \beta_1 ) = Var(v) / SST_x\), and \(Var(v) = Var(u - \beta_1 e) = Var(u) + \beta_1^2 Var(e) - 2 \beta_1 CoV(u,e)\). Since lecture 10 slide 5 assumes that \(e\) and \(u\) are independent, we make the same assumption, which gives \(Var(v) = Var(u) + \beta_1^2 Var(e) + 0 > Var(u)\) and hence \(Var( \hat \beta_1 ) = Var(v)/SST_x > Var(u)/SST_x\). (We assume \(Cov(u,e) = 0\), although the same conclusion follows from the slightly weaker assumption that \(\beta_1^2 Var(e) - 2 \beta_1 CoV(u,e) > 0\).)

  3. This follows from the equation above containing \(\operatorname{plim}\), remembering that model1coef consistently estimates \(\beta_1\).