Research question

Does engaging in sport activities affect students’ GPA? Download the data set gpa2.dta from the course web page to answer this question. The data set contains information about the grade point average (GPA) of students enrolled in a fairly large research university that also supports men’s and women’s athletics at the Division I level. In addition to GPA, the data set contains information about the Scholastic Assessment Test (SAT), the national standardized test taken by all four-year college-bound students in the United States.

Note on Stata and R

We use R, knitr and Markdown.

This document can be read from any device as an HTML-version on http://rpubs.com/lindberg/110268

Note that, thanks to knitr, the document is structured in the following way. First we discuss something, as in this paragraph. Then, to answer a question, for example “How do you add 1 and 2?”, we write some code and print its output underneath. So here come the code and the output:

1+2
## [1] 3

Note that the [1] indicates that this is the first element of the output. If the output spans more lines, each new line is prefixed with the index of its first element, for example [2].
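
As a small illustration (the exact wrapping depends on the console width, so we only sketch the code here):

# a longer vector wraps over several lines; each printed line starts with the
# bracketed index of its first element
seq(1, 60, by = 2)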

Stata to R

Here we:

  1. Load the data in Stata.
  2. Get the variable descriptions.
  3. Export to CSV.
  4. Import into R.

We load the data into Stata using

use "/Users/jacoblindberg/Dropbox/pathtofile/gpa2.dta"
describe

Running describe in Stata gives us an output where we learn the definitions of the variables.

sat       combined SAT score
tothrs    total hours through fall semest
colgpa    GPA after fall semester
athlete   =1 if athlete
verbmath  verbal/math SAT score
hsize     size grad. class, 100s
hsrank    rank in grad. class
hsperc    high school percentile, from top
female    =1 if female
white     =1 if white
black     =1 if black
hsizesq   hsize^2

We export from Stata to CSV:

outsheet sat tothrs colgpa athlete verbmath hsize hsrank hsperc female white black hsizesq using ddata2.csv , comma nolabel 

Now we have everything we need from Stata, and can say goodbye to it.

Then we import it to R

#note that we named the exported file ddata2.csv rather than gpa2.csv, for no particular reason
gpa2 <- read.csv("~/Dropbox/aktuellaKurser/651econ/datorlabbar/econA2/ddata2.csv")
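
As a side note, the .dta file could also have been read directly into R, skipping the CSV detour. A minimal sketch, assuming the foreign package that ships with R and the same (hypothetical) Dropbox path as above; for newer Stata formats, haven::read_dta would be the alternative:

# alternative route: read the Stata file directly into R
library(foreign)
gpa2 <- read.dta("~/Dropbox/aktuellaKurser/651econ/datorlabbar/econA2/gpa2.dta")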

Library

We use the following R library:

library(ggplot2) 
library(stargazer) 
## 
## Please cite as: 
## 
##  Hlavac, Marek (2015). stargazer: Well-Formatted Regression and Summary Statistics Tables.
##  R package version 5.2. http://CRAN.R-project.org/package=stargazer
library(memisc)
## Loading required package: lattice
## Loading required package: MASS
## 
## Attaching package: 'memisc'
## 
## The following objects are masked from 'package:stats':
## 
##     contr.sum, contr.treatment, contrasts
## 
## The following objects are masked from 'package:base':
## 
##     as.array, trimws
library(ascii)

A note on one-sided vs two-sided hypothesis tests

We use two-tailed p-values for these reasons:

  1. Two-tailed is more conservative.
  2. Two-tailed is easier to interpret.
  3. When comparing three or more groups, the idea of tails is a bit weird. A two-tailed p-value is more consistent with these tests.
  4. If \(H_1: \mu > 0\) but you observe \(\bar x = -100\), say 5 standard deviations below the hypothesized mean, what do you do? With a one-sided test you must fail to reject the null, which is an uncomfortable position: your prior expectation was badly wrong and, as an honest statistician, you still cannot reject the null in this case (as you could have with a two-sided test).
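
To illustrate the first point above, here is a small sketch with made-up numbers showing that the two-tailed p-value is simply twice the one-tailed tail probability, and hence the more conservative of the two:

# hypothetical t statistic and degrees of freedom, purely for illustration
t.obs <- 1.8
df <- 100
pt(t.obs, df, lower.tail = FALSE)   # one-sided p-value for H1: beta > 0
2 * pt(-abs(t.obs), df)             # two-sided p-value, twice as large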

Question 1

Are there missing values in the sample? In general, why should an econometrician care about missing values?

The size of our sample is \(n=4137\).

#read.csv already treats the first row as column names, so the sample size is simply
n <- nrow(gpa2)

The number of missing values is

sum(is.na(gpa2))
## [1] 0

Missing values can occur because of nonresponse in a questionnaire or because of an input error in the data handling. They can be handled using various types of imputation, or by deleting the observations that have missing values. If missing values are not handled properly, the econometrician may end up drawing inaccurate inferences from the data.
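
Had there been missing values, a minimal sketch of how one could inspect and handle them (deletion shown here; imputation would be an alternative):

colSums(is.na(gpa2))            # number of missing values per variable
gpa2.complete <- na.omit(gpa2)  # listwise deletion of rows with any NA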

Question 2

First plot the distributions of the variables sat (this variable indicates the students’ combined SAT score) and tothrs (this variable indicates the total hours of classes taken through fall semester), then comment on graphs.

We create the two histograms.

# we define a new function
number_ticks <- function(n) {function(limits) pretty(limits, n)}
# graphing sat
ggplot(gpa2, aes(sat)) +
    geom_bar(fill = "white", color = "black") +
    ggtitle("Histogram of SAT score")
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

# graphing tothrs
ggplot(gpa2, aes(tothrs)) +
    geom_bar(fill = "white", color = "black", binwidth = 2) +
    ggtitle("Histogram of Total hours") +
    scale_x_continuous(breaks = number_ticks(20))

# a note on binwidth: we first set binwidth = 1, since tothrs is probably reported in whole hours (and the default gave a strange-looking plot), but later changed it to binwidth = 2 to make the picture clearer

# a small detail on the number_ticks() function: scale_x_continuous() controls how many tick marks appear on the x axis, and number_ticks() gives us a sensible value to pass to its "breaks = " argument

The sat histogram looks like a normal distribution, with mean approximately 1000.

For the tothrs histogram, it looks as if we have four categories of students, since there are three “local maxima”, occurring roughly at the tothrs values 15, 45 and 80, after which the distribution flattens out around 115; the four groups would be the students whose tothrs values fall into these intervals.

The cutoff at 95 is a bit questionable, though, and other interpretations of this graph are of course possible.

Question 3

Estimate the following model by OLS 1 \(colgpa = \beta_0 + \beta_1athlete + u \quad (1)\) and report the results both in equation form and in a regression table output (which includes also standard errors, sample size of the sample and R-square). How do you interpret the coefficient \(\hat \beta_1\) ? Is it likely to be an unbiased estimate of \(\beta_1?\) Why or why not?

# Estimating the following model by OLS: colgpa = Bo + B1(athlete) + u
model1 <- lm(formula = colgpa ~ athlete, data = gpa2)

#to get a richer output, i.e. reporting more information from the analysis
#(standard errors, p-values, R-squared, etc.)
#summary(model1)
#stargazer(model1, type="text") #not as good
stargazer(model1, type="html")
                      Dependent variable: colgpa
---------------------------------------------------
athlete               -0.285***
                      (0.048)
Constant              2.666***
                      (0.010)
---------------------------------------------------
Observations          4,137
R2                    0.008
Adjusted R2           0.008
Residual Std. Error   0.656 (df = 4135)
F Statistic           34.790*** (df = 1; 4135)
---------------------------------------------------
Note: *p<0.1; **p<0.05; ***p<0.01

In the output we see that \(df=4135\), and since \(df=n-k-1\) where \(k\) is the number of predictor variables, \(df=n-1-1\) and \(n=4135+2=4137\). In equation form the estimates are \(\widehat{colgpa} = 2.666 - 0.285\,athlete\), with standard errors 0.010 and 0.048, \(n = 4137\) and \(R^2 = 0.008\). The coefficient \(\hat \beta_1\) says that athletes have, on average, a GPA about 0.285 points lower than non-athletes. It is unlikely to be an unbiased estimate of \(\beta_1\), since athlete is correlated with omitted determinants of GPA (for example sat, as we will see in Question 4), so the simple regression suffers from omitted-variable bias.
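
As a quick cross-check in R (a small sketch; given the table above these should return 4135 and 4137):

df.residual(model1)   # residual degrees of freedom
nrow(gpa2)            # sample size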

Question 4

Now, let’s set up a multiple regression analysis. Add to model 1 the regressor sat: \(colgpa = \beta_0 + \beta_1 athlete + \beta_2 sat + u \quad (2)\) and report the estimated coefficients.

# multiple regression analysis: colgpa = Bo + B1(athlete) + B2(sat) + u
model2 <- lm(colgpa~athlete + sat, data=gpa2)
summary(model2)
## 
## Call:
## lm(formula = colgpa ~ athlete + sat, data = gpa2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.84611 -0.38276  0.03056  0.42472  1.76647 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.801e-01  7.134e-02   9.533   <2e-16 ***
## athlete     -5.061e-02  4.499e-02  -1.125    0.261    
## sat          1.917e-03  6.823e-05  28.092   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6012 on 4134 degrees of freedom
## Multiple R-squared:  0.1673, Adjusted R-squared:  0.1669 
## F-statistic: 415.3 on 2 and 4134 DF,  p-value: < 2.2e-16

Generate a table in which to store the output of regressions (1) and (2).

We can compare model1 and model2 in two ways: by looking back and forth between the two outputs above, or by looking at the following table.

#mtable(model1, model2)
stargazer(model1, model2, type = "html")
                      Dependent variable: colgpa
                      ------------------------------------------------
                      (1)                        (2)
----------------------------------------------------------------------
athlete               -0.285***                  -0.051
                      (0.048)                    (0.045)
sat                                              0.002***
                                                 (0.0001)
Constant              2.666***                   0.680***
                      (0.010)                    (0.071)
----------------------------------------------------------------------
Observations          4,137                      4,137
R2                    0.008                      0.167
Adjusted R2           0.008                      0.167
Residual Std. Error   0.656 (df = 4135)          0.601 (df = 4134)
F Statistic           34.790*** (df = 1; 4135)   415.289*** (df = 2; 4134)
----------------------------------------------------------------------
Note: *p<0.1; **p<0.05; ***p<0.01

In light of the result from (2), how would you interpret your results from (1)? (Hint: refer to formula 3.23 from Wooldridge). In other words, how good or bad is being an athlete for students’ GPA?

We see that the most important point is that the coefficient on athlete drops from −0.285 in model1 to −0.051 in model2 and is no longer statistically significant once sat is held fixed. In other words, much of the apparent GPA penalty for athletes in model (1) is explained by athletes having lower SAT scores rather than by being an athlete as such.


As a side note, we can see that athlete is correlated with sat, which may be of interest when comparing model1 and model2.

cor(gpa2$athlete, gpa2$sat)
## [1] -0.1850938
#a correlation of -0.18 is far enough from zero that omitting sat matters for the athlete coefficient

Let us use equation 3.23 in Wooldridge, which relates the simple-regression coefficient to the multiple-regression coefficients:

\[\tilde \beta_1 = \hat \beta_1 + \tilde \delta_1 \cdot \hat \beta_2 \quad (3.23)\]

We used rigorous notation in Assignment 1; here we allow ourselves to loosen up the notation for an easier read. The convention is that we write the name of each equation in parentheses. The theorem states

IF
(equation_model1) colgpa  = gamma0 + gamma1*athlete
(equation_model2) colgpa  = beta0 + beta1*athlete + beta2*sat
(third) sat = delta0 + delta1*athlete            
THEN
(3.23) gamma1 = beta1 + delta1*beta2

(Do note the names of these coefficients, since we will refer to them in the text from now on – just remember that gamma belongs to model1 and beta to model2.)

We obtain the coefficients of the third equation by regressing sat on athlete. Its slope is found using lm(sat ~ athlete, gpa2)$coef[2] and stored below as delta1. Likewise for the other coefficients.

delta1 <- lm(sat ~ athlete, gpa2)$coef[2]
gamma1 <- lm(colgpa ~ athlete, gpa2)$coef[2]
beta1 <- lm(colgpa ~ athlete + sat, gpa2)$coef[2]
beta2 <- lm(colgpa ~ athlete + sat, gpa2)$coef[3]

Below are the coefficients, and a check that 3.23 holds.

delta1
##   athlete 
## -122.0331
gamma1
##    athlete 
## -0.2845336
beta1
##     athlete 
## -0.05061457
beta2
##         sat 
## 0.001916848
#and now we test whether 3.23 holds.
# the expression below should be zero
beta1 + delta1*beta2    -    gamma1
##       athlete 
## -5.551115e-17
#and it does hold (-5.55e-17 is zero up to floating-point rounding error).

Now that we have found the important variable sat, we can conclude that \(0.167\) of the variation in colgpa is explained by athlete and sat together, and that even though the estimated coefficient for athlete is negative it is not statistically significant, so we cannot state that being an athlete affects college GPA in either direction.

(We could also include an ANOVA table to see that model2 indeed explains more than model1, but since we have already reached a conclusion, and ANOVA tables are used extensively in the later questions, we skip that here. Also, just for fun, we ran the model colgpa ~ sat and found that summary(lm(colgpa ~ sat, gpa2))$adj.r.sq returns \(0.1668443\), i.e. very close to the model colgpa ~ athlete + sat, so athlete is not really contributing any explanatory power. In light of this extra check we would drop athlete from the regression.)

Question 5

“Does the standard error for the coefficient of athlete increase or decrease when sat is included in the regression? Why?”

The standard errors can be found in the output from summary(model1) and summary(model2). For convenience we have also extracted them below.

#Std. Error for the coef of athlete in model1 is
se.coef.athlete.model1 <- sqrt(diag(vcov(model1)))[2]
#likewise for model2
se.coef.athlete.model2 <- sqrt(diag(vcov(model2)))[2]
# the two estimates and the difference
se.coef.athlete.model1 
##    athlete 
## 0.04823989
se.coef.athlete.model2
##    athlete 
## 0.04498753
se.coef.athlete.model1 - se.coef.athlete.model2
##     athlete 
## 0.003252366

So the standard error of the athlete coefficient is lower in model2. We look at the formula \[se(\hat\beta_j) = \frac{\hat \sigma}{\sqrt{SST_j(1−R_j^2)}}\] where \[SST_j = \sum_{i=1}^n (x_{ij} - \bar x_j)^2\] is the total sample variation in \(x_j\), \(R_j^2\) is the \(R^2\) from regressing \(x_j\) on all the other independent variables, and \(\hat\sigma\) is the residual standard error of the regression.

Our model1 and model2 have the same \(SST_1\) since \[ SST_1^{model1} = \sum_{i=1}^n (x_{i1} - \bar x_1)^2 = SST_1^{model2}. \]

Let us restate model1 and model2.

(equation_model1) colgpa  = gamma0 + gamma1*athlete
(equation_model2) colgpa  = beta0 + beta1*athlete + beta2*sat

In model2 we add sat. In model1 there are no other regressors, so \(R_1^2 = 0\); in model2, \(R_1^2 \approx 0.034 > 0\), which by itself inflates the variance of \(\hat\beta_1\). At the same time, \(\hat\sigma\) falls from about 0.656 to 0.601, because sat explains a substantial part of colgpa.

Looking at the equation for \(se(\hat\beta_j)\), the drop in \(\hat\sigma\) outweighs the small increase in \(1/(1-R_1^2)\), so the standard error of the athlete coefficient decreases.
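
A small sketch of the two opposing forces, using quantities already reported above (\(\hat\sigma\) falls from about 0.656 to 0.601, while the variance-inflation term rises only slightly):

sigma1 <- summary(model1)$sigma                                   # approx 0.656
sigma2 <- summary(model2)$sigma                                   # approx 0.601
r2.athlete.on.sat <- summary(lm(athlete ~ sat, gpa2))$r.squared   # approx 0.034
c(sigma1, sigma2, 1 / (1 - r2.athlete.on.sat))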

“Compute each of the determinants of the right-hand side of the expression for the standard errors. Check whether the results you obtain coincide with the standard error reported in the STATA output.”

Since we are only going to calculate it for \(j=1\), we can write that into the formula. For a simple regression, the standard error can also be computed using the formula given in Assignment 1: \[se(\hat \beta_1) = \sqrt{\widehat{\textrm{Var}}(\hat{\beta_1})} = \sqrt{\frac{n \hat{\sigma}^2}{n\sum x_i^2 - (\sum x_i)^2}}\]

These elements are found in the ANOVA-table.

anova(model1)
## Analysis of Variance Table
## 
## Response: colgpa
##             Df  Sum Sq Mean Sq F value    Pr(>F)    
## athlete      1   14.97 14.9696   34.79 3.966e-09 ***
## Residuals 4135 1779.23  0.4303                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Now we can compute \(se(\hat \beta_1)\).

# now I take out elements of the anova table
model1_hatsigmau <- anova(model1)[[3]][2] # Mean Sq for Residuals (column 3, row 2), i.e. sigma-hat squared
numerator1 <- n*model1_hatsigmau
sumathlete <- sum(gpa2$athlete)
#note that athlete is a dummy, so sum(x^2) = sum(x).
sumathletesq <- sumathlete
meanathlete <- mean(gpa2$athlete)
denominator1 <- n*sumathletesq - (sumathlete)^2 
# se(beta_1) for model1 = sqrt(numerator1 /  denominator1) 
se.coef.athlete.model1_manually <- sqrt(numerator1 /  denominator1)
# rounded to 6 digits they are the same
round(se.coef.athlete.model1_manually, 6)
## [1] 0.04824
round(se.coef.athlete.model1, 6)
## athlete 
## 0.04824

So we have the correct result.


The formula above, however, only works for simple linear regression. In the multiple-regression case we use \(se(\hat\beta_j) = \frac{\hat \sigma}{\sqrt{SST_j(1−R_j^2)}}\), i.e. for \(j=1\): \[se(\hat\beta_1) = \frac{\hat \sigma}{\sqrt{SST_1(1−R_1^2)}}.\]

anova(model2)
## Analysis of Variance Table
## 
## Response: colgpa
##             Df  Sum Sq Mean Sq F value    Pr(>F)    
## athlete      1   14.97  14.970  41.421 1.367e-10 ***
## sat          1  285.20 285.201 789.156 < 2.2e-16 ***
## Residuals 4134 1494.03   0.361                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Let us extract the Sum Sq attributed to athlete (in both models, for completeness).

anova(model1)[1,2]
## [1] 14.96959
anova(model2)[1,2]
## [1] 14.96959

PS. Indeed the Sum Sq attributed to athlete is the same in both models.

Secondly, we compute \(R_1^2\).

From theory we know that \(R_j^2\) is the \(R^2\) from regressing \(x_j\) on all the other independent variables. Our \(x_j = x_1\) in this case is athlete. Our “all other independent variables” is simply sat.

m.athletesat <- lm(athlete ~ sat, gpa2)
r.sq.1.model2 <- summary(m.athletesat)$r.sq
r.sq.1.model2
## [1] 0.03425972

So \(R_1^2\) for model2 is 0.03426.

hatsigmausq_model2 <- anova(model2)[[3]][[3]]
hatsigmau_model2 <- sqrt(hatsigmausq_model2)

SST1 <- anova(model2)[[2]][[1]]  
#note: this is the Sum Sq attributed to athlete in the colgpa ANOVA,
#not the total sample variation in athlete itself

#as a check, we print the values
c(hatsigmausq_model2, hatsigmau_model2, SST1)
## [1]  0.3613995  0.6011651 14.9695925

Now we try to apply the formula \(se(\hat\beta_1) = \frac{\hat \sigma}{\sqrt{SST_1(1−R_1^2)}}\):

se.beta1.m2 <- hatsigmau_model2 / (SST1 - (1-r.sq.1.model2))
se.beta1.m2 - se.coef.athlete.model2
##      athlete 
## -0.002058972

So our manual estimate turns out to be off, although the difference is small. There are two reasons: the code above divides by \(SST_1 - (1-R_1^2)\) instead of by \(\sqrt{SST_1(1-R_1^2)}\), and the \(SST_1\) we extracted is the Sum Sq for athlete from the colgpa ANOVA rather than the total sample variation \(\sum_i (athlete_i - \overline{athlete})^2\) that the formula calls for.
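
For completeness, a minimal sketch of the computation that follows the formula literally, using the total variation in athlete itself (the names SST1.athlete and se.beta1.corrected are ours); it should reproduce the reported standard error:

SST1.athlete <- sum((gpa2$athlete - mean(gpa2$athlete))^2)   # total sample variation in athlete
se.beta1.corrected <- hatsigmau_model2 / sqrt(SST1.athlete * (1 - r.sq.1.model2))
se.beta1.corrected - se.coef.athlete.model2                  # should be zero up to floating-point error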

“Using the insights you got from the procedure above, explain in words why adding an additional variable changes the standard errors for the estimated coefficients.”

This we explained in the first part of this question (“Does the standard error for the coefficient of athlete increase or decrease when sat is included in the regression? Why?”): adding a variable changes both \(\hat\sigma\) (which falls if the new variable explains part of the dependent variable) and \(R_j^2\) (which rises, here from 0 to about 0.034), and the net effect of these two forces determines whether the standard error goes up or down. In our case the fall in \(\hat\sigma\) dominates, so the standard error decreases.

Question 6

Please note that we answer the two questions in reverse order here, so that we are able to define formulas before we calculate.

Why would we sometimes prefer the adjusted over the unadjusted R-square?

\(R^2\) measures how much of the variation in the dependent variable is (linearly) explained by the independent variables.
Let \[SSE = \textrm{SS Errors} = \sum_i^n (y_i- \hat y_i)^2\] (i.e. what Lindqvist calls “SSR” as in “SS Residual”, but we prefer this notation, as it is standard in the statistical and mathematical literature). The regression sum of squares, also called the explained sum of squares, is \(SSR=\sum_i^n (\hat y_i -\bar{y})^2\), and the total sum of squares is \(SST=\sum_i^n (y_i-\bar{y})^2\). In summary, \(SST = SSE + SSR\).
With this notation we can define \[R^2 = 1 - \frac{SSE}{SST}.\] As independent variables are added, \(SSR\) keeps rising; since \(SST\) is fixed, \(SSE\) goes down and \(R^2\) keeps rising – no matter how valuable the added variables actually are.

The Adjusted R squared is defined as \[R^2_{adj} = R^{2}-(1-R^{2}){p \over n-p-1}\] where \(p\) is the total number of regressors in the model and \(n\) is the sample size.

The adjusted R squared \(R^2_{adj}\) attempts to account for “statistical shrinkage”. Models with many predictors tend to perform better in sample than out of sample, and \(R^2_{adj}\) “penalizes” you for adding predictor variables that do not improve the existing model. It can thus be more helpful in model selection. With a single predictor the difference between \(R^2_{adj}\) and \(R^2\) is usually tiny; as you add variables, \(R^2_{adj}\) falls further below \(R^2\) unless the added variables pull their weight.

Suppose the STATA output did not include the values for R-square and adjusted R-square. Show explicitly how you would compute the R-square and the adjusted R-square from the STATA output.

It is not specified in the question exactly what we have at hand for the calculation, but a reasonable guess is that we have the Sums of Squares for our model, since a typical Stata output looks like this:

. reg colgpa athlete sat

      Source |       SS       df       MS              Number of obs =    4137
-------------+------------------------------           F(  2,  4134) =  415.29
       Model |  300.170186     2  150.085093           Prob > F      =  0.0000
    Residual |  1494.02549  4134  .361399489           R-squared     =  0.1673
-------------+------------------------------           Adj R-squared =  0.1669
       Total |  1794.19567  4136  .433799728           Root MSE      =  .60117

------------------------------------------------------------------------------
      colgpa |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     athlete |  -.0506146   .0449875    -1.13   0.261    -.1388143    .0375852
         sat |   .0019168   .0000682    28.09   0.000     .0017831    .0020506
       _cons |   .6800709   .0713403     9.53   0.000     .5402056    .8199362
------------------------------------------------------------------------------

\[R^2 = 1 - \frac{SSE}{SST} = 1 - \frac{1494}{1794} \approx 0.16722\]

\[R^2_{adj} = R^{2}-(1-R^{2}){p \over n-p-1} \approx 0.16722 - (1-0.16722){2 \over 4134 } \approx 0.16682\]

Note that \(0.16722 > 0.16682\) and that small rounding errors have occurred.
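
As a cross-check, a small sketch in R that recomputes both quantities from the residuals; it should reproduce the 0.1673 and 0.1669 reported by summary(model2):

SSE <- sum(resid(model2)^2)
SST <- sum((gpa2$colgpa - mean(gpa2$colgpa))^2)
R2 <- 1 - SSE/SST
R2.adj <- R2 - (1 - R2) * 2 / (4137 - 2 - 1)
c(R2, R2.adj)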

Question 7

Investigate whether the linear specification for the variable sat is appropriate. Create a new variable (call it lsat) using the logarithmic transformation for sat and reestimate model (2) as it follows: \[colgpa = \beta_0 + \beta_1 athlete + \beta_2 lsat + u \quad (3)\] Do the coefficient for sat in (2) and lsat in (3) differ?

#create variable
lsat <- log(gpa2$sat)
#regress model3
model3 <- lm(colgpa ~ athlete + lsat, gpa2)
#extract coefficients
model3$coef[2]
##     athlete 
## -0.02740717
#compare to model2
model2$coef[2]
##     athlete 
## -0.05061457
model3$coef[2] - model2$coef[2]
##    athlete 
## 0.02320741

The estimates for the athlete coefficient are indeed slightly different between the two specifications (although we would have to run a test to see whether the difference is statistically different from zero). The coefficients on sat (0.0019) and lsat (1.891) are of course on completely different scales, since lsat is the logarithm of sat, so their numerical values are not directly comparable.

“What about the R-squared?”

Let us compare model2 and model3. The summary table from model3 is

stargazer(model3, type = "html")
                      Dependent variable: colgpa
---------------------------------------------------
athlete               -0.027
                      (0.045)
lsat                  1.891***
                      (0.069)
Constant              -10.450***
                      (0.482)
---------------------------------------------------
Observations          4,137
R2                    0.159
Adjusted R2           0.159
Residual Std. Error   0.604 (df = 4134)
F Statistic           391.189*** (df = 2; 4134)
---------------------------------------------------
Note: *p<0.1; **p<0.05; ***p<0.01

The p-value for the coefficient on athlete is given by summary(model3)$coef[2,4]; it is well above any conventional significance level, so athlete is again not significant.

If we compare them, we see that model2 has \(R^2 = 0.167\) (adjusted 0.167) while model3 has \(R^2 = 0.159\) (adjusted 0.159).

“Which of the two specifications would you chose for your empirical work?”

Our model2 has the higher \(R^2_{adj}=\) Adjusted R-squared (and higher \(R^2 =\) Multiple R-squared since the models have the same \(n\) and \(p\)). In this case, it is more sensible to use model2.

summary(model2)$adj.r.squared - summary(model3)$adj.r.squared
## [1] 0.008167636

But the difference is small: the gap in \(R^2_{adj}\) is only about 0.008, i.e. less than one percentage point. On this criterion model2 “wins”, but only just.

Furthermore, model2 has an easier interpretation for a non-statistician than model3, since model3 uses \(\log(sat)\). On this criterion too, model2 “wins”, with a small margin.

It is, however, possible that model3 better meets the assumptions of linear regression. If model2 meets them better, it “wins” again. Let us, for argument's sake, say that model3 meets the assumptions better (a plausible scenario, since using log-transformed variables often makes the assumptions on the error terms hold more closely). In that case it is up to our judgement as economists to decide which criterion is more relevant. Here we would argue that, since the difference in R-squared is so small and it is not that hard to explain what log(sat) is, we would choose model3 for our empirical work if it indeed meets the assumptions better.

How would we evaluate the assumptions? Two examples of tools are a residual plot (do the residuals have mean zero? are they homoscedastic?) and a normal quantile-quantile plot of the residuals (are they approximately normally distributed?). We do not carry out this model evaluation here, as it is not part of the question.
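
For the record, a minimal sketch of what such checks could look like for, say, model3 (not run or interpreted here):

plot(fitted(model3), resid(model3), xlab = "fitted values", ylab = "residuals")
abline(h = 0, lty = 2)    # residuals should scatter evenly around zero
qqnorm(resid(model3))
qqline(resid(model3))     # points close to the line suggest roughly normal residuals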

Question 8

Now let’s consider the question whether other factors mediate the relationship between participating in sports activities and GPA. More precisely, looking at the description of the dataset, the variable tothrs should capture our attention. Add the variable tothrs to regression (2) so that we have: \[colgpa = \beta_0 + \beta_1athlete + \beta_2 sat + \beta_3tothrs + u \quad (4)\] Report the output, interpret all the coefficients, and comment on how the estimated coefficients vary between model (2) and model (4).

model4 <- lm(colgpa ~ athlete + sat + tothrs, gpa2)
stargazer(model4, type = "html")
                      Dependent variable: colgpa
---------------------------------------------------
athlete               -0.055
                      (0.044)
sat                   0.002***
                      (0.0001)
tothrs                0.003***
                      (0.0003)
Constant              0.547***
                      (0.072)
---------------------------------------------------
Observations          4,137
R2                    0.186
Adjusted R2           0.185
Residual Std. Error   0.595 (df = 4133)
F Statistic           313.977*** (df = 3; 4133)
---------------------------------------------------
Note: *p<0.1; **p<0.05; ***p<0.01

If athlete = 1 we expect colgpa to be 0.055 lower, other things equal (but this coefficient is not significant). For each additional point on the sat score we expect colgpa to be 0.002 higher, and for each additional hour of classes (tothrs) we expect colgpa to be about 0.003 higher.

Now let us compare model2 vs model4.

Below we see the two models side by side. All estimates are significant except athlete (as seen by their p-value).

stargazer(model2, model4, type = "html")
                      Dependent variable: colgpa
                      ------------------------------------------------
                      (1)                         (2)
----------------------------------------------------------------------
athlete               -0.051                      -0.055
                      (0.045)                     (0.044)
sat                   0.002***                    0.002***
                      (0.0001)                    (0.0001)
tothrs                                            0.003***
                                                  (0.0003)
Constant              0.680***                    0.547***
                      (0.071)                     (0.072)
----------------------------------------------------------------------
Observations          4,137                       4,137
R2                    0.167                       0.186
Adjusted R2           0.167                       0.185
Residual Std. Error   0.601 (df = 4134)           0.595 (df = 4133)
F Statistic           415.289*** (df = 2; 4134)   313.977*** (df = 3; 4133)
----------------------------------------------------------------------
Note: *p<0.1; **p<0.05; ***p<0.01

Either by looking at the table above or at the calculations below, we see that (a) the coefficient for athlete is slightly different between the models, and (b) the coefficient for sat is almost the same in both models.

#difference between athlete estimates (extract row 2 column 1)
summary(model2)$coef[2,1] - summary(model4)$coef[2,1]
## [1] 0.004184207
#difference between sat estimates (extract row 3 column 1)
summary(model2)$coef[3,1] - summary(model4)$coef[3,1]
## [1] 3.559038e-07

To summarize: when adding tothrs to the model colgpa ~ athlete + sat we get an increase in \(R^2_{adj}\) and some small changes in the coefficients.

Question 9

We split this question into three parts: a, b, c.

9a

9.a) Using model (2), test (use the proper command in STATA) whether the coefficient of athlete is different from zero.

We use the model \[colgpa_i = \beta_0 + \beta_1 athlete_i + \beta_2 sat_i + \epsilon_i \quad (2)\]

The reader who knows her regression output tables has already answered this question herself, since summary(model2) from Question 4 is the table we will look at now.

There we can see that the estimate for athlete has a p-value of 0.261 and sat has a p-value below 2e-16. At the usual significance levels of 10%, 5% or 1%, the estimate for athlete is not significant while the estimate for sat is. (Had we done a one-sided test instead of the two-sided test used here, the conclusions would have been the same.) This means that in model2 we can reject \(H_0: \beta_2 = 0\) but we fail to reject \(H_0: \beta_1 = 0\); in other words, \(\beta_2\) is statistically significantly different from zero whereas \(\beta_1\) is not. The intuition behind the rejection for sat is that if its null hypothesis were true, observing an estimate as far from zero (relative to its standard error) as \(\hat \beta_2 = 0.0019\) would be extremely unlikely, so we do not believe that null and reject it; the estimate \(\hat \beta_1 = -0.0506\), by contrast, is well within what sampling variation alone could produce under its null, so we do not reject.

To be more mathematically precise, the test done (by our statistical software R) is \[H_0: \beta_j = \beta_j^{null} = 0 \quad \text{vs.} \quad H_1: \beta_j \neq \beta_j^{null}\] for \(j = 1, 2\).

The test statistic used for this test is \[T_0=\frac{\hat{\beta}_1-\beta_1^{null}}{se(\hat{\beta}_1)}\] and under \(H_0\) we have that \(T_0 \sim t_{n-p-1}\), where \(n-p-1\) is the degrees of freedom and \(p\) is the number of regressors (in simple linear regression \(p=1\); here \(p=2\)).

In other words, the \(t\) distribution is used to test the two-sided hypothesis that the true slope \(\beta_1\) equals some constant value, \(\beta_1^{null}\).

When done by hand, you calculate \(t_{obs}\) and take its absolute value. We compare it to a critical value \(t_{critical}\) which depends on the degrees of freedom and the chosen significance level \(\alpha\) (these values can be found in t-tables). The rule is that if \(|t_{obs}| > t_{critical}\) we reject the null hypothesis, and otherwise we do not.
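
As a sketch, the by-hand calculation for the athlete coefficient in model2; the two-sided p-value should match the 0.261 reported above:

t.obs <- summary(model2)$coef[2, 1] / summary(model2)$coef[2, 2]  # estimate / std. error
2 * pt(-abs(t.obs), df = df.residual(model2))                     # two-sided p-value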

We do a two-sided test. See discussion in the introduction.

An equivalent way of testing is to use confidence intervals (the mathematics behind them is almost identical) instead of testing via \(t_{obs}\). For any parameter \(\beta_j\) we pick \(\alpha = 5\%\) and compute the interval; if the interval for \(\hat \beta_j\) does not contain \(\beta_j^{null}\) we reject \(H_0\). Often \(\beta_j^{null} = 0\).

confint(model2)
##                    2.5 %      97.5 %
## (Intercept)  0.540205566 0.819936201
## athlete     -0.138814325 0.037585180
## sat          0.001783071 0.002050625

In our case, we see that the interval for \(\beta_1\) (athlete) contains zero, whereas the interval for \(\beta_2\) (sat) does not. So at our significance level \(\alpha = 5\%\) we conclude that \(\beta_2\) in model2 is statistically significantly different from zero, whereas \(\beta_1\) in model2 is not.

9b

9.b) Then, using the same model, test also whether the coefficient of sat is statistically different from 0.1. (We check this both by asking whether the confidence interval covers 0.1 and by running a proper test of \(H_0\).)

The easiest way to do this is to test the hypothesis using a confidence interval, because then we do not have to recalculate anything. In the case above \(\beta_j^{null}=0\) for \(j=1,2\), but here we ask whether the coefficient differs from \(0.1\), so we set up the null \[H_0: \beta_2 = \beta_2^{null} = 0.1.\] The rule is the same: if our confidence interval for \(\beta_2\) does not contain the value \(\beta_2^{null} = 0.1\) we reject \(H_0\), otherwise we do not. Since the interval was already computed above, this method is convenient. The interval is \([0.00178, 0.00205]\) and it does not contain \(0.1\), so we reject our new \(H_0\).

It can, of course, also be computed using the t-method, which takes a little more time. Simply take \(t_{obs} = \frac{\hat \beta_2 - \beta_2^{null} }{se(\hat \beta_2)} = \frac{1.917e{-}03 - 0.1 }{6.823e{-}05} \approx -1437.\) Since \(|t_{obs}|\) is much larger than \(t_{crit}\), we reject \(H_0\) with this method as well.
We double-check our calculation:

(summary(model2)$coef[3,1] - 0.1 ) / summary(model2)$coef[3,2]
## [1] -1437.435
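
Equivalently, a one-line sketch of the interval check; it should return FALSE, i.e. 0.1 lies outside the interval, so \(H_0\) is rejected:

ci.sat <- confint(model2)["sat", ]
0.1 >= ci.sat[1] & 0.1 <= ci.sat[2]   # FALSE: 0.1 is not covered by the interval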

9c

9.c) Finally, using model (4), test whether the variables athlete, sat and tothrs are jointly significant.

A first guess

First of all, what does it mean that the variables are jointly significant? Roughly, that together they explain a relatively “large” part of the total variation SST. In the summary of model4 we see that both sat and tothrs are individually significant, and the adjusted R-squared is 0.185, so we suspect that the variables athlete, sat and tothrs are jointly significant.

Let’s use statistics to test if the variables are indeed jointly significant.

The need for a baseline

In Lec3slide21to25 Lindqvist shows how to do this, but he uses a baseline. By baseline we mean that he takes a restricted model “r” as his starting point and compares it to an unrestricted model “ur”. He does this comparison in order to test whether the variables that are in “ur” but not in “r” are jointly significant, i.e. whether the variables he added (when going from the baseline “r” to “ur”) explain relatively much of the variation.

This Question 9, however, asks us to test whether the variables in model4 are jointly significant, but it does not specify compared to what. To make the analogy to Erik's case: his model “ur” is our model4, but his model “r” is not specified for us. So what is our baseline “r” model? Because the question is vague on this point, we need to define a baseline ourselves.

Once that baseline has been established, we can compare the baseline to the variables and see if the variables (athlete, sat, tothrs) are jointly significant.

How do we establish a baseline?

Possible baselines are: hsperc, hsrank, white, female, verbmath, hsize.

By the way, here are the correlations between our possible x-variables and colgpa.

cor(gpa2$colgpa, gpa2) 
##            sat    tothrs colgpa     athlete     verbmath       hsize
## [1,] 0.4087123 0.1346072      1 -0.09134191 -0.002438258 -0.02855914
##          hsrank    hsperc    female     white      black     hsizesq
## [1,] -0.3353691 -0.428533 0.1072948 0.1328592 -0.1502972 -0.04706429

Let us pick female as the baseline.

Model 4 and Model 44

Comparing model4 to a baseline is not really what Lindqvist did with “r” vs “ur”. In his example the baseline “r” contained ed, and “ur” contained ed iq gpa. This was used to test if iq and gpa are jointly significant. This is the right way to do it, so we will adapt it for our case. Note that model “ur” contained ed as well, and that the variables of interest were iq gpa.

To generalize Lindqvist’s approach, what we do is to create a baseline model y = OurBaselineVariable and compare it to y = OurBaselineVariable + x1 + x2 + x3 in order to find out whether x1, x2 and x3 are jointly significant.

Question 9c is vaguely written (perhaps intentionally, to confuse students who do not know their econometrics well enough). It asks us to use model4, but since we have established that we need a baseline, and have picked female as that baseline, model4 cannot be used directly if we follow Lindqvist’s approach: model4 does not share any variable with the baseline. Therefore we introduce the important model44 = model4 + OurBaselineVariable, starting with OurBaselineVariable = female. Note that we had to introduce model44 precisely in order to follow Lindqvist’s approach.

Comparing our model to baseline

Model44 is colgpa ~ female + athlete + sat + tothrs. Our baseline model is colgpa ~ female. Below is the summary comparison.

baseline.female <- lm(colgpa ~ female, gpa2)
model44 <- lm(colgpa ~ female + athlete + sat + tothrs, gpa2)
stargazer(baseline.female, model44, type = "html")
                      Dependent variable: colgpa
                      -----------------------------------------------
                      (1)                        (2)
---------------------------------------------------------------------
female                0.142***                   0.224***
                      (0.020)                    (0.019)
athlete                                          0.014
                                                 (0.044)
sat                                              0.002***
                                                 (0.0001)
tothrs                                           0.002***
                                                 (0.0003)
Constant              2.589***                   0.305***
                      (0.014)                    (0.074)
---------------------------------------------------------------------
Observations          4,137                      4,137
R2                    0.012                      0.213
Adjusted R2           0.011                      0.212
Residual Std. Error   0.655 (df = 4135)          0.585 (df = 4132)
F Statistic           48.157*** (df = 1; 4135)   279.459*** (df = 4; 4132)
---------------------------------------------------------------------
Note: *p<0.1; **p<0.05; ***p<0.01

Below is the Sum of Squares comparison

aov(baseline.female)
## Call:
##    aov(formula = baseline.female)
## 
## Terms:
##                    female Residuals
## Sum of Squares    20.6551 1773.5406
## Deg. of Freedom         1      4135
## 
## Residual standard error: 0.6549118
## Estimated effects may be unbalanced
aov(model44)
## Call:
##    aov(formula = model44)
## 
## Terms:
##                    female   athlete       sat    tothrs Residuals
## Sum of Squares    20.6551   11.8629  320.4518   29.0649 1412.1610
## Deg. of Freedom         1         1         1         1      4132
## 
## Residual standard error: 0.5846042
## Estimated effects may be unbalanced

By hand

In general the statistic \[F = \frac{(SSE_1 - SSE_2)/(q)}{(SSE_2)/(n-k-1)} \sim F(q, n-k-1)\] where \(SSE_j\) is the SS Error for model \(j\) and \(j=\{ 1,2 \}\). In other words our statistic, the random variable \(F\) that we constructed, follows an \(F(q, n-k-1)\)-distribution.

Our goal is to reject \(H_0: \beta_1 = \beta_2 = \beta_3 = 0\) in favour of \(H_1:\) “at least one of the betas is not zero”. (Note that \(H_0\) and \(H_1\) are collectively exhaustive and mutually exclusive.) We want to reject it because then we can say that the coefficients are statistically significantly different from zero; in other words, that the variables matter.

The rule is, as usual, to reject the null if \(F_{obs} > F_{crit}\) at some given sign. level \(\alpha\). We pick \(\alpha = 0.05\).

We compute the F-test below.

#manual F-test at alpha = 0.05
thenumerator <- (1773.54 - 1412.16)/3      # (SSE_baseline - SSE_model44)/q
thedenominator <- 1412.161 / (4137-4-1)    # SSE_model44/(n-k-1), with k = 4 regressors in model44
Fobs <- thenumerator / thedenominator
Fcrit <- qf(1-0.05, 3, 4137-4-1)           # qf() plays the role of the F table
c(Fobs, Fcrit)
## [1] 352.467403   2.607058
Fobs > Fcrit  #If this returns TRUE then reject H_0.
## [1] TRUE

We calculate \[F_{obs} = \frac{(SSE_{baseline} - SSE_{model44})/q}{SSE_{model44}/(n-k-1)} = \frac{(1773.54 - 1412.16)/3}{1412.16/4132} \approx 352.47\] where we have used the facts that

  • \(q = 3\), since model44 has 4 regressors and the baseline model has 1;
  • SSE for the baseline model = 1773.54 (from the table above);
  • SSE for model44 = 1412.16;
  • \(n - k - 1 = 4137 - 4 - 1 = 4132\).

Now compare \(F_{obs} = 352.47\) to \(F_{crit} = F_{0.05}(3, 4132) \approx 2.6\). We reject \(H_0\) since \(352.47 > 2.6\). So we can conclude that the variables athlete, sat and tothrs are jointly significant.

Using ANOVA

It is cumbersome to do all this by hand. Let us create an ANOVA-table. The table contains

  • all the components needed to calculate our F-statistic (the r.h.s of our formula for F above)
  • the actual Fobs (the l.h.s of our formula for F above)
  • the p-value of \(F_{obs}\) (so we do not have to choose an alpha level, compute \(F_{crit}\) and compare; we can instead look directly at the p-value and judge whether \(H_0\) can be rejected).

Note that \(RSS = Residual SS = SS Error = \sum_i^n (y_i- \hat y)^2\).

anova(  lm(colgpa ~ female, gpa2) ,
        lm(colgpa ~ athlete + sat + tothrs + female, gpa2)
      )
## Analysis of Variance Table
## 
## Model 1: colgpa ~ female
## Model 2: colgpa ~ athlete + sat + tothrs + female
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1   4135 1773.5                                  
## 2   4132 1412.2  3    361.38 352.47 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Since the p-value is less than 2.2e-16 we reject \(H_0\) just as before. Once again we can conclude that the variables athlete + sat + tothrs are jointly significant.
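
As a side note, a natural choice of restricted model is the intercept-only model colgpa ~ 1; comparing it to model4 tests exactly the joint significance of athlete, sat and tothrs, and the resulting F statistic is the overall F already reported in the regression table (313.977 with 3 and 4133 df). A minimal sketch:

anova(lm(colgpa ~ 1, gpa2), model4)   # F should equal the overall F from summary(model4)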

One step deeper - with an arbitrary baseline

We picked the female-variable as a baseline. This choice was arbitrary. What happens if you pick another baseline? The command we will run is

anova(lm(colgpa ~ OurBaselineVariable, gpa2) ,
      lm(colgpa ~ athlete + sat + tothrs + OurBaselineVariable, gpa2)
      )

where OurBaselineVariable \(= { female , hsperc , hsrank , white , verbmath , hsize , hsizesq , black }\)

anova(  lm(colgpa ~ female, gpa2) ,
        lm(colgpa ~ athlete + sat + tothrs + female, gpa2)
      )
## Analysis of Variance Table
## 
## Model 1: colgpa ~ female
## Model 2: colgpa ~ athlete + sat + tothrs + female
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1   4135 1773.5                                  
## 2   4132 1412.2  3    361.38 352.47 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(  lm(colgpa ~ hsperc, gpa2) ,
        lm(colgpa ~ athlete + sat + tothrs + hsperc, gpa2)
      )
## Analysis of Variance Table
## 
## Model 1: colgpa ~ hsperc
## Model 2: colgpa ~ athlete + sat + tothrs + hsperc
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1   4135 1464.7                                  
## 2   4132 1282.5  3    182.22 195.69 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova( lm(colgpa ~ hsrank, gpa2) ,
        lm(colgpa ~ athlete + sat + tothrs + hsrank, gpa2) 
      )
## Analysis of Variance Table
## 
## Model 1: colgpa ~ hsrank
## Model 2: colgpa ~ athlete + sat + tothrs + hsrank
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1   4135 1592.4                                  
## 2   4132 1344.4  3       248 254.08 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(  lm(colgpa ~ white, gpa2) ,
        lm(colgpa ~ athlete + sat + tothrs + white, gpa2)
      )
## Analysis of Variance Table
## 
## Model 1: colgpa ~ white
## Model 2: colgpa ~ athlete + sat + tothrs + white
##   Res.Df    RSS Df Sum of Sq     F    Pr(>F)    
## 1   4135 1762.5                                 
## 2   4132 1457.9  3    304.63 287.8 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(  lm(colgpa ~ verbmath, gpa2) ,
        lm(colgpa ~ athlete + sat + tothrs + verbmath, gpa2)
      )
## Analysis of Variance Table
## 
## Model 1: colgpa ~ verbmath
## Model 2: colgpa ~ athlete + sat + tothrs + verbmath
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1   4135 1794.2                                  
## 2   4132 1461.2  3       333 313.89 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(  lm(colgpa ~ hsize, gpa2) , 
        lm(colgpa ~ athlete + sat + tothrs + hsize, gpa2)
      )
## Analysis of Variance Table
## 
## Model 1: colgpa ~ hsize
## Model 2: colgpa ~ athlete + sat + tothrs + hsize
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1   4135 1792.7                                  
## 2   4132 1457.1  3    335.65 317.28 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(  lm(colgpa ~ hsizesq, gpa2) , 
        lm(colgpa ~ athlete + sat + tothrs + hsizesq, gpa2)
      )
## Analysis of Variance Table
## 
## Model 1: colgpa ~ hsizesq
## Model 2: colgpa ~ athlete + sat + tothrs + hsizesq
##   Res.Df    RSS Df Sum of Sq     F    Pr(>F)    
## 1   4135 1790.2                                 
## 2   4132 1455.2  3    335.02 317.1 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(  lm(colgpa ~ black, gpa2) , 
        lm(colgpa ~ athlete + sat + tothrs + black, gpa2)
      )
## Analysis of Variance Table
## 
## Model 1: colgpa ~ black
## Model 2: colgpa ~ athlete + sat + tothrs + black
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1   4135 1753.7                                  
## 2   4132 1455.8  3    297.84 281.78 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Even though the F-statistics differ across baselines (ranging from about 196 to 352), all of them have p-values below 2.2e-16, so every baseline leads to the same rejection. Hence it does not matter which baseline we pick.

Conclusion

The variables athlete, sat, tothrs are indeed jointly significant.

Question 10

Compute the beta coefficient for the variable sat. Why or why not it may be useful to consider the beta coefficient in this case?

By “this case” we take the question to refer to the most recent model, i.e. model4. The coefficient for sat in model4 is

model4
## 
## Call:
## lm(formula = colgpa ~ athlete + sat + tothrs, data = gpa2)
## 
## Coefficients:
## (Intercept)      athlete          sat       tothrs  
##    0.547373    -0.054799     0.001916     0.002522
summary(model4)$coef[3,c(1,3,4)] #low p-value
##      Estimate       t value      Pr(>|t|) 
##  1.916492e-03  2.839715e+01 3.325809e-162
#compare coef for athlete with sat in size
summary(model4)$coef[2,1] / summary(model4)$coef[3,1] 
## [1] -28.59327

This coefficient tells us that a one-point increase in the sat score is associated with an expected increase of \(\hat \beta_2 = 0.001916\) in colgpa, holding athlete and tothrs fixed.

Statistically significant is not the same as relevant. The variable is statistically significant (so we are confident it has some effect on colgpa), but is it relevant? Is the sat score really important in predicting colgpa? Simply comparing the raw estimate to \(\hat \beta_1\) (it is about 28 times smaller) is misleading, because sat is measured in points on a scale of several hundred while athlete is a 0/1 dummy. This is exactly why the beta (standardized) coefficient is useful in this case: it puts the effect on a common scale by asking how many standard deviations colgpa changes when sat changes by one standard deviation.
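
A minimal sketch of that computation, in Wooldridge's sense of a standardized coefficient (the name beta.sat.std is ours):

beta.sat.std <- coef(model4)["sat"] * sd(gpa2$sat) / sd(gpa2$colgpa)
beta.sat.std   # effect of a one-sd increase in sat, measured in sd's of colgpa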

Question 11

Regress athlete on the other set of regressors in equation (4), namely sat and tothrs. Report the output and save the residuals \(\hat \epsilon\) that you obtain from the regression.

We interpret “report the output” as a request to graph the residuals. Do note that athlete is a dummy variable.

modelast <- lm(athlete ~ sat + tothrs, gpa2)
res.modelast <- residuals(modelast)
#A density plot
plot(density(res.modelast)) 

From the density plot the residuals look roughly normally distributed, except for the bump over [0.8, 1.0], which is the only region to the right of 0.2 where the density is positive (it corresponds to the athletes, for whom athlete minus the fitted value is close to one).

#A residual plot on our y vs residuals
plot(gpa2$colgpa,res.modelast,xlab="colgpa")

In the quantile plot below we again see that a jump occurs: it looks like a mixture of two distributions, one for each value of the dummy. This is also apparent in the residual plot above. So the normality assumption for these residuals is not valid.

# A quantile normal plot for checking normality
qqnorm(res.modelast)  
qqline(res.modelast)

Then regress colgpa on \(\hat \epsilon\) such that \(colgpa= \alpha_0 +\alpha_1 \hat \epsilon + v \quad (5)\)

model5 <- lm(colgpa ~ res.modelast, gpa2)
summary(model5)
## 
## Call:
## lm(formula = colgpa ~ res.modelast, data = gpa2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.65621 -0.44623  0.00316  0.46555  1.40259 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.65269    0.01024 259.058   <2e-16 ***
## res.modelast -0.05480    0.04929  -1.112    0.266    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6586 on 4135 degrees of freedom
## Multiple R-squared:  0.0002988,  Adjusted R-squared:  5.707e-05 
## F-statistic: 1.236 on 1 and 4135 DF,  p-value: 0.2663

Compare \(\hat \alpha_1\) from equation (5) to \(\hat \beta_1\) from equation (4). Explain in words why you get this result.

The question asks us to compare the coefficient for res.modelast in model5 with the coefficient for athlete in model4. The coefficients are:

summary(model5)$coef[2,1] #coef for res.modelast in model5
## [1] -0.05479878
summary(model4)$coef[2,1] #coef for athlete in model4
## [1] -0.05479878

So we see that \(\hat \alpha_1 = -0.05479878 = \hat \beta_1\).

Why is this? We will try to explain the intuition behind the result rather than to derive it mathematically.

colgpa = athlete + sat + tothrs + u (model4)
athlete = sat + tothrs + epsilon (modelast)
now the residuals to modelast are saved in res.modelast
colgpa = res.modelast (model5)

The result states that the coefficient in front of res.modelast is the same as the coefficient in front of athlete. We can break down our regressions using simpler notation where we abuse the + sign:

colgpa = A + S + T
A = S + T + R
colgpa = R

We see that R is the part of A that S and T cannot explain. By construction R is uncorrelated with S and T, so when we regress colgpa on R alone, none of the effect of S and T gets mixed into the coefficient; the regression isolates exactly the same “athlete, holding sat and tothrs fixed” variation that the multiple regression uses. This is the partialling-out interpretation of OLS, and it is why the two coefficients coincide.
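
A small sketch of the key fact behind this result: the residuals from modelast are, by construction, uncorrelated with sat and tothrs, so they carry only the athlete-specific variation:

cor(res.modelast, gpa2$sat)      # zero up to floating-point error
cor(res.modelast, gpa2$tothrs)   # zero up to floating-point error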


  1. Note that the variable athlete is a dummy, i.e. it takes only value 0 or 1.