Lecture 19: Hypothesis Testing II

Are there easy grading professors?

Build a model of grade versus professor. But since there’s no point in having a “reference professor”, let’s suppress the intercept.

grades = read.csv("http://www.mosaic-web.org/go/datasets/grades.csv")
g2n = read.csv("http://www.mosaic-web.org/go/datasets/grade-to-number.csv")
courses = read.csv("http://www.mosaic-web.org/go/datasets/courses.csv")
all = merge(grades, g2n)
all = merge(all,courses)
names(all)
## [1] "sessionID"  "grade"      "sid"        "gradepoint" "dept"      
## [6] "level"      "sem"        "enroll"     "iid"
mod1 = lm(gradepoint ~ iid - 1, data=all)

A quick aside: In the above example, we used the merge() function to ‘merge’ or ‘join’ several datasets together. This will likely be a useful command during your data compilation stage for your final projects, so I’ll go over it briefly now:

help(merge)

OK, getting back to the task at hand. If we have a look at the distribution of p-values:

head(summary(mod1)$coef)
##            Estimate Std. Error  t value      Pr(>|t|)
## iidinst125 3.412500  0.2789609 12.23290  5.874239e-34
## iidinst126 3.505294  0.1353159 25.90452 1.729122e-139
## iidinst128 3.151538  0.1547397 20.36671  7.179466e-89
## iidinst129 3.553333  0.3221163 11.03121  5.402365e-28
## iidinst130 3.660000  0.2495102 14.66874  8.575574e-48
## iidinst132 3.735714  0.1054373 35.43066 3.358483e-247
# Grab the 4th column which contains the p-values
pvals1 = summary(mod1)$coef[, 4]
tally(~pvals1 < 0.05)
## 
##  TRUE FALSE 
##   359     0

We find that every one of the p-values is less than 0.05. Why is this?

It is important to consider what Null Hypothesis is being tested here? On coefficients, the null of no effect corresponds to a coefficient of zero. But no professor has a mean grade-point of zero, so we can tweak things by centering the grades on zero. That is, subtract out the mean grade so that the null hypothesis value of zero is meaningful:

all$GP = all$gradepoint-mean(all$gradepoint, na.rm=TRUE)
mod2 = lm(GP ~ -1 + iid, data=all)
# Grab the 4th column which contains the `p-values`
pvals2 = summary(mod2)$coef[,4] 
tally(~pvals2 < 0.05)
## 
##  TRUE FALSE 
##    89   270

Ah ha! We now have 89 significant p-values. Let’s look at the distribution of grades:

densityplot(~coef(mod2), groups=pvals2 < 0.05, auto.key=TRUE)

Things to notice:

  • Both positive values and negative grade values away from zero can generate p-values less than 0.05
  • Grades near zero don’t generate significant p-values
  • Some grades away from zero are not significant — that is because there aren’t very many courses for those professors.

The meaning of the t-statistic:

  • How far from zero the coefficient is
  • Measured in units of the standard error
  • We can see this from the regression table
  • Relationship between t and p-value
  • There’s something called the t-distribution which has one parameter

One sided versus two sided tests

When we calculate a p-value, there’s a choice to be made — should we count just one side or both sides in evaluating “as big or more extreme”. Many statistics texts spend a lot of time on this issue. We’re going to ignore it. And follow the Food and Drug Adminstration’s rule: Surprising is surprising, whichever way it goes.

F versus t

Historically, t came first, about 1906. F was introduced in the 1920s as a generalization of t.

In econometrics, researchers tend to use t statistics to evaluate significance. We’ll concentrate mostly on F statistics. The F is a generalization of the t, and the rule is \(t^2 = F\). We can see this pretty simply by doing:

ts = rt(10000, df=10)
Fs = rf(10000, df1=1, df2=10)
par(mfrow=c(1,3))
hist(ts, breaks=50, col='steelblue')
hist(ts^2, breaks=50, col='steelblue')
hist(Fs, breaks=50, col='steelblue')

With t, you need to look at both tails. The software does this automatically for us. With F, only the tail >1 is meaningful, but this corresponds to both tails of the t!

For both distributions, the parameter (df for t, df2 for F) counts the dimension of the residual subspace. This is effectively:

How much data you have to use in estimating the size of the residual vector and therefore the standard error of the statistic.

How to interpret the p-values

Should we take every p-value less than 0.05 to indicate a professor who is a significantly easy or hard grader? Let’s look at the distribution of p-values under our null hypothesis (i.e., that there is no relationship between a particular professor and their grading practices).

mod3 = lm(GP ~ shuffle(iid) - 1, data=all)  # Note that we are 'suffling' the professors
pvals3 = summary(mod3)$coef[,4]  # Grab the 4th column 
hist(pvals3, col='steelblue')

tally(~pvals3 < 0.05)
## 
##  TRUE FALSE 
##    16   346

Under the null hypothesis, the p-values are uniformly distributed between 0 and 1.

The Bonferroni correction

To make up for the fact that there are inevitably false-positive rejections of the null under the null hypothesis (i.e., we should reject it 5% of the time, even if it is true), the Bonferroni correction adjusts the threshold for rejection based on the number of tests. In this example, we have 359 professors, so the Bonferroni-adjusted threshold should be \(\alpha / m\) or 0.05/359:

densityplot(~coef(mod2), groups=pvals2 < (0.05/359), auto.key=TRUE)

tally(~pvals2 < (0.05/359))
## 
##  TRUE FALSE 
##    18   341

Reference

This demo is based directly on material from ‘Statistical Modeling: A Fresh Approach (2nd Edition)’ by Daniel Kaplan.