In almost every fundamental statistics class, whether at the sophomore or the graduate level, we have to deal with hypothesis testing. In this write-up we will see how the classical hypothesis tests (one mean, two means, more than two means) can be carried out with a linear regression approach. Linear regression is just magic!
This is the classic one-sample t-test. Let us run the t-test on some data first, and then show how regression does the same thing.
Last semester I worked as a TA for a natural resource statistics class, so I have a lot of marks data. For demonstration we will use the scores the students obtained on the final exam. The nice thing about these data is that they are approximately normal: the histogram, overlaid with a density plot, looks roughly bell-shaped, and the Anderson-Darling normality test gives a p-value of about 0.2, which is not enough evidence against normality.
## The sample average of the marks: 150.188
## Anderson-Darling normality test : 0.198
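The two summaries above were presumably produced with something like the following (a small sketch, assuming the scores live in Marks$Score and the nortest package is installed):
library(nortest)          # provides ad.test()
mean(Marks$Score)         # sample average, about 150.19
ad.test(Marks$Score)      # Anderson-Darling normality test, p-value about 0.198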
Let us do inference for the population mean score from these data, with the following hypotheses:
\[ H_0: \mu=155\ \ \ \text{and} \ \ H_a:\mu\neq155 \]
Now we can perform a one-sample t-test. Theoretically, we compute a test statistic and use it to make our inference. Specifically, we need the following test statistic:
\[ t=\frac{\bar{y}-\mu_0}{\sqrt{s^2/n}} \]
where \(\mu_0\) is the hypothesized value (here 155).
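As a quick sketch, the statistic and its two-sided p-value can be computed directly from this formula (again assuming the scores are in Marks$Score):
y <- Marks$Score
t_stat <- (mean(y) - 155) / (sd(y) / sqrt(length(y)))   # (ybar - mu0) / (s / sqrt(n))
2 * pt(-abs(t_stat), df = length(y) - 1)                # two-sided p-value, about 0.068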
Assuming we are all familiar with the equation above, let us perform the test with the built-in R function t.test.
t.test(Marks$Score,mu=155)
##
## One Sample t-test
##
## data: Marks$Score
## t = -1.8676, df = 47, p-value = 0.06805
## alternative hypothesis: true mean is not equal to 155
## 95 percent confidence interval:
## 145.0037 155.3713
## sample estimates:
## mean of x
## 150.1875
So we have a p-value of 0.06805. At the 5% level we fail to reject the null hypothesis, but at the 10% level we reject the null and conclude that the population mean is not equal to 155.
Now we will do the same task using regression. The task is very simple. The linear model for a single population is:
\[ y_i=\mu+\epsilon_i,\ i=1,2,\dots, n \]
which is just the regression model with only an intercept, \(y_i=\beta_0+\epsilon_i\). In both cases the errors are assumed to be normal. The regression procedure in R reports a p-value for the intercept, but only for the following test:
\[ H_0:\beta_0=0\ \ \ \ H_a: \beta_0\neq 0 \]
The hypothesized value in our t-test is different, however. We can rewrite the null as \(\mu-155=0\), so we shift the data points to force the hypothesized value to be zero and then run the regression. Let us do that:
summary(lm((Marks$Score-155)~1 ))
##
## Call:
## lm(formula = (Marks$Score - 155) ~ 1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -36.188 -15.188 3.313 13.062 29.812
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.813 2.577 -1.868 0.0681 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.85 on 47 degrees of freedom
The regression summary shows a p-value for the intercept that is exactly the same as the one from the t-test. In fact, the exact same t-test was performed here, just in a slightly different setting.
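The 95% confidence interval reported by t.test can also be recovered from the shifted regression by adding the hypothesized value back (a quick check on the same model):
confint(lm((Marks$Score - 155) ~ 1)) + 155   # matches the t.test interval (145.00, 155.37)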
For comparing two means we can have two situations, depending on whether the variances of the two groups are assumed equal. If the variances are assumed to be equal but unknown, we do the familiar t-test using the pooled variance. We get the following test statistic:
\[ t=\frac{\bar{y}_1-\bar{y}_2}{\sqrt{s_p^2(1/n_1+1/n_2)}} \]
where,
\[ s_p^2=\frac{(n_1-1)s_1^2+(n_2-1)s_2^2}{n_1+n_2-2},\ s_i^2= \frac{ \sum_{j=1}^{n_i}(y_{ij}-\bar{y}_i) ^2 }{n_i-1} \]
for the following null and alternate:
\[ H_0: \mu_1=\mu_2\ \ \ \ \text{and} \ \ \ H_a: \mu_1\neq\mu_2 \]
The test statistic above is compared with a t distribution with \(n_1+n_2-2\) degrees of freedom.
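To make the formulas concrete, here is a minimal by-hand sketch for the built-in sleep data used below:
y1 <- sleep$extra[sleep$group == 1]
y2 <- sleep$extra[sleep$group == 2]
n1 <- length(y1); n2 <- length(y2)
sp2 <- ((n1 - 1) * var(y1) + (n2 - 1) * var(y2)) / (n1 + n2 - 2)   # pooled variance
t_stat <- (mean(y1) - mean(y2)) / sqrt(sp2 * (1/n1 + 1/n2))
2 * pt(-abs(t_stat), df = n1 + n2 - 2)                             # about 0.079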
Now let us run the built-in t.test on the sleep data. Note that by default t.test performs Welch's unequal-variance version of the test; the p-value we get is 0.079.
with(sleep, t.test(extra[group == 1], extra[group == 2], paired = F))
##
## Welch Two Sample t-test
##
## data: extra[group == 1] and extra[group == 2]
## t = -1.8608, df = 17.776, p-value = 0.07939
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -3.3654832 0.2054832
## sample estimates:
## mean of x mean of y
## 0.75 2.33
Now we will try to do the same thing using regression. All we need to do is run a regression with extra as the dependent variable and group as a categorical independent variable. Let us do that.
mydata=sleep[, c("extra", "group")]
mydata$group=as.factor(mydata$group)
summary(lm(extra~group, data = mydata))
##
## Call:
## lm(formula = extra ~ group, data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.430 -1.305 -0.580 1.455 3.170
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.7500 0.6004 1.249 0.2276
## group2 1.5800 0.8491 1.861 0.0792 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.899 on 18 degrees of freedom
## Multiple R-squared: 0.1613, Adjusted R-squared: 0.1147
## F-statistic: 3.463 on 1 and 18 DF, p-value: 0.07919
The regression summary shows a p-value of 0.079 for the group2 coefficient, which tests the null \(\beta_{group2}=0\); this coefficient is exactly the difference of the two group means. Strictly speaking, the regression reproduces the pooled (equal-variance) t-test, which also gives p = 0.0792 here; the Welch p-value above (0.0794) is nearly identical only because the two group variances happen to be similar.
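As a quick check, running t.test with var.equal = TRUE gives the pooled test, which is the one the regression reproduces exactly:
with(sleep, t.test(extra[group == 1], extra[group == 2], var.equal = TRUE))
# t = -1.8608 on 18 df, p-value = 0.0792, matching the group2 row above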
The classical ANOVA can also be performed using linear regression. Here we are going to compare math scores for students in three program groups.
In this setting for multiple means, we have the following null and alternate:
\[ H_0: \mu_1=\mu_2=\mu_3\ \ \ \text{and }\ H_a:\text{at least one of the means is different} \]
For this ANOVA we do an F test. We compute the between sum of squares (BSS) and the within sum of squares (WSS), divide each by its degrees of freedom, and take the ratio. Under the null, this test statistic follows an F-distribution.
\[ BSS=\sum_{i=1}^{I}n_i(\bar{x}_i-\bar{x})^2\ \ ,\ \ \ df=I-1\ \ ,\ \ \bar{x}=\text{grand mean} \]
\[ WSS=\sum_{i=1}^{I}(n_i-1)s_i^2\ \ \ , \ \ \ df=n-I \]
In the formulation above, \(I\) is the number of groups. The two degrees of freedom of the resulting F-distribution are \(I-1\) and \(n-I\), respectively.
For our case, \(I=3\), since we have three groups.
Let us do it by hand:
mdata=read.csv("math.csv", header = T)
mdata$program=paste(mdata$program)              # store program as plain character labels
voc=subset(mdata, program=="vocational")$math
aca=subset(mdata, program=="academic")$math
gen=subset(mdata, program=="general")$math
grandmean=mean(mdata$math)
mean_voc=mean(voc);mean_aca=mean(aca);mean_gen=mean(gen)   # group means
n_voc=length(voc); n_aca=length(aca); n_gen=length(gen)    # group sizes
s2_voc=var(voc); s2_aca=var(aca); s2_gen=var(gen)          # group variances
I=3; n=nrow(mdata)
num_df=I-1
den_df=n-I
# between and within sums of squares, then the F ratio
BSS=n_voc*(mean_voc-grandmean)^2+n_aca*(mean_aca-grandmean)^2+n_gen*(mean_gen-grandmean)^2
WSS=(n_voc-1)*s2_voc+(n_aca-1)*s2_aca+(n_gen-1)*s2_gen
Bet_var=BSS/num_df; With_var=WSS/den_df
F_stat=Bet_var/With_var
pvalue=1-pf(F_stat, num_df, den_df)
print(cbind(BSS, num_df, WSS, den_df, Bet_var,With_var, F_stat, pvalue))
## BSS num_df WSS den_df Bet_var With_var F_stat
## [1,] 4002.104 2 13463.69 197 2001.052 68.34361 29.27928
## pvalue
## [1,] 7.363998e-12
All the statistics are printed above. The p-value is very small, so we reject the null hypothesis of equal means.
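As a quick cross-check, the built-in oneway.test should reproduce the same F statistic and p-value when equal variances are assumed:
oneway.test(math ~ program, data = mdata, var.equal = TRUE)   # F = 29.279 on (2, 197) df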
Now we will do the same comparison of more than two means using a linear regression model. All we need to do is include program as a categorical variable.
summary(lm(math~program, data=mdata))
##
## Call:
## lm(formula = math ~ program, data = mdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.7333 -6.4200 -0.5767 6.2667 28.5800
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 56.7333 0.8068 70.321 < 2e-16 ***
## programgeneral -6.7111 1.4730 -4.556 9.12e-06 ***
## programvocational -10.3133 1.4205 -7.260 8.73e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.267 on 197 degrees of freedom
## Multiple R-squared: 0.2291, Adjusted R-squared: 0.2213
## F-statistic: 29.28 on 2 and 197 DF, p-value: 7.364e-12
The summary shows us the global F test statistic and the p-value. These are exactly the same as what we got before!
Let us see what anova gives us:
anova(lm(math~program, data=mdata))
## Analysis of Variance Table
##
## Response: math
## Df Sum Sq Mean Sq F value Pr(>F)
## program 2 4002.1 2001.05 29.279 7.364e-12 ***
## Residuals 197 13463.7 68.34
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
All the calculated values are the same, as expected.
We can also express the between-group variability and the within-group variability in terms of regression quantities: the former is just the MSR (regression mean square) and the latter is just the MSE (error mean square).
\[ MSR=\frac{\sum_{i=1}^n(\hat{y}_i-\bar{y})^2}{p-1} \]
\[ MSE=\frac{\sum_{i=1}^n(y_i-\hat{y}_i)^2}{n-p} \]
where, \(p\) is the number of parameters estimated.
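A small sketch verifying this with the regression fit from above (mdata as before):
fit <- lm(math ~ program, data = mdata)
MSR <- sum((fitted(fit) - mean(mdata$math))^2) / (length(coef(fit)) - 1)   # between-group variability
MSE <- sum(residuals(fit)^2) / df.residual(fit)                            # within-group variability
MSR / MSE                                                                  # reproduces F = 29.279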
If we want to do a t-test for one sample mean, subtract the hypothesized value from each observation and fit a linear regression model with just an intercept. The p-value associated with the intercept is exactly the p-value of our hypothesis test \((H_0:\mu=\text{some value})\).
If we want to do hypothesis testing for two means, arrange the data in long format with a variable indicating which group each observation comes from, and fit a linear regression with that variable as a categorical predictor. The p-value associated with the categorical variable is the p-value for the two-mean hypothesis test \((H_0:\mu_1=\mu_2)\).
If we want to compare more than two means, again arrange the data in long format with a group indicator, and fit a linear regression with that indicator as a categorical predictor. The p-value of the overall (global) F test is then the p-value for comparing the means \((H_0:\mu_1=\mu_2=\mu_3=\dots=\mu_I)\).
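A compact recap of the three recipes (y, group, and mu0 are placeholder names, with group stored as a factor):
summary(lm(I(y - mu0) ~ 1))    # one mean:  intercept row tests H0: mu = mu0
summary(lm(y ~ group))         # two means: group coefficient tests H0: mu1 = mu2
anova(lm(y ~ group))           # 3+ means:  global F test for H0: mu1 = ... = muI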