The essence of hypothesis testing is asking “are there differences between group X and group Y?”
Does the group that gets the medicine recover more often than the group that doesn’t?
Does the school with more teachers do better than the school with fewer teachers? Etc.
There is a simple way to compare outcomes for two groups in R.
Here, we can examine mpg versus the grouping variable “am” (type of transmission) in the mtcars dataset:
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(pastecs)
##
## Attaching package: 'pastecs'
##
## The following objects are masked from 'package:dplyr':
##
## first, last
##
## The following object is masked from 'package:tidyr':
##
## extract
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
##
## The following object is masked from 'package:dplyr':
##
## recode
##
## The following object is masked from 'package:purrr':
##
## some
cars<-mtcars
boxplot(mpg~am,data=cars)
Just from looking at this boxplot, we can see that mpg tends to differ depending on whether the car has an automatic transmission (am = 0) or a manual transmission (am = 1).
The trick that makes this comparison so easy is that transmissions (am) are divided into discrete groups (0/1, or yes/no). The graph would not make much sense if the measurement for transmissions were continuous (1.1, 1.2, 1.3, and so forth).
As we have discussed with other graphs in the past, this visual comparison can be expressed mathematically, with something called a “t-test”:
t.test(mpg~am,data = cars)
##
## Welch Two Sample t-test
##
## data: mpg by am
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## -11.280194 -3.209684
## sample estimates:
## mean in group 0 mean in group 1
## 17.14737 24.39231
This t-test is called a “TWO SAMPLE” t-test because we are comparing the dependent variable (mpg) across the two groups defined by the independent variable (am): the automatic cars and the manual cars are the “TWO SAMPLES”.
This is also the first time in the course that we can really test a hypothesis. T-testing is a step beyond correlation: instead of just asking “is there a relationship between X and Y?”, it asks “do the means of X differ significantly between the groups defined by Y?”
The null hypothesis that a t-test is “testing” is basically “the difference in means between two groups is equal to zero”. In the case above, the p value is less than 0.05, so the null hypothesis can be rejected with some confidence (the true difference in means is not equal to zero).
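As a small sketch of that decision rule (the object name mpg_test is just for illustration), you can store the t.test() result and pull the p-value out of it rather than reading it off the printout:
mpg_test <- t.test(mpg ~ am, data = cars)
mpg_test$p.value          # 0.001374, as in the output above
mpg_test$p.value < 0.05   # TRUE, so we reject the null hypothesis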
Now let’s set up another test. Before testing, here are descriptive statistics for horsepower (hp) and transmission type (am):
stat.desc(mtcars$hp)
## nbr.val nbr.null nbr.na min max range
## 32.0000000 0.0000000 0.0000000 52.0000000 335.0000000 283.0000000
## sum median mean SE.mean CI.mean.0.95 var
## 4694.0000000 123.0000000 146.6875000 12.1203173 24.7195501 4700.8669355
## std.dev coef.var
## 68.5628685 0.4674077
stat.desc(mtcars$am)
## nbr.val nbr.null nbr.na min max range
## 32.00000000 19.00000000 0.00000000 0.00000000 1.00000000 1.00000000
## sum median mean SE.mean CI.mean.0.95 var
## 13.00000000 0.00000000 0.40625000 0.08820997 0.17990541 0.24899194
## std.dev coef.var
## 0.49899092 1.22828533
Null hypothesis: there is NO difference in mean HP between automatic and manual transmissions
Alternative hypothesis: there IS a difference in mean HP between automatic and manual transmissions
Dependent variable (what we are trying to measure): HP (horsepower)
Independent variable (what we believe is affecting our measurement): AM (transmission type)
t.test(hp~am,data=mtcars)
##
## Welch Two Sample t-test
##
## data: hp by am
## t = 1.2662, df = 18.715, p-value = 0.221
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## -21.87858 88.71259
## sample estimates:
## mean in group 0 mean in group 1
## 160.2632 126.8462
p-value: roughly, the probability of seeing a difference at least this large if there were really no difference between the groups (less than 0.05 is generally considered “significant”)
“T statistic”: closely tied to the p-value. An absolute value greater than about 2 typically means the result is significant; less than that, not significant.
95 percent confidence interval: a range of plausible values for the true difference between the group means. Generally speaking, if the range includes “0” (zero), the test is not significant, because a difference of zero (i.e., the groups being the same) remains plausible.
mean in group…: the estimated mean of the dependent variable in each of the two groups (in this case, automatic transmission versus manual transmission)
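If you want to grab these numbers directly instead of reading them off the printout, the t.test() result is just a list. Here is a minimal sketch using the hp ~ am test from above (the name hp_test is just for illustration):
hp_test <- t.test(hp ~ am, data = mtcars)
hp_test$statistic   # the t statistic (1.2662 in the output above)
hp_test$p.value     # the p-value (0.221 above)
hp_test$conf.int    # the 95 percent confidence interval
hp_test$estimate    # the estimated mean for each group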
T-tests are parametric, meaning they make distributional assumptions. For a two-sample t-test like this one, the assumption is that the outcome is roughly normally distributed within each group (for a paired t-test, it is the differences between scores that need to be normal).
The neat thing about a T-Test is that it can be re-written as a “linear model” (regression equation). This may not mean much to you now, but regression is basically the main tool that we have to test hypotheses in modern science. For the most part, you will be using regression in your studies from here on out.
As a side note: regression can be called different things. Sometimes you will hear people refer to it as “linear models”, “generalized linear models” (GLM) and so forth. We’ll explain why they are considered “linear” in an upcoming class. In the meantime, let’s see why T-Tests and linear models are basically the same thing.
Let’s fit an “lm” (linear model) to get essentially the same results as the t-test above:
hp_am_model<-lm(hp~am,data=cars)
summary(hp_am_model)
##
## Call:
## lm(formula = hp ~ am, data = cars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -98.26 -51.51 -15.35 25.99 208.15
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 160.26 15.51 10.333 2.12e-11 ***
## am -33.42 24.33 -1.373 0.18
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 67.6 on 30 degrees of freedom
## Multiple R-squared: 0.05915, Adjusted R-squared: 0.02779
## F-statistic: 1.886 on 1 and 30 DF, p-value: 0.1798
This may seem confusing, but let’s go through it together.
The first thing you should notice is that, like the t-test, the variable “am” (transmission) is not significant. It has a p-value of 0.18, quite close to the p-value of 0.22 from the t-test above. (The small difference is not because the linear model tests something different; it is because R’s default t.test() is Welch’s version, which does not assume equal variances, while lm() does. A t-test run with var.equal = TRUE would match the lm() p-value.)
You’ll also notice that the two group means are present in the regression results. The “intercept” is the am = 0 (automatic) group: its estimate is 160.26, the same as the estimated mean from the t-test. The “am” coefficient is -33.42, which is not a mean itself but the difference from the intercept: 160.26 - 33.42 = 126.84, the same as the manual-group (am = 1) mean from the t-test.
So, you see, the linear model tells us the same things we could learn from the t-test by itself: the estimated mean hp for automatic cars is about 160, the estimated mean hp for manual cars is about 127, and the difference is not significant (p-value = 0.18). The absolute value of the t statistic for am is also less than 2, which points the same way.
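To see those t-test means hiding inside the linear model, you can pull the coefficients out of the fitted object (a small sketch, using nothing beyond the model we just fit):
coef(hp_am_model)                             # intercept and the am coefficient
coef(hp_am_model)[1]                          # estimated mean hp for automatic cars (about 160.3)
coef(hp_am_model)[1] + coef(hp_am_model)[2]   # estimated mean hp for manual cars (about 126.8)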
So why bother with the linear model? Why not just use the t-test?
The short answer is that the linear model gives us more information than a simple t-test.
The slightly longer answer is that the linear model can allow us to “T-Test” multiple variables at the same time.
Let’s discuss the short answer first:
summary(hp_am_model)
##
## Call:
## lm(formula = hp ~ am, data = cars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -98.26 -51.51 -15.35 25.99 208.15
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 160.26 15.51 10.333 2.12e-11 ***
## am -33.42 24.33 -1.373 0.18
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 67.6 on 30 degrees of freedom
## Multiple R-squared: 0.05915, Adjusted R-squared: 0.02779
## F-statistic: 1.886 on 1 and 30 DF, p-value: 0.1798
If you look at the bottom right-hand side of the SUMMARY() output, you’ll see something called an “Adjusted R-squared”.
The adjusted R-squared is (roughly) the proportion of the variance in the dependent variable (hp) that the model explains, adjusted for the number of predictors. Here am is not significant, but if it WERE significant, we could say “the adjusted R-squared for hp ~ am is about 0.028”, which is similar to saying “am accounts for about 3% of the variance in hp”. R-squared is also sometimes used as a measure of “effect size”.
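You can pull this number directly out of the model summary rather than reading it off the printout (a minimal sketch using the model above; hp_am_summary is just an illustrative name):
hp_am_summary <- summary(hp_am_model)
hp_am_summary$adj.r.squared   # about 0.028, the adjusted R-squared
hp_am_summary$r.squared       # about 0.059, the unadjusted R-squared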
This may seem confusing now but it will make more sense over the next few weeks. Basically, regression allows us to estimate how much one variable moves with another and, in a well-designed experiment, to get at cause and effect indirectly. We can create a model (even a very simple model like hp ~ am) and see how one variable is associated with another. This is the crux of statistical experimentation.
By now, you’ve gathered that T-tests and linear models are very similar – bordering on exactly the same. In fact, you might say that T-tests are really just special applications of a much more “generalized” linear model. Consequently, T-tests and linear models have similar assumptions as well.
The first assumption is that the observations are independent of one another (for example, one car’s measurements do not depend on another car’s). There is no mathematical test for this; you have to argue from the study design that the observations are not connected. A related habit worth developing: when you have several independent variables, check the correlations among them, since strongly related predictors cause problems (multicollinearity) once you start building larger models. See the sketch below.
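As a hedged illustration of that suggestion, here is one way to eyeball the correlations among a few candidate predictors in mtcars (the choice of columns is purely for illustration):
# Correlation matrix for some mtcars variables we might use as predictors;
# very high correlations (say |r| > 0.8) would be a warning sign if we
# planned to put those predictors in the same model.
round(cor(mtcars[, c("am", "wt", "hp", "disp")]), 2)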
Another assumption is the normality of the data in each group (for small sample sizes).
We can test normality of data by group here:
ggplot(cars,aes(x=hp)) + geom_histogram(aes(y=after_stat(density)),binwidth=35) + geom_density(aes(y=after_stat(density)),color="red",linewidth=3)+ facet_wrap(~am)
The outcomes for am = 1 (manual) are fairly skewed (non-normal), so typically we would either drop the variable, transform it, or note in our write-up how far it departs from normality.
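If you want a formal check to go along with the histograms, one common option is a Shapiro-Wilk normality test within each group; a minimal sketch, with the caveat that such tests get very sensitive as samples grow:
shapiro.test(cars$hp[cars$am == 0])   # automatic transmission group
shapiro.test(cars$hp[cars$am == 1])   # manual transmission group
# A small p-value would suggest a departure from normality in that group.
A third assumption is homogeneity of variance across the groups, which we can check with Levene’s test (run here on mpg, the variable from our first example):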
cars<-cars %>% mutate(am_factor=as.factor(am))
leveneTest(mpg~am_factor,data=cars)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 1 4.1876 0.04957 *
## 30
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
In this case, Levene’s test is just barely significant (p ≈ 0.0496), which suggests the group variances differ somewhat. You can usually still proceed; just report that the test was (marginally) significant.
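For reference, R’s default t.test() is Welch’s version, which does not assume equal variances in the first place. The classic pooled-variance (“Student’s”) t-test, which does assume them and should line up with what lm() reports for the same formula, is available with var.equal = TRUE; a quick sketch:
# Pooled-variance t-test: assumes equal variances across the two groups
t.test(mpg ~ am, data = cars, var.equal = TRUE)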
Let’s look at some data we haven’t seen before:
juice<-as.data.frame(ToothGrowth)
tibble(juice)
## # A tibble: 60 × 3
## len supp dose
## <dbl> <fct> <dbl>
## 1 4.2 VC 0.5
## 2 11.5 VC 0.5
## 3 7.3 VC 0.5
## 4 5.8 VC 0.5
## 5 6.4 VC 0.5
## 6 10 VC 0.5
## 7 11.2 VC 0.5
## 8 11.2 VC 0.5
## 9 5.2 VC 0.5
## 10 7 VC 0.5
## # ℹ 50 more rows
This dataset measures tooth growth (column “len”) in guinea pigs given vitamin C either as plain ascorbic acid (“VC” in column “supp”) or as orange juice (“OJ”).
Do you think teeth will grow faster with vitamin C by itself or orange juice? Personally, I think orange juice because it has calcium.
Is the data independent? Yes: they are different guinea pigs.
Is the data normal?
ggplot(juice,aes(x=len)) + geom_histogram(aes(y=after_stat(density)),binwidth=2) + geom_density(aes(y=after_stat(density)),color="red",linewidth=3)+ facet_wrap(~supp) + ggtitle("Normality of OJ vs. Vitamin C for tooth growth")
The distributions are so-so, but with 60 observations (30 per group) that is probably acceptable; larger samples don’t need to be perfectly normal for the t-test to behave reasonably.
leveneTest(len~supp,data=juice)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 1 1.2136 0.2752
## 58
Levene’s test is clearly non-significant (p = 0.275), so we have no evidence that the variances differ by group.
Let’s run the test:
t.test(len~supp,data=juice)
##
## Welch Two Sample t-test
##
## data: len by supp
## t = 1.9153, df = 55.309, p-value = 0.06063
## alternative hypothesis: true difference in means between group OJ and group VC is not equal to 0
## 95 percent confidence interval:
## -0.1710156 7.5710156
## sample estimates:
## mean in group OJ mean in group VC
## 20.66333 16.96333
This one is tricky. The p-value (0.061) is just barely above 0.05, the t statistic is a little smaller than 2, and the confidence interval just barely includes zero. Taken together, this means the test cannot distinguish mean tooth growth under orange juice from mean tooth growth under plain vitamin C at the conventional 0.05 level.
In other words, we fail to reject the null hypothesis: with this data we cannot show a difference in tooth growth between the orange juice and plain vitamin C groups.
This same t-test can be run as a linear model, as we stated previously:
juice_model<-lm(len~supp,data=juice)
summary(juice_model)
##
## Call:
## lm(formula = len ~ supp, data = juice)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.7633 -5.7633 0.4367 5.5867 16.9367
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 20.663 1.366 15.127 <2e-16 ***
## suppVC -3.700 1.932 -1.915 0.0604 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.482 on 58 degrees of freedom
## Multiple R-squared: 0.05948, Adjusted R-squared: 0.04327
## F-statistic: 3.668 on 1 and 58 DF, p-value: 0.06039
Essentially the same as the t-test. The coefficient is labeled “suppVC”, which tells us the intercept represents the OJ group: the estimated mean tooth growth for OJ is 20.663, and the estimated mean for vitamin C is 20.663 - 3.700 = 16.963. The p-value for suppVC is still about 0.06, so it is not quite significant (it sits on the “edge” of significance).
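As a quick sanity check (a small sketch using the same data), you can compute the group means directly and see that they line up with the intercept and intercept-plus-coefficient:
# Mean tooth length by supplement type; OJ should come out near 20.66
# and VC near 16.96, matching the t-test and the model above
juice %>% group_by(supp) %>% summarize(mean_len = mean(len))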
There is another variable, however: the dosage of orange juice and vitamin C:
dose_table<-juice %>% group_by(supp,dose) %>% summarize(n=n())
## `summarise()` has grouped output by 'supp'. You can override using the
## `.groups` argument.
dose_table
## # A tibble: 6 × 3
## # Groups: supp [2]
## supp dose n
## <fct> <dbl> <int>
## 1 OJ 0.5 10
## 2 OJ 1 10
## 3 OJ 2 10
## 4 VC 0.5 10
## 5 VC 1 10
## 6 VC 2 10
We see that both OJ and vitamin C have three doses: 0.5, 1 and 2. We can add the dosage to the linear model in addition to the OJ/VC factor.
juice_model2<-lm(len~supp+dose,data=juice)
summary(juice_model2)
##
## Call:
## lm(formula = len ~ supp + dose, data = juice)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.600 -3.700 0.373 2.116 8.800
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.2725 1.2824 7.231 1.31e-09 ***
## suppVC -3.7000 1.0936 -3.383 0.0013 **
## dose 9.7636 0.8768 11.135 6.31e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.236 on 57 degrees of freedom
## Multiple R-squared: 0.7038, Adjusted R-squared: 0.6934
## F-statistic: 67.72 on 2 and 57 DF, p-value: 8.716e-16
The first thing you’ll notice is that “suppVC” is now significant, and dose is very significant. How did this happen? Adding the dose variable explains a large share of the variation in tooth length, which shrinks the residual error; with less unexplained noise, the OJ-versus-VC difference becomes detectable. There really is a difference between OJ and vitamin C, once you take dosage into account.
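One way to confirm that adding dose genuinely improves the fit, rather than just eyeballing the coefficients, is a nested-model F-test comparing the two models we have already fit (a small sketch):
# Compare the supp-only model to the supp + dose model;
# a small p-value indicates the model with dose fits significantly better
anova(juice_model, juice_model2)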
Instead of letting R create the “suppVC” dummy for us, let’s make OJ an explicit 0/1 indicator variable:
juice_data_extended<-juice %>% mutate(OJ_variable=ifelse(supp=="OJ",1,0),VC_variable=ifelse(supp=="VC",1,0))
juice_extended_model<-lm(len~OJ_variable+dose,data=juice_data_extended)
summary(juice_extended_model)
##
## Call:
## lm(formula = len ~ OJ_variable + dose, data = juice_data_extended)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.600 -3.700 0.373 2.116 8.800
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.5725 1.2824 4.345 5.79e-05 ***
## OJ_variable 3.7000 1.0936 3.383 0.0013 **
## dose 9.7636 0.8768 11.135 6.31e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.236 on 57 degrees of freedom
## Multiple R-squared: 0.7038, Adjusted R-squared: 0.6934
## F-statistic: 67.72 on 2 and 57 DF, p-value: 8.716e-16
Now the effect that OJ has on tooth growth is much clearer. Still, the most important variable is dosage. Notice that when dosage is added, the adjusted R-squared rises to about 0.69; that is, the model accounts for roughly 69% of the variance in tooth growth. The model itself is also highly significant. Taken together, this suggests that a more accurate answer to our question is “OJ is better for tooth growth than plain vitamin C, but the most important variable is the dosage.”