T-TESTING: Differences between groups

The essence of hypothesis testing is asking “are there differences between group X and group Y”?

Does the group that gets the medicine get well more often than the group that doesn’t get the medicine?

Does the school with more teachers do better than the school with fewer teachers? Etc.

There is a simple way to compare outcomes for two groups in R.

Here, we can examine MPG versus the grouping variable “AM” (type of transmission) in MTCARS:

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(pastecs)
## 
## Attaching package: 'pastecs'
## 
## The following objects are masked from 'package:dplyr':
## 
##     first, last
## 
## The following object is masked from 'package:tidyr':
## 
##     extract
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## The following object is masked from 'package:purrr':
## 
##     some
cars<-mtcars

boxplot(mpg~am,data=cars)

Just from looking at this, we can see that the MPG values tend to be different depending on whether the car has an automatic transmission (AM=0) or a manual transmission (AM=1).

The trick that makes this comparison so easy is that transmissions (AM) are divided into discrete groups (0, 1, yes, no). The graph would not make much sense if the measurement for transmissions were continuous (1.1, 1.2, 1.3, and so forth).
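
Just to confirm the grouping, you can count how many cars fall into each transmission group (a quick sketch; this check is not in the original notes):

table(mtcars$am)
# 19 cars are automatic (0) and 13 are manual (1)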

As we have discussed with other graphs in the past, this image can be rendered mathematically, with something called a “T-test”:

t.test(mpg~am,data = cars)
## 
##  Welch Two Sample t-test
## 
## data:  mpg by am
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  -11.280194  -3.209684
## sample estimates:
## mean in group 0 mean in group 1 
##        17.14737        24.39231

This t-test is called a “TWO SAMPLE” t-test because the independent variable (am) splits the observations into two groups, or samples (AM=0 and AM=1), and we compare the mean of the dependent variable (mpg) between those two samples.

This is also the first time in the course that we can really test a hypothesis. T-testing is a step beyond correlation: it’s a way of asking “does X vary significantly when accounting for Y?” instead of just asking “is there a relationship between x and y?”

The null hypothesis that a t-test is “testing” is basically “the difference in means between two groups is equal to zero”. In the case above, the p value is less than 0.05, so the null hypothesis can be rejected with some confidence (the true difference in means is not equal to zero).
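
Since t.test() returns an object, you can also pull the p-value out and compare it to the 0.05 cutoff directly (a small sketch, using the same mpg ~ am test as above):

mpg_am_test<-t.test(mpg~am,data=cars)
mpg_am_test$p.value        # about 0.0014, matching the output above
mpg_am_test$p.value < 0.05 # TRUE, so we reject the null hypothesis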

For reference, here are some descriptive statistics for HP (horsepower) and AM, the two variables we will use in the next example:

stat.desc(mtcars$hp)
##      nbr.val     nbr.null       nbr.na          min          max        range 
##   32.0000000    0.0000000    0.0000000   52.0000000  335.0000000  283.0000000 
##          sum       median         mean      SE.mean CI.mean.0.95          var 
## 4694.0000000  123.0000000  146.6875000   12.1203173   24.7195501 4700.8669355 
##      std.dev     coef.var 
##   68.5628685    0.4674077
stat.desc(mtcars$am)
##      nbr.val     nbr.null       nbr.na          min          max        range 
##  32.00000000  19.00000000   0.00000000   0.00000000   1.00000000   1.00000000 
##          sum       median         mean      SE.mean CI.mean.0.95          var 
##  13.00000000   0.00000000   0.40625000   0.08820997   0.17990541   0.24899194 
##      std.dev     coef.var 
##   0.49899092   1.22828533

HOW TO INTERPRET A T-TEST

Null hypothesis: there is NO difference in mean HP between automatic and manual transmissions

Alternative hypothesis: there IS a difference in mean HP between automatic and manual transmissions

Dependent variable (what we are trying to measure): HP (horsepower)

Independent variable (what we believe is affecting our measurement): AM (transmission type)

t.test(hp~am,data=mtcars)
## 
##  Welch Two Sample t-test
## 
## data:  hp by am
## t = 1.2662, df = 18.715, p-value = 0.221
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  -21.87858  88.71259
## sample estimates:
## mean in group 0 mean in group 1 
##        160.2632        126.8462

p-value: roughly, the probability of seeing a difference this large (or larger) purely by chance if there were really no difference between the groups (less than 0.05 is generally treated as “significant”)

“T statistic” = connected closely with the p-value. An absolute value greater than about 2 typically means the result is significant; less than that, insignificant.

95 percent confidence interval = a range of plausible values for the true difference between the group means. Generally speaking, if the range includes “0” (zero), then the test is not significant (a difference of zero, i.e., no difference at all, is still plausible).

mean in group… = the estimated mean of the dependent variable for each of the two groups (in this case, automatic transmission versus manual transmission)
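
The t statistic, the degrees of freedom, and the p-value are all tied together through the t distribution. As a quick sketch (not part of the original notes), you can recover the p-value of the first test (mpg ~ am) from its t statistic and degrees of freedom:

# the Welch test for mpg ~ am reported t = -3.7671 on df = 18.332;
# the two-sided p-value is the area in both tails of the t distribution
2*pt(-abs(-3.7671),df=18.332) # roughly 0.0014, matching t.test()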

T-tests are parametric, meaning that they assume normality (in this case, normality of the outcome within each group).

T-TEST AS LINEAR MODEL

The neat thing about a T-Test is that it can be re-written as a “linear model” (regression equation). This may not mean much to you now, but regression is basically the main tool that we have to test hypotheses in modern science. For the most part, you will be using regression in your studies from here on out.

As a side note: regression can be called different things. Sometimes you will hear people refer to it as “linear models”, “generalized linear models” (GLM) and so forth. We’ll explain why they are considered “linear” in an upcoming class. In the meantime, let’s see why T-Tests and linear models are basically the same thing.

Let’s call a “lm” (linear model) to get the same results as the T-test above:

hp_am_model<-lm(hp~am,data=cars)

summary(hp_am_model)
## 
## Call:
## lm(formula = hp ~ am, data = cars)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -98.26 -51.51 -15.35  25.99 208.15 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   160.26      15.51  10.333 2.12e-11 ***
## am            -33.42      24.33  -1.373     0.18    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 67.6 on 30 degrees of freedom
## Multiple R-squared:  0.05915,    Adjusted R-squared:  0.02779 
## F-statistic: 1.886 on 1 and 30 DF,  p-value: 0.1798

This may seem confusing, but let’s go through it together.

The first thing you should notice is that, like the T-test, the variable “AM” (transmission) is not significant. It has a p-value of 0.18 (compare this to the similar p-value of 0.22 for the T-test above; the small difference is because the Welch T-test does not assume equal variances, while this basic linear model does). The interesting thing here, though, is that the p-value for AM isn’t the p-value of the whole model, it’s just the p-value for AM; once we add more variables, each one will get its own p-value.

You’ll also notice that the two group means are present in the regression results. In this case, the “intercept” is the AM=0 (automatic) group: its estimate is 160.26, the same as the estimated mean from the T-test. The estimate for AM is “-33.42”, which seems strange, but it actually means “160.26-33.42”, which is 126.84 (the same as the AM=1, manual, group mean from the T-test).
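
You can check this arithmetic directly from the fitted model object (a quick sketch using hp_am_model from above):

coef(hp_am_model)
# (Intercept) is the mean for AM=0 (automatic); adding the am coefficient
# gives the mean for AM=1 (manual): 160.26 + (-33.42) is about 126.84
coef(hp_am_model)[1]+coef(hp_am_model)[2]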

So, you see, we’ve learned from the linear model the same things that we could learn from the T-test by itself: the estimated mean HP for automatic (AM=0) is about 160, the estimated mean HP for manual (AM=1) is about 127, but the difference is not significant (p-value=0.18). The t-statistic for AM is also smaller than 2 in absolute value, meaning it is likely insignificant.

So why bother with the linear model? Why not just use the t-test?

Linear model advantages

The short answer is that the linear model gives us more information than a simple t-test.

The slightly longer answer is that the linear model can allow us to “T-Test” multiple variables at the same time.

Let’s discuss the short answer first:

summary(hp_am_model)
## 
## Call:
## lm(formula = hp ~ am, data = cars)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -98.26 -51.51 -15.35  25.99 208.15 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   160.26      15.51  10.333 2.12e-11 ***
## am            -33.42      24.33  -1.373     0.18    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 67.6 on 30 degrees of freedom
## Multiple R-squared:  0.05915,    Adjusted R-squared:  0.02779 
## F-statistic: 1.886 on 1 and 30 DF,  p-value: 0.1798

If you look at the bottom right-hand side of the SUMMARY() output, you’ll see something called an “Adjusted R-squared”.

In this case, the adjusted R-squared is basically the proportion of the variance in the outcome (HP) that is explained by AM, adjusted for the number of predictors in the model. Obviously, AM is insignificant here, but if it WERE significant, we could say “the adjusted R-squared for HP ~ AM is 0.027”. This is similar to saying “AM is responsible for about 3% (0.027) of the variance in HP”. Sometimes you will also hear adjusted R-squared used as a measure of “effect size”.
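
If you ever need that number by itself, the adjusted R-squared can be pulled straight out of the summary object (a small sketch):

summary(hp_am_model)$adj.r.squared # about 0.028 for this model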

This may seem confusing now but it will make more sense over the next few weeks. Basically, regression allows us to indirectly estimate levels of cause and effect. We can create a model (even a very simple model like HP vs AM) and see how one variable affects another. This is the crux of statistical experimentation.

T-TEST ASSUMPTIONS

By now, you’ve gathered that T-tests and linear models are very similar – bordering on exactly the same. In fact, you might say that T-tests are really just special applications of a much more “generalized” linear model. Consequently, T-tests and linear models have similar assumptions as well.

  1. The first assumption is that the observations are independent. There is no mathematical test for this; you just have to plausibly show that the observations do not influence one another (for example, that the same subject does not appear in the data twice). If your model has several independent variables, it is also worth checking (for example, with a correlation test) that they are not strongly related to each other.

  2. Another assumption is the normality of the data in each group (for small sample sizes).

We can test normality of data by group here:

ggplot(cars,aes(x=hp)) + geom_histogram(aes(y=after_stat(density)),binwidth=35) + geom_density(aes(y=after_stat(density)),color="red",linewidth=3)+ facet_wrap(~am) 

The outcomes for AM=1 are fairly skewed (non-normal), so typically we would either drop this comparison, transform the variable, or note in our write-up how far from normal it is.
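
If you want a numeric check to go with the histograms, one common option (not used in the original notes) is a Shapiro-Wilk normality test run separately for each group:

# p-values below 0.05 suggest the data in that group are not normal
cars %>% group_by(am) %>% summarize(shapiro_p=shapiro.test(hp)$p.value)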

  3. The third assumption is equality of variances of the outcome variable at each level of the grouping variable:
cars<-cars %>% mutate(am_factor=as.factor(am))
leveneTest(mpg~am_factor,data=cars)
## Levene's Test for Homogeneity of Variance (center = median)
##       Df F value  Pr(>F)  
## group  1  4.1876 0.04957 *
##       30                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

In this case, Levene’s test is just barely significant (p = 0.0496), which means the variances differ slightly between groups. You can still run the t-test; the Welch version that R uses by default does not assume equal variances. Just report that Levene’s test was marginally significant.
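
For comparison, you could force the classic pooled-variance version of the t-test with the var.equal argument (a sketch, not run in the original notes):

t.test(mpg~am,data=cars,var.equal=TRUE) # pooled-variance (Student's) t-test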

  4. Measurement scale: the dependent variable should be continuous

Let’s look at some data we haven’t seen before:

juice<-as.data.frame(ToothGrowth)
tibble(juice)
## # A tibble: 60 × 3
##      len supp   dose
##    <dbl> <fct> <dbl>
##  1   4.2 VC      0.5
##  2  11.5 VC      0.5
##  3   7.3 VC      0.5
##  4   5.8 VC      0.5
##  5   6.4 VC      0.5
##  6  10   VC      0.5
##  7  11.2 VC      0.5
##  8  11.2 VC      0.5
##  9   5.2 VC      0.5
## 10   7   VC      0.5
## # ℹ 50 more rows

This is measuring the length of tooth growth in guinea pigs that have been given vitamin C either on its own (coded “VC” in the “supp” column) or as orange juice (“OJ”).

Do you think teeth will grow faster with vitamin C by itself or orange juice? Personally, I think orange juice because it has calcium.

  1. is the data independent? Yes, they are different guinea pigs

  2. is the data normal?

ggplot(juice,aes(x=len)) + geom_histogram(aes(y=after_stat(density)),binwidth=2) + geom_density(aes(y=after_stat(density)),color="red",linewidth=3)+ facet_wrap(~supp) + ggtitle("Normality of OJ vs. Vitamin C for tooth growth")

It’s so-so, but with 60 observations it’s (probably) okay (large datasets don’t need to be perfectly normal)

  3. is the variance of the outcome equal at each level?
leveneTest(len~supp,data=juice)
## Levene's Test for Homogeneity of Variance (center = median)
##       Df F value Pr(>F)
## group  1  1.2136 0.2752
##       58

Levene’s test is clearly non-significant (p = 0.275), so we can treat the variances as equal across groups.
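
You can also just look at the group variances directly (a quick sketch; the original notes rely on Levene's test alone):

juice %>% group_by(supp) %>% summarize(variance=var(len))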

  4. is the data continuous? Yes, it’s a measurement of length.

Let’s run the test:

t.test(len~supp,data=juice)
## 
##  Welch Two Sample t-test
## 
## data:  len by supp
## t = 1.9153, df = 55.309, p-value = 0.06063
## alternative hypothesis: true difference in means between group OJ and group VC is not equal to 0
## 95 percent confidence interval:
##  -0.1710156  7.5710156
## sample estimates:
## mean in group OJ mean in group VC 
##         20.66333         16.96333

This one is tricky. The p-value is just barely above 0.05, the t-statistic is a little smaller than 2, and the confidence interval just barely includes zero. Taken together, this means we cannot reliably distinguish mean tooth growth between the vitamin C and orange juice groups.

In other words, we have basically tested our hypothesis and failed to reject the null: “there is no observable difference in tooth growth between vitamin C and orange juice”, at least before taking dose into account.

This same t-test can be run as a linear model, as we stated previously:

juice_model<-lm(len~supp,data=juice)
summary(juice_model)
## 
## Call:
## lm(formula = len ~ supp, data = juice)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.7633  -5.7633   0.4367   5.5867  16.9367 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   20.663      1.366  15.127   <2e-16 ***
## suppVC        -3.700      1.932  -1.915   0.0604 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.482 on 58 degrees of freedom
## Multiple R-squared:  0.05948,    Adjusted R-squared:  0.04327 
## F-statistic: 3.668 on 1 and 58 DF,  p-value: 0.06039

Essentially the same as the t-test. The coefficient is labeled “suppVC”, which tells us the intercept represents the OJ group: the estimated mean tooth growth for OJ is 20.663, while the estimated mean for vitamin C is “-3.700” from 20.663, or 16.963. Even then, the p-value for vitamin C is still 0.06, so it is not really significant (or is on the “edge” of significance).

There is another variable, however: the dosage of orange juice and vitamin C:

dose_table<-juice %>% group_by(supp,dose) %>% summarize(n=n())
## `summarise()` has grouped output by 'supp'. You can override using the
## `.groups` argument.
dose_table
## # A tibble: 6 × 3
## # Groups:   supp [2]
##   supp   dose     n
##   <fct> <dbl> <int>
## 1 OJ      0.5    10
## 2 OJ      1      10
## 3 OJ      2      10
## 4 VC      0.5    10
## 5 VC      1      10
## 6 VC      2      10

We see that both OJ and vitamin C have three doses: 0.5, 1 and 2. We can add the dosage to the linear model in addition to the OJ/VC factor.

juice_model2<-lm(len~supp+dose,data=juice)

summary(juice_model2)
## 
## Call:
## lm(formula = len ~ supp + dose, data = juice)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -6.600 -3.700  0.373  2.116  8.800 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   9.2725     1.2824   7.231 1.31e-09 ***
## suppVC       -3.7000     1.0936  -3.383   0.0013 ** 
## dose          9.7636     0.8768  11.135 6.31e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.236 on 57 degrees of freedom
## Multiple R-squared:  0.7038, Adjusted R-squared:  0.6934 
## F-statistic: 67.72 on 2 and 57 DF,  p-value: 8.716e-16

The first thing you’ll notice is that “suppVC” is now significant, and dose is very significant. How did this happen? Adding the “dose” variable explains a large share of the variation in tooth growth, which shrinks the leftover (residual) error and makes the OJ-versus-VC effect easier to detect. There really is a difference between OJ and vitamin C, once you take dosage into account.
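
One way to see how much the dose variable helps (a sketch; this comparison is not in the original notes) is to compare the two nested models with an F-test:

# tests whether adding dose significantly improves on the supp-only model
anova(juice_model,juice_model2)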

Instead of having one variable be “suppVC”, let’s make OJ an explicit 0/1 indicator variable:

juice_data_extended<-juice %>% mutate(OJ_variable=ifelse(supp=="OJ",1,0),VC_variable=ifelse(supp=="VC",1,0))

juice_extended_model<-lm(len~OJ_variable+dose,data=juice_data_extended)

summary(juice_extended_model)
## 
## Call:
## lm(formula = len ~ OJ_variable + dose, data = juice_data_extended)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -6.600 -3.700  0.373  2.116  8.800 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   5.5725     1.2824   4.345 5.79e-05 ***
## OJ_variable   3.7000     1.0936   3.383   0.0013 ** 
## dose          9.7636     0.8768  11.135 6.31e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.236 on 57 degrees of freedom
## Multiple R-squared:  0.7038, Adjusted R-squared:  0.6934 
## F-statistic: 67.72 on 2 and 57 DF,  p-value: 8.716e-16

Now it is much clearer what effect OJ is having on tooth growth. Still, the most important variable is dosage. Notice that when dosage is added, the adjusted R-squared rises to 0.69; that is, this model accounts for about 69% of the variance in tooth growth. The model itself is also quite significant. Taken together, this suggests that a more accurate answer to our question is “OJ is better for tooth growth than vitamin C alone, but the most important variable is the dosage.”
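
Once you have a model like this, you can also use it to make predictions. As a sketch (using the coefficients shown above), the predicted tooth growth for a guinea pig given orange juice at a dose of 2 would be about 5.57 + 3.70 + 9.76*2, or roughly 28.8:

predict(juice_extended_model,newdata=data.frame(OJ_variable=1,dose=2))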

Homework

  1. Complete brief annotated literature review and submit via Canvas!