T Test

We’re going to take a look at t-tests using a data set that has a dichotomous variable, meaning one that takes only two values. Typically you’d use a t-test to compare the means of two groups, so it is limited to a single point in time and to two groups. The outcome being compared should be continuous. We’ll show both a one-sample and a two-sample test.

Remember that the data in each group should be approximately normal, or the sample should be large enough for the central limit theorem to apply (a common rule of thumb is n > 30).
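One way to check the normality assumption before running a test is a Shapiro-Wilk test together with a histogram. The sketch below is a minimal example, assuming the percasian column from ggplot2’s midwest data (the data set used in the rest of this post); the name normality_check is only illustrative.

```r
library(ggplot2)  # provides the midwest data set

# Shapiro-Wilk test: the null hypothesis is that the data are normal,
# so a small p-value suggests a departure from normality.
normality_check <- shapiro.test(midwest$percasian)
normality_check

# A histogram gives a quick visual check of the distribution's shape.
ggplot(midwest, aes(percasian)) + geom_histogram(bins = 30)
```

With several hundred observations, even a clearly skewed variable can still be tested on its mean thanks to the central limit theorem, so treat this check as a guide rather than a hard gate.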

library(ggplot2)
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.5.3
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)
## Warning: package 'tidyr' was built under R version 3.5.3
library(magrittr)
## Warning: package 'magrittr' was built under R version 3.5.2
## 
## Attaching package: 'magrittr'
## The following object is masked from 'package:tidyr':
## 
##     extract
library(gridExtra)
## Warning: package 'gridExtra' was built under R version 3.5.3
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine

We’re going to be using the “midwest” data from the ggplot2 package. Eventually we can just go ahead and use whatever data we want, but this is easiest.

head(midwest, 10)
## # A tibble: 10 x 28
##      PID county state  area poptotal popdensity popwhite popblack
##    <int> <chr>  <chr> <dbl>    <int>      <dbl>    <int>    <int>
##  1   561 ADAMS  IL    0.052    66090      1271.    63917     1702
##  2   562 ALEXA~ IL    0.014    10626       759      7054     3496
##  3   563 BOND   IL    0.022    14991       681.    14477      429
##  4   564 BOONE  IL    0.017    30806      1812.    29344      127
##  5   565 BROWN  IL    0.018     5836       324.     5264      547
##  6   566 BUREAU IL    0.05     35688       714.    35157       50
##  7   567 CALHO~ IL    0.017     5322       313.     5298        1
##  8   568 CARRO~ IL    0.027    16805       622.    16519      111
##  9   569 CASS   IL    0.024    13437       560.    13384       16
## 10   570 CHAMP~ IL    0.058   173025      2983.   146506    16559
## # ... with 20 more variables: popamerindian <int>, popasian <int>,
## #   popother <int>, percwhite <dbl>, percblack <dbl>, percamerindan <dbl>,
## #   percasian <dbl>, percother <dbl>, popadults <int>, perchsd <dbl>,
## #   percollege <dbl>, percprof <dbl>, poppovertyknown <int>,
## #   percpovertyknown <dbl>, percbelowpoverty <dbl>,
## #   percchildbelowpovert <dbl>, percadultpoverty <dbl>,
## #   percelderlypoverty <dbl>, inmetro <int>, category <chr>

Perfect! We have an “inmetro” variable that is dichotomous, alongside many continuous variables, so this should work out well.

Let’s start with a one-sample t-test, where we compare a variable’s mean against a value of our choosing. I want to see whether the average percentage of Asian residents across Midwest counties is less than 3%.

t.test(midwest$percasian, mu = 3, alternative = "less")
## 
##  One Sample t-test
## 
## data:  midwest$percasian
## t = -83.662, df = 436, p-value < 2.2e-16
## alternative hypothesis: true mean is less than 3
## 95 percent confidence interval:
##       -Inf 0.5367536
## sample estimates:
## mean of x 
## 0.4872462

Looks like I’m right: the p-value is far below 0.05, so the average county-level Asian percentage (about 0.49%) is indeed less than 3%.
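To see where the t statistic comes from, here is a minimal sketch that computes it by hand from the sample mean, standard deviation, and size. The names x, mu, t_manual, and p_manual are illustrative, not part of any package.

```r
library(ggplot2)  # provides the midwest data set

x  <- midwest$percasian
mu <- 3  # hypothesized mean from the test above

# t = (sample mean - hypothesized mean) / standard error of the mean
t_manual <- (mean(x) - mu) / (sd(x) / sqrt(length(x)))

# One-sided ("less") p-value from the t distribution with n - 1 df
p_manual <- pt(t_manual, df = length(x) - 1)

t_manual
p_manual
```

Both values should agree with the statistic and p-value that t.test() reports for the same data and alternative.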

What if the percentage is not normally distributed? We would then use a non-parametric alternative, the Wilcoxon signed-rank test.

wilcox.test(midwest$percasian, mu = 3, alternative = "less")
## 
##  Wilcoxon signed rank test with continuity correction
## 
## data:  midwest$percasian
## V = 117, p-value < 2.2e-16
## alternative hypothesis: true location is less than 3

Let’s now do a two-sample t-test, which is pretty straightforward. We’re now comparing two different means, both from our data set, rather than comparing one mean against an arbitrarily chosen value of interest.

To do this, I want to subset the data, keeping the percentage of Black residents (percblack) for Illinois and Ohio.

sub <- midwest %>% 
  filter(state == "IL" | state == "OH") %>%
  select(state, percblack)
summary(sub)
##     state             percblack       
##  Length:190         Min.   : 0.00943  
##  Class :character   1st Qu.: 0.20466  
##  Mode  :character   Median : 1.30147  
##                     Mean   : 3.58947  
##                     3rd Qu.: 4.22224  
##                     Max.   :32.90043

Let’s visualize the data first.

ggplot(sub, aes(state, percblack)) + geom_boxplot()
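A boxplot draws medians rather than means, so a quick numeric summary by group can help before testing. This is a minimal sketch that rebuilds the same subset; group_summary is an illustrative name.

```r
library(ggplot2)  # provides the midwest data set
library(dplyr)

# Same subset as above: Illinois and Ohio counties only
sub <- midwest %>%
  filter(state %in% c("IL", "OH")) %>%
  select(state, percblack)

# Mean, median, and county count for each state
group_summary <- sub %>%
  group_by(state) %>%
  summarise(
    mean_percblack   = mean(percblack),
    median_percblack = median(percblack),
    n                = n()
  )

group_summary
```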

Looks like the two states do differ, but not by much. Keep in mind that a boxplot shows medians rather than means, so let’s just see what t.test() says.

t.test(percblack ~ state, data = sub)
## 
##  Welch Two Sample t-test
## 
## data:  percblack by state
## t = 0.17125, df = 185.37, p-value = 0.8642
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.467854  1.746908
## sample estimates:
## mean in group IL mean in group OH 
##         3.654094         3.514567
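The output is labeled “Welch Two Sample t-test” because t.test() does not assume equal variances by default. As a rough comparison, the sketch below runs the same test with var.equal = TRUE, which pools the two variances as in the classic Student’s t-test. The object names welch and pooled are illustrative.

```r
library(ggplot2)  # provides the midwest data set
library(dplyr)

sub <- midwest %>%
  filter(state %in% c("IL", "OH")) %>%
  select(state, percblack)

# Default: Welch's test, which does not assume equal variances
welch <- t.test(percblack ~ state, data = sub)

# Student's test, which pools the two group variances
pooled <- t.test(percblack ~ state, data = sub, var.equal = TRUE)

# Compare degrees of freedom and p-values side by side
c(welch = unname(welch$parameter), pooled = unname(pooled$parameter))
c(welch = welch$p.value, pooled = pooled$p.value)
```

The pooled version uses n - 2 degrees of freedom, while Welch’s approximation yields a fractional value, which is why the output above shows df = 185.37.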

Yep, with p = 0.86 there isn’t really a difference here. Remember that a paired t-test is only appropriate when the two groups consist of matched pairs, such as the same subjects measured twice.
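To make the paired case concrete, here is a minimal sketch using simulated measurements. The data are made up purely for illustration (the midwest data has no matched pairs), and the names before and after are hypothetical.

```r
set.seed(42)  # make the simulated data reproducible

# Hypothetical matched pairs: the same 40 subjects measured twice
before <- rnorm(40, mean = 50, sd = 10)
after  <- before + rnorm(40, mean = 2, sd = 5)

# Paired t-test on the matched measurements
paired <- t.test(after, before, paired = TRUE)

# Equivalent one-sample t-test on the within-pair differences
diffs <- t.test(after - before, mu = 0)

paired$statistic
diffs$statistic
```

The two statistics match because a paired t-test is a one-sample t-test on the differences within each pair.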