We’re going to take a look at t-tests using a data set that includes a dichotomous variable, meaning one that takes only two values. Typically you’d use a t-test to compare the means of two groups on a continuous outcome, so its use is limited to a single point in time and exactly two groups. You can also compare the mean of a single continuous variable against a fixed value. So let’s show both.
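Remember that the data in each group should be approximately normal, or the sample should be large enough (roughly n > 30) for the central limit theorem to make the sample mean approximately normal.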
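To make the n > 30 rule of thumb a bit more concrete, here is a minimal simulation sketch. It is not part of the analysis that follows: the exponential population, the seed, and the number of replications are all arbitrary choices made for illustration.

```r
set.seed(42)  # arbitrary seed, only so the sketch is reproducible

# Draw many samples of size 30 from a strongly right-skewed population
# (exponential with mean 1) and keep the mean of each sample.
n <- 30
sample_means <- replicate(5000, mean(rexp(n, rate = 1)))

# The sample means should centre near the population mean (1), with a
# spread close to the theoretical standard error 1 / sqrt(n).
mean(sample_means)
sd(sample_means)
1 / sqrt(n)

# The histogram of sample means looks far more bell-shaped than the
# skewed population the individual observations came from.
hist(sample_means, breaks = 40,
     main = "Means of 5000 samples of size 30",
     xlab = "Sample mean")
```

Even though the individual observations are far from normal, the means of the samples behave roughly normally, which is what the t-test relies on.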
library(ggplot2)
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.5.3
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
## Warning: package 'tidyr' was built under R version 3.5.3
library(magrittr)
## Warning: package 'magrittr' was built under R version 3.5.2
##
## Attaching package: 'magrittr'
## The following object is masked from 'package:tidyr':
##
## extract
library(gridExtra)
## Warning: package 'gridExtra' was built under R version 3.5.3
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
We’re going to be using the “midwest” data from the ggplot2 package. Eventually we can just go ahead and use whatever data we want, but this is easiest.
head(midwest, 10)
## # A tibble: 10 x 28
## PID county state area poptotal popdensity popwhite popblack
## <int> <chr> <chr> <dbl> <int> <dbl> <int> <int>
## 1 561 ADAMS IL 0.052 66090 1271. 63917 1702
## 2 562 ALEXA~ IL 0.014 10626 759 7054 3496
## 3 563 BOND IL 0.022 14991 681. 14477 429
## 4 564 BOONE IL 0.017 30806 1812. 29344 127
## 5 565 BROWN IL 0.018 5836 324. 5264 547
## 6 566 BUREAU IL 0.05 35688 714. 35157 50
## 7 567 CALHO~ IL 0.017 5322 313. 5298 1
## 8 568 CARRO~ IL 0.027 16805 622. 16519 111
## 9 569 CASS IL 0.024 13437 560. 13384 16
## 10 570 CHAMP~ IL 0.058 173025 2983. 146506 16559
## # ... with 20 more variables: popamerindian <int>, popasian <int>,
## # popother <int>, percwhite <dbl>, percblack <dbl>, percamerindan <dbl>,
## # percasian <dbl>, percother <dbl>, popadults <int>, perchsd <dbl>,
## # percollege <dbl>, percprof <dbl>, poppovertyknown <int>,
## # percpovertyknown <dbl>, percbelowpoverty <dbl>,
## # percchildbelowpovert <dbl>, percadultpoverty <dbl>,
## # percelderlypoverty <dbl>, inmetro <int>, category <chr>
Perfect! We have an “inmetro” variable that is dichotomous, alongside many continuous variables, so this should work out well.
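Before leaning on a dichotomous variable, it can help to confirm that it really has two levels and that neither group is tiny. A quick sketch, assuming the ggplot2 version of midwest shown above:

```r
library(ggplot2)  # provides the midwest data set

# Count how many counties fall into each level of inmetro
group_sizes <- table(midwest$inmetro)
group_sizes

# Check both groups against the n > 30 rule of thumb
all(group_sizes > 30)
```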
Let’s start with a one-sample t-test, where we compare the mean of a variable against a fixed value of our choosing. I want to see whether the average county in the midwest is less than 3% Asian.
t.test(midwest$percasian, mu = 3, alternative = "less")
##
## One Sample t-test
##
## data: midwest$percasian
## t = -83.662, df = 436, p-value < 2.2e-16
## alternative hypothesis: true mean is less than 3
## 95 percent confidence interval:
## -Inf 0.5367536
## sample estimates:
## mean of x
## 0.4872462
Looks like I’m right: the mean county-level percentage is about 0.49%, which is significantly less than 3 percent Asian.
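To see where the t value comes from, the statistic can be rebuilt by hand as t = (sample mean - mu) / (s / sqrt(n)). This is only a sketch for intuition; the variable names are mine, and t.test() remains the convenient way to run the test.

```r
library(ggplot2)  # provides the midwest data set

x  <- midwest$percasian
mu <- 3            # the hypothesised mean we are testing against
n  <- length(x)

# t statistic: distance of the sample mean from mu, in standard-error units
t_stat <- (mean(x) - mu) / (sd(x) / sqrt(n))

# One-sided p-value for alternative = "less" is the lower tail
# of the t distribution with n - 1 degrees of freedom
p_val <- pt(t_stat, df = n - 1)

t_stat
p_val
```

The value of t_stat should match the t reported by t.test() above, with n - 1 degrees of freedom.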
What if the percentage is not normally distributed? We would then use a non-parametric alternative, the Wilcoxon signed-rank test.
wilcox.test(midwest$percasian, mu = 3, alternative = "less")
##
## Wilcoxon signed rank test with continuity correction
##
## data: midwest$percasian
## V = 117, p-value < 2.2e-16
## alternative hypothesis: true location is less than 3
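One way to judge whether the non-parametric route is worth taking is to look at the shape of the variable directly. The sketch below compares the mean with the median, runs a Shapiro-Wilk test (whose null hypothesis is normality), and draws a histogram. Treat it as a rough diagnostic, not a formal decision rule.

```r
library(ggplot2)  # provides the midwest data set

x <- midwest$percasian

# A mean well above the median hints at a right-skewed distribution
centre <- c(mean = mean(x), median = median(x))
centre

# Shapiro-Wilk test: a small p-value is evidence against normality
sw <- shapiro.test(x)
sw

# A histogram makes any skew easy to see
hist(x, breaks = 40,
     main = "County-level percent Asian",
     xlab = "percasian")
```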
Let’s now do a two-sample t-test, which is pretty straightforward: we’re now comparing two different means, both taken from our data set, rather than comparing one mean against an arbitrarily chosen value of interest.
To do this, I want to subset some data, specifically the percentage of the population that is Black in Illinois and Ohio counties.
sub <- midwest %>%
  filter(state == "IL" | state == "OH") %>%
  select(state, percblack)
summary(sub)
## state percblack
## Length:190 Min. : 0.00943
## Class :character 1st Qu.: 0.20466
## Mode :character Median : 1.30147
## Mean : 3.58947
## 3rd Qu.: 4.22224
## Max. :32.90043
Let’s visualize the data first.
ggplot(sub, aes(state, percblack)) + geom_boxplot()
Looks like the two groups do differ, but not by much (keep in mind that a boxplot shows medians, not means). Hey! Let’s just see what our t.test says.
t.test(percblack ~ state, data = sub)
##
## Welch Two Sample t-test
##
## data: percblack by state
## t = 0.17125, df = 185.37, p-value = 0.8642
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.467854 1.746908
## sample estimates:
## mean in group IL mean in group OH
## 3.654094 3.514567
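Yep, there isn’t really a difference here: the p-value is 0.86 and the confidence interval for the difference in means includes 0. Remember that a paired t-test would only be appropriate if the observations came in matched pairs, such as the same units measured twice.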
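For completeness, here is a hedged sketch of what a paired test could look like on this data. The pairing is my own choice for illustration, not part of the analysis above: each county supplies both a child poverty rate (percchildbelowpovert) and an elderly poverty rate (percelderlypoverty), so the two measurements are matched by county.

```r
library(ggplot2)  # provides the midwest data set

# Paired t-test: each county contributes one (child, elderly) pair
paired_res <- t.test(midwest$percchildbelowpovert,
                     midwest$percelderlypoverty,
                     paired = TRUE)
paired_res

# A paired t-test is the same as a one-sample t-test on the
# within-pair differences, tested against a mean of 0
diffs <- midwest$percchildbelowpovert - midwest$percelderlypoverty
diff_res <- t.test(diffs, mu = 0)
diff_res
```

The two calls give the same t statistic, which is a handy way to remember what “paired” means: the test only looks at the difference within each pair.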