T-Tests: One sample, Paired, Two sample

1) One Sample T-tests for the Mean

Example: Call Lengths

Answers.com claims that the mean length of all cell phone conversations in the United States is 195 seconds (3 minutes 15 seconds). One researcher believes this 195 value is outdated and that the true mean time spent on a cell phone call is something other than 195 seconds. He collects a random sample of 100 phone call lengths from a phone company’s records and finds a sample mean of 182 seconds and a standard deviation of 245 seconds.

# EX: CELL PHONE CONVERSATIONS
mu0<-195
xbar<-182
s<-245
n<-100

# calculate the test statistic
t_stat<-(xbar-mu0)/(s/sqrt(n))
t_stat
## [1] -0.5306122
# P-value for two-sided alternative 
pt(abs(t_stat), df=n-1, lower.tail=FALSE)*2
## [1] 0.5968758
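
The same calculation can be wrapped in a small function so it can be reused whenever only summary statistics are available. This is a minimal sketch: `t_test_summary` is a hypothetical helper name, not a base R function, and its `alternative` argument simply mirrors the naming used by `t.test()`.

```r
# Hypothetical helper (not part of base R):
# one-sample t-test computed from summary statistics only
t_test_summary <- function(xbar, s, n, mu0,
                           alternative = c("two.sided", "less", "greater")) {
  alternative <- match.arg(alternative)
  t_stat <- (xbar - mu0) / (s / sqrt(n))   # test statistic
  df <- n - 1                              # degrees of freedom
  p_value <- switch(alternative,
    two.sided = 2 * pt(abs(t_stat), df = df, lower.tail = FALSE),
    less      = pt(t_stat, df = df),
    greater   = pt(t_stat, df = df, lower.tail = FALSE))
  list(t_stat = t_stat, df = df, p_value = p_value)
}

# Cell phone example: two-sided alternative
calls <- t_test_summary(xbar = 182, s = 245, n = 100, mu0 = 195)
calls$t_stat
calls$p_value
```

The p-value of about 0.597 is larger than any commonly used significance level (such as 0.05), so this sample does not give convincing evidence that the mean call length differs from 195 seconds.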

Example: Shark Lengths

Although it is known that the white shark grows to a mean length of 21 feet, a marine biologist believes that the great white sharks off the Bermuda coast grow much longer due to unusual feeding habits. To test this claim, a number of full-grown great white sharks are captured off the Bermuda coast, measured and then set free. For the 15 sharks that were caught, a sample mean of 22.1 feet and a sample standard deviation of 3.2 feet were found. A histogram of the sample data appeared to be approximately normal. Test the claim using a significance level of 0.05.

# EX: SHARKS
mu0<-21
xbar<-22.1
s<-3.2
n<-15

# calculate the test statistic
t_stat<-(xbar-mu0)/(s/sqrt(n))
t_stat
## [1] 1.331338
# one-sided upper alternative 
pt(t_stat, df=n-1, lower.tail=FALSE)
## [1] 0.1021756
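
An equivalent way to reach the decision is to compare the test statistic with the critical value for the chosen significance level. The sketch below assumes the same summary statistics as above; the names `alpha`, `t_crit`, and `reject` are introduced here only for illustration.

```r
# Shark example: critical-value approach at the 0.05 level
mu0  <- 21
xbar <- 22.1
s    <- 3.2
n    <- 15
alpha <- 0.05

t_stat <- (xbar - mu0) / (s / sqrt(n))

# upper-tail critical value for a one-sided test with n-1 degrees of freedom
t_crit <- qt(1 - alpha, df = n - 1)
t_crit   # about 1.761

# reject H0 only if the test statistic exceeds the critical value
reject <- t_stat > t_crit
reject
```

Because the test statistic (about 1.33) is below the critical value (about 1.76), the null hypothesis is not rejected at the 0.05 level. This agrees with the p-value of about 0.102 being greater than 0.05.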

2) Paired T-test for Mean of Differences

Example: T-cells

There is evidence that T cells participate in controlling tumor growth and that they can be harnessed to use the body’s immune system to treat cancer. One study investigated the use of a T cell-engaging antibody to recruit T cells to control tumor growth. The data below are T cell counts (1000 per microliter) at baseline and after 20 days on this antibody for 6 subjects.

Do the data give convincing evidence that the mean count of T cells is higher after 20 days on this antibody at the 0.01 significance level?

# EX: T-cells
before<-c(0.04, 0.02, 0.00, 0.02, 0.33, 0.38)
after<-c(0.28, 0.47, 1.30, 0.25, 0.44, 1.22)
diff<-after-before
diff
## [1] 0.24 0.45 1.30 0.23 0.11 0.84
# summary statistics
mu0<-0
dbar<-mean(diff)
dbar
## [1] 0.5283333
s<-sd(diff)
s
## [1] 0.4573584
n<-length(diff)
n
## [1] 6
# calculate the test statistic
t_stat<-(dbar-mu0)/(s/sqrt(n))
t_stat
## [1] 2.829613
# one-sided upper alternative 
pt(t_stat, df=n-1, lower.tail=FALSE)
## [1] 0.01834571
Instead of coding this from scratch, you can use R's built-in t.test() function!
# code for passing in two vectors for a paired t-test
t.test(after, before, paired=TRUE, alternative = "greater", conf.level=.99)
## 
##  Paired t-test
## 
## data:  after and before
## t = 2.8296, df = 5, p-value = 0.01835
## alternative hypothesis: true difference in means is greater than 0
## 99 percent confidence interval:
##  -0.09995215         Inf
## sample estimates:
## mean of the differences 
##               0.5283333
# or you can pass in a single vector of differences and test against 0
t.test(diff, mu=0, alternative = "greater", conf.level=.99)
## 
##  One Sample t-test
## 
## data:  diff
## t = 2.8296, df = 5, p-value = 0.01835
## alternative hypothesis: true mean is greater than 0
## 99 percent confidence interval:
##  -0.09995215         Inf
## sample estimates:
## mean of x 
## 0.5283333
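
The one-sided 99 percent confidence bound reported by `t.test()` can also be reproduced by hand, which helps connect the interval to the test decision. This is a sketch using the same data; the names `d`, `se`, and `lower` are introduced here for illustration.

```r
# T-cell example: one-sided 99% lower confidence bound by hand
before <- c(0.04, 0.02, 0.00, 0.02, 0.33, 0.38)
after  <- c(0.28, 0.47, 1.30, 0.25, 0.44, 1.22)
d <- after - before

n  <- length(d)
se <- sd(d) / sqrt(n)                 # standard error of the mean difference

# lower bound: mean difference minus (critical value * standard error)
lower <- mean(d) - qt(0.99, df = n - 1) * se
lower

# compare with the bound reported by t.test()
res <- t.test(after, before, paired = TRUE,
              alternative = "greater", conf.level = 0.99)
res$conf.int[1]
```

The lower bound is negative (about -0.10), so 0 is a plausible value for the mean difference at this confidence level. This matches the test result: the p-value of about 0.018 is greater than 0.01, so the null hypothesis is not rejected at the 0.01 significance level.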

FOR YOUR HOMEWORK:

You will need to import the football dataset. Please use the following code:

football<-read.csv("https://raw.githubusercontent.com/kitadasmalley/fa2020_MATH138/main/data/football.csv",
                    header=TRUE)

head(football)
##   Trial Helium Air Difference
## 1     1     25  25          0
## 2     2     16  23         -7
## 3     3     25  18          7
## 4     4     14  16         -2
## 5     5     23  35        -12
## 6     6     29  15         14
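
As a sketch of how the paired functions apply to data stored in a data frame, the code below rebuilds only the six rows printed above (the name `first_six` is introduced here for illustration) and checks that the `Difference` column is `Helium` minus `Air`. The alternative hypothesis and significance level for the homework are not given here, so the calls use the `t.test()` defaults.

```r
# Only the six rows shown by head(football); the full dataset has more trials
first_six <- data.frame(
  Trial      = 1:6,
  Helium     = c(25, 16, 25, 14, 23, 29),
  Air        = c(25, 23, 18, 16, 35, 15),
  Difference = c(0, -7, 7, -2, -12, 14)
)

# the Difference column is Helium minus Air
diff_ok <- all(first_six$Helium - first_six$Air == first_six$Difference)
diff_ok

# passing the two columns with paired=TRUE ...
res_pairs <- t.test(first_six$Helium, first_six$Air, paired = TRUE)

# ... gives the same test statistic as a one-sample test on the differences
res_diff <- t.test(first_six$Difference, mu = 0)
same_t <- isTRUE(all.equal(unname(res_pairs$statistic),
                           unname(res_diff$statistic)))
same_t
```

With the full dataset, replace `first_six` with `football` in the `t.test()` calls.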

3) Two-Sample T-test for Difference in Means

(NOT Assuming Equal Variance)

Example: Heliconia Flowers

Different varieties of the tropical flower Heliconia are fertilized by different species of hummingbirds. Over time, the lengths of the flowers and the forms of the hummingbirds’ beaks have evolved to match each other.

Data on the lengths in millimeters of two color varieties of the same species of flower on the island of Dominica can be found in the R script.

# EX: TROPICAL FLOWERS

# H. caribaea RED
red<-c(42.90, 42.01, 41.93, 43.09, 41.47, 41.69, 39.78, 
       39.63, 42.18, 40.66, 37.87, 39.16, 37.40, 38.20,
       38.10, 37.97, 38.79, 38.23, 38.87, 37.78, 38.01)

# H. caribaea YELLOW
yellow<-c(36.78, 37.02, 36.52, 36.11, 36.03, 35.45, 38.13,
          37.1, 35.17, 36.82, 36.66, 35.68, 36.03, 34.57, 34.63)

# Create a data frame
flowers<-data.frame(color=c(rep("red", length(red)),
                            rep("yellow", length(yellow))),
                    length=c(red, yellow))

# into the verse!
library(tidyverse)
# Let's take a look at this problem
# Do the means of these distributions look significantly different?
ggplot(data=flowers, aes(y=length, x=color, fill=color))+
  geom_boxplot()

# Learning objectives: the pipe operator, group_by, and summarise
flowers%>%
  group_by(color)%>%
  summarise(mean=mean(length),
            sd=sd(length),
            n=n())
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 4
##   color   mean    sd     n
##   <fct>  <dbl> <dbl> <int>
## 1 red     39.8 1.91     21
## 2 yellow  36.2 0.975    15
# We can check this!
# sample stats
xbar1<-mean(red) #39.79619
s1<-sd(red) #1.910195
n1<- length(red) #21

xbar2<-mean(yellow) #36.18
s2<-sd(yellow) #0.9753241
n2<-length(yellow) #15

# standard error 
se<-sqrt(s1^2/n1 + s2^2/n2)
se #0.4870027
## [1] 0.4870027
# conservative estimate of degrees of freedom: the smaller of n1-1 and n2-1
approxDF<-min(n1-1, n2-1)
approxDF #14
## [1] 14
# estimate a 95% confidence interval 
(xbar1-xbar2)+c(-1,1)*qt(0.975, df=approxDF)*se
## [1] 2.571674 4.660707
#2.571674 4.660707

# test for significant difference
tstat2<-(xbar1-xbar2)/se
tstat2 #7.425401
## [1] 7.425401
# two-sided 
pt(abs(tstat2), df=approxDF, lower.tail=F)*2
## [1] 3.225031e-06
# 3.225031e-06

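# Now let's use the t.test() function
# (it uses the Welch approximation for df instead of the conservative df=14,
#  so its p-value and interval differ slightly from the results above)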
t.test(red, yellow)
## 
##  Welch Two Sample t-test
## 
## data:  red and yellow
## t = 7.4254, df = 31.306, p-value = 2.171e-08
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  2.623335 4.609046
## sample estimates:
## mean of x mean of y 
##  39.79619  36.18000
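
The `df = 31.306` in the output above comes from the Welch-Satterthwaite approximation rather than the conservative `min(n1-1, n2-1)` rule. The sketch below computes it by hand; the names `v1`, `v2`, and `welch_df` are introduced here for illustration.

```r
# Welch-Satterthwaite degrees of freedom by hand
red <- c(42.90, 42.01, 41.93, 43.09, 41.47, 41.69, 39.78,
         39.63, 42.18, 40.66, 37.87, 39.16, 37.40, 38.20,
         38.10, 37.97, 38.79, 38.23, 38.87, 37.78, 38.01)
yellow <- c(36.78, 37.02, 36.52, 36.11, 36.03, 35.45, 38.13,
            37.1, 35.17, 36.82, 36.66, 35.68, 36.03, 34.57, 34.63)

n1 <- length(red)
n2 <- length(yellow)

# each group's contribution to the squared standard error
v1 <- var(red) / n1
v2 <- var(yellow) / n2

# df = (v1 + v2)^2 / ( v1^2/(n1-1) + v2^2/(n2-1) )
welch_df <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))
welch_df

# two-sided p-value using the Welch df
t_welch <- (mean(red) - mean(yellow)) / sqrt(v1 + v2)
p_welch <- 2 * pt(abs(t_welch), df = welch_df, lower.tail = FALSE)
p_welch
```

The Welch df is larger than the conservative value of 14, which is why `t.test()` reports a smaller p-value and a slightly narrower confidence interval than the hand calculation.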

4) ANOVA

Compare multiple means!

Example: Heliconia (extended)

Let's look at another species!

## CATEGORICAL Predictor
# H. Bihai
bihai<-c(47.12, 48.07, 46.75, 48.34, 46.81, 48.15, 47.12, 50.26, 
         46.67, 50.12, 47.43, 46.34, 46.44, 46.94, 46.64, 48.36)
# H. caribaea RED
red<-c(42.90, 42.01, 41.93, 43.09, 41.47, 41.69, 39.78, 
       39.63, 42.18, 40.66, 37.87, 39.16, 37.40, 38.20,
       38.10, 37.97, 38.79, 38.23, 38.87, 37.78, 38.01)

# H. caribaea YELLOW
yellow<-c(36.78, 37.02, 36.52, 36.11, 36.03, 35.45, 38.13,
          37.1, 35.17, 36.82, 36.66, 35.68, 36.03, 34.57, 34.63)

type<-c(rep("bihai", length(bihai)),
        rep("red", length(red)),
        rep("yellow", length(yellow)))
lengths<-c(bihai, red, yellow)

heliconia<-data.frame(type, lengths)

ggplot(heliconia, aes(y=lengths, x=type, fill=type))+
  geom_boxplot()

# fit a one-way ANOVA model using the columns of the heliconia data frame
m2<-lm(lengths~type, data=heliconia)
anova(m2)
## Analysis of Variance Table
## 
## Response: lengths
##           Df  Sum Sq Mean Sq F value    Pr(>F)    
## type       2 1074.13  537.06  242.86 < 2.2e-16 ***
## Residuals 49  108.36    2.21                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
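
To see where the numbers in the ANOVA table come from, the F statistic can be computed by hand from the between-group and within-group sums of squares. This is a sketch using the same three samples; the names `groups`, `ss_between`, `ss_within`, and `f_stat` are introduced here for illustration.

```r
# One-way ANOVA F statistic by hand
bihai <- c(47.12, 48.07, 46.75, 48.34, 46.81, 48.15, 47.12, 50.26,
           46.67, 50.12, 47.43, 46.34, 46.44, 46.94, 46.64, 48.36)
red <- c(42.90, 42.01, 41.93, 43.09, 41.47, 41.69, 39.78,
         39.63, 42.18, 40.66, 37.87, 39.16, 37.40, 38.20,
         38.10, 37.97, 38.79, 38.23, 38.87, 37.78, 38.01)
yellow <- c(36.78, 37.02, 36.52, 36.11, 36.03, 35.45, 38.13,
            37.1, 35.17, 36.82, 36.66, 35.68, 36.03, 34.57, 34.63)

groups <- list(bihai = bihai, red = red, yellow = yellow)
all_lengths <- unlist(groups)
grand_mean  <- mean(all_lengths)
k <- length(groups)        # number of groups
N <- length(all_lengths)   # total sample size

# between-group sum of squares: group size * (group mean - grand mean)^2
ss_between <- sum(sapply(groups, function(g) length(g) * (mean(g) - grand_mean)^2))

# within-group (residual) sum of squares
ss_within <- sum(sapply(groups, function(g) sum((g - mean(g))^2)))

# F = mean square between / mean square within
f_stat <- (ss_between / (k - 1)) / (ss_within / (N - k))
f_stat

# p-value from the F distribution with (k-1, N-k) degrees of freedom
p_value <- pf(f_stat, df1 = k - 1, df2 = N - k, lower.tail = FALSE)
p_value
```

The degrees of freedom `k - 1 = 2` and `N - k = 49` match the `Df` column of the table, and `ss_between` and `ss_within` correspond to its `Sum Sq` column.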