This chapter explains the independent t-test in more detail. It assumes you already know the basic concepts: the t-test, the single-sample t-test, the p-value, the standard deviation (SD), and so on. If you are not familiar with them, please read http://rpubs.com/Evan_Jung/292325 first.
The independent t-test compares the means of two unrelated samples.
Examples include males vs. females, treatment vs. control, and patients vs. healthy subjects.
When should you use it? When you want to compare the means of two independent groups.
# Load the wm dataset and keep the rows with train == 1 as wm_t
wm <- read.csv(file = "Data Files/wm.csv", stringsAsFactors = FALSE)
wm_t <- subset(wm, wm$train == 1)
# Create subsets for each training time
wm_t08 <- subset(wm_t, wm_t$cond == "t08")
wm_t12 <- subset(wm_t, wm_t$cond == "t12")
wm_t17 <- subset(wm_t, wm_t$cond == "t17")
wm_t19 <- subset(wm_t, wm_t$cond == "t19")
# Summary statistics for the change in training scores before and after training
library(psych)
##
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
describe(wm_t08)
## Warning: NAs introduced by coercion
## Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning
## Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning
## -Inf
## vars n mean sd median trimmed mad min max range skew kurtosis
## cond* 1 20 NaN NA NA NaN NA Inf -Inf -Inf NA NA
## pre 2 20 10.05 1.50 10.0 10.06 1.48 8 12 4 0.01 -1.53
## post 3 20 11.40 2.14 11.5 11.50 2.22 7 15 8 -0.25 -0.84
## gain 4 20 1.35 1.23 1.0 1.44 1.48 -1 3 4 -0.32 -0.82
## train 5 20 1.00 0.00 1.0 1.00 0.00 1 1 0 NaN NaN
## se
## cond* NA
## pre 0.34
## post 0.48
## gain 0.27
## train 0.00
describe(wm_t12)
## Warning in describe(wm_t12): NAs introduced by coercion
## Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning
## Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning
## -Inf
## vars n mean sd median trimmed mad min max range skew kurtosis
## cond* 1 20 NaN NA NA NaN NA Inf -Inf -Inf NA NA
## pre 2 20 9.9 1.45 10 9.88 1.48 8 12 4 0.16 -1.43
## post 3 20 12.5 1.88 12 12.38 2.22 10 17 7 0.48 -0.54
## gain 4 20 2.6 1.27 2 2.50 0.00 0 5 5 0.44 -0.54
## train 5 20 1.0 0.00 1 1.00 0.00 1 1 0 NaN NaN
## se
## cond* NA
## pre 0.32
## post 0.42
## gain 0.28
## train 0.00
describe(wm_t17)
## Warning in describe(wm_t17): NAs introduced by coercion
## Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning
## Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning
## -Inf
## vars n mean sd median trimmed mad min max range skew kurtosis
## cond* 1 20 NaN NA NA NaN NA Inf -Inf -Inf NA NA
## pre 2 20 10.0 1.34 10 10.00 1.48 8 12 4 0.25 -1.34
## post 3 20 14.4 1.85 14 14.25 1.48 12 19 7 0.63 -0.27
## gain 4 20 4.4 1.39 4 4.25 1.48 3 7 4 0.64 -1.12
## train 5 20 1.0 0.00 1 1.00 0.00 1 1 0 NaN NaN
## se
## cond* NA
## pre 0.30
## post 0.41
## gain 0.31
## train 0.00
describe(wm_t19)
## Warning in describe(wm_t19): NAs introduced by coercion
## Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning
## Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning
## -Inf
## vars n mean sd median trimmed mad min max range skew kurtosis
## cond* 1 20 NaN NA NA NaN NA Inf -Inf -Inf NA NA
## pre 2 20 10.15 1.27 10.0 10.19 1.48 8 12 4 0.03 -1.10
## post 3 20 15.75 1.86 16.0 15.69 1.48 13 19 6 0.16 -1.03
## gain 4 20 5.60 1.73 5.5 5.50 2.22 3 9 6 0.36 -0.76
## train 5 20 1.00 0.00 1.0 1.00 0.00 1 1 0 NaN NaN
## se
## cond* NA
## pre 0.28
## post 0.42
## gain 0.39
## train 0.00
# Create a boxplot of the different training times
ggplot(wm_t, aes(x = cond, y = gain, fill = cond)) + geom_boxplot()
What do you see? The boxplot suggests a difference between the groups. But we have to doubt this result and ask again: is the difference significant, or did it simply happen by chance? A t-test answers that question, but before running it we have to check one of its assumptions with Levene’s test.
The problem is this: what if the group variances are not equal? In that case the SE, the sampling distribution, and the p-value that we compute would all be invalid. To check this we need a statistical method, Levene’s test.
Levene’s test compares variances, NOT means.
If the test is significant, the homogeneity of variance assumption is violated. A violated assumption says something important about your data. We will see this later.
# Levene's test
# install.packages("car")
library(car)
## Warning: package 'car' was built under R version 3.4.1
##
## Attaching package: 'car'
## The following object is masked from 'package:psych':
##
## logit
leveneTest(wm_t$gain ~ wm_t$cond)
## Warning in leveneTest.default(y = y, group = group, ...): group coerced to
## factor.
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 3 1.3134 0.2763
## 76
The leveneTest() function is from the ‘car’ package. The p-value from Levene’s test is 0.28, so the result is not significant and we have no evidence that the homogeneity of variance assumption is violated.
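To see what the test is doing, here is a minimal sketch in base R. It assumes that Levene’s test with center = median amounts to a one-way ANOVA on the absolute deviations of each score from its own group median. The data are simulated and the names group, score and abs_dev are hypothetical; nothing here comes from wm.csv.

```r
# Hypothetical data: three groups, the third with a larger spread
set.seed(1)
group <- factor(rep(c("a", "b", "c"), each = 20))
score <- c(rnorm(20, sd = 1), rnorm(20, sd = 1), rnorm(20, sd = 3))

# Absolute deviation of each score from its own group's median
abs_dev <- abs(score - ave(score, group, FUN = median))

# One-way ANOVA on the deviations; its F test plays the role of Levene's test
levene_manual <- anova(lm(abs_dev ~ group))
levene_manual
```

The first row of the ANOVA table holds the F value and p-value for the group effect, which can be compared with the output of leveneTest() on the same data.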
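Had the assumption been violated, we would use Welch’s procedure, which makes the independent t-test more conservative. Because it holds here, we continue with the standard independent t-test, which treats the two group variances as equal.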
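For comparison, Welch’s procedure keeps the same t-value but replaces the degrees of freedom with the Welch-Satterthwaite approximation. The sketch below uses simulated data (x and y are hypothetical vectors, not the wm groups) and checks the hand-computed value against t.test(), which applies Welch’s procedure when var.equal is left at its default of FALSE.

```r
# Hypothetical data: two independent groups with unequal variances
set.seed(42)
x <- rnorm(20, mean = 5, sd = 1)
y <- rnorm(20, mean = 4, sd = 3)

# Squared standard error of each group mean
se2_x <- var(x) / length(x)
se2_y <- var(y) / length(y)

# Welch-Satterthwaite degrees of freedom
welch_df <- (se2_x + se2_y)^2 /
  (se2_x^2 / (length(x) - 1) + se2_y^2 / (length(y) - 1))

# t.test() defaults to var.equal = FALSE, i.e. Welch's procedure
welch <- t.test(x, y)
c(manual = welch_df, t.test = unname(welch$parameter))
```

The adjusted degrees of freedom never exceed n1 + n2 - 2, which is why the resulting test is the more conservative one.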
t-value = (observed difference - expected difference) / SE, where the expected difference under the null hypothesis is 0.
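Written out for two independent groups, the same formula looks as follows. The notation is our own: the bars are the sample means, s² the sample variances, and n the sample sizes of groups 1 and 2.

```latex
t = \frac{(\bar{x}_1 - \bar{x}_2) - 0}{SE},
\qquad
SE = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}
```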
Okay, let’s calculate this manually.
# Find the mean intelligence gain for both the 8 and 19 training day group
# Step 0. Determine the number of subjects in each sample.
n_t08 <- nrow(wm_t08)
n_t19 <- nrow(wm_t19)
# Step 1. Mean
mean_t08 <- mean(wm_t08$gain)
mean_t19 <- mean(wm_t19$gain)
# Step 2. Variance
var_t08 <- var(wm_t08$gain) # sum((wm_t08$gain - mean_t08)^2) / (n_t08 - 1)
var_t19 <- var(wm_t19$gain) # sum((wm_t19$gain - mean_t19)^2) / (n_t19 - 1)
# Step 3. Calculate standard deviations
SD_t08 <- sd(wm_t08$gain)
SD_t19 <- sd(wm_t19$gain)
# Step 4. SE(t08) & SE(t19)
SE_t08 <- sqrt(var_t08/n_t08)
SE_t19 <- sqrt(var_t19/n_t19)
# Step 5. SE for independent T-Test
SE_train <- sqrt(var_t08/n_t08 + var_t19/n_t19)
# Step 6. t-value
WM_train_t_value <- (mean_t19 - mean_t08) / SE_train
WM_t_test_df <- data.frame(
  group = c("t08", "t19"),
  sample_size = c(n_t08, n_t19),
  Mean = c(mean_t08, mean_t19),
  Variance = c(var_t08, var_t19),
  Standard_deviation = c(SD_t08, SD_t19),
  Standard_error = c(SE_t08, SE_t19)
)
WM_t_test_df
## group sample_size Mean Variance Standard_deviation Standard_error
## 1 t08 20 1.35 1.502632 1.225819 0.2741014
## 2 t19 20 5.60 2.989474 1.729009 0.3866183
# Step 7. Calculate degrees of freedom
degrees_of_freedom <- (n_t08 + n_t19) - 2
# Step 8. Calculate the two-sided p-value (abs() keeps it valid for a negative t-value)
WM_train_p_value <- 2 * (1 - pt(abs(WM_train_t_value), df = degrees_of_freedom))
# Step 9. Calculate the pooled standard deviation from the variances in Step 2
# (each variance is weighted by its degrees of freedom, not a plain average of the SDs)
SD_train <- sqrt(((n_t08 - 1) * var_t08 + (n_t19 - 1) * var_t19) / degrees_of_freedom)
# Step 10. Calculate Cohens d
cohens_d <- (mean_t19 - mean_t08) / SD_train
WM_t_test_result <- data.frame(
  T_Test_Category = c("Standard Error", "T_value", "degrees_of_freedom", "P_value", "Pooled Standard Deviation", "Cohens_d" ),
  Values = c(SE_train, WM_train_t_value, degrees_of_freedom, WM_train_p_value, SD_train, cohens_d)
)
WM_t_test_result
## T_Test_Category Values
## 1 Standard Error 4.739254e-01
## 2 T_value 8.967657e+00
## 3 degrees_of_freedom 3.800000e+01
## 4 P_value 6.443468e-11
## 5 Pooled Standard Deviation 1.498684e+00
## 6 Cohens_d 2.835822e+00
t.test(wm_t19$gain, wm_t08$gain, var.equal = TRUE)
##
## Two Sample t-test
##
## data: wm_t19$gain and wm_t08$gain
## t = 8.9677, df = 38, p-value = 6.443e-11
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 3.290588 5.209412
## sample estimates:
## mean of x mean of y
## 5.60 1.35
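The 95 percent confidence interval in this output can also be checked by hand. The sketch below plugs in the rounded numbers printed earlier (the two group means, the standard error of the difference, and df = 38); the variable names mean_diff, se_diff, t_crit and ci are our own.

```r
# Values copied from the printed results (rounded)
mean_diff <- 5.60 - 1.35   # mean of x minus mean of y
se_diff   <- 0.4739254     # standard error of the difference
df        <- 38            # n1 + n2 - 2

# Critical t-value for a two-sided 95% interval
t_crit <- qt(0.975, df = df)

# Interval: difference plus/minus critical value times standard error
ci <- mean_diff + c(-1, 1) * t_crit * se_diff
ci
```

Up to rounding, this should agree with the interval reported by t.test().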
# install.packages("lsr")
library(lsr)
cohensD(wm_t19$gain, wm_t08$gain, method = "pooled")
## [1] 2.835822
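As a cross-check, the sketch below recomputes Cohen’s d from the summary numbers printed earlier. It assumes that the "pooled" method weights each group’s variance by its degrees of freedom before taking the square root. The variable names are our own, and the variances are the rounded values from the summary table.

```r
# Summary values copied from the printed table (rounded)
n1 <- 20; n2 <- 20
var1 <- 2.989474           # variance of gain, t19 group
var2 <- 1.502632           # variance of gain, t08 group
mean_diff <- 5.60 - 1.35   # mean(t19) - mean(t08)

# Pooled SD: variances weighted by their degrees of freedom
sd_pooled <- sqrt(((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2))

# Cohen's d: mean difference in units of the pooled SD
d <- mean_diff / sd_pooled
d
```

With equal group sizes the pooled SD reduces to the square root of the average of the two variances, which is not the same as the average of the two standard deviations.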