1 Two-Sample T-Test (with VST) (By Ajala, Ponmile - Kininge, Rucha - Tejada, Omar)

In a two-sample t-test, we compare the means of two independent groups to determine whether there is a significant difference between them. We start by assuming normality and constant variance, then explore the case where constant variance does not hold. In that case, we need a Variance Stabilizing Transformation (VST). The test requires exactly two populations; with more than two groups, the analysis becomes an ANOVA.

Summary: Step 1 - Use the t-test if we have constant variance (a strong assumption) and normality (a weaker one). Step 2 - If we do not have constant variance, use a VST (Box-Cox). Step 3 - If the transformation does not help, use a non-parametric test (Kruskal-Wallis).

1.1 Assumptions

  1. The data are not normally distributed. The standard t-test requires normality to be accurate; non-normal data are best handled with non-parametric tests.

  2. The two groups do not have constant variance; alternatively, the degrees of freedom can be adjusted (as in the Welch t-test) when the variances are unequal.

  3. The sample size is small, usually less than 30.

  4. The samples collected are in no way related to each other; in other words, they are independent samples.

  5. The data are collected randomly. This allows generalization to a broader population.

  6. Also, we assume that the population variances are different.

  7. The data collected should not have been altered in any way.

  8. There are no outliers in the sample that could skew the results.

  9. We expect a sample large enough to perform a reliable test.

This section discusses when a two-sample t-test with a VST, or a non-parametric test, is necessary compared to a standard independent t-test. Apply a VST or a non-parametric test if the conditions listed above are true.

The standard independent t-test applies to the following:

  1. The data are normal with equal variance.

  2. The data consist of simple random samples.

1.2 Process

1.2.1 Sample Size Determination

library(pwr)
pwr.t.test(n=NULL,d=.8,sig.level=0.05,power=.80,type="two.sample")
## 
##      Two-sample t test power calculation 
## 
##               n = 25.52458
##               d = 0.8
##       sig.level = 0.05
##           power = 0.8
##     alternative = two.sided
## 
## NOTE: n is number in *each* group
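
Rounding n = 25.52 up, about 26 experimental units per group are needed to detect a large effect (d = 0.8) with 80% power at the 0.05 significance level.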

1.2.2 Design Layout

(Use a modified OutDesign function here to generate the design and export it for data collection.)
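
Since the modified OutDesign function is not shown in this section, here is a minimal base-R sketch of the same idea: generate a completely randomized two-group layout and export it to CSV. The file name and column names are illustrative assumptions.

set.seed(42)                                      # reproducible randomization
n <- 26                                           # per-group size from the power calculation
design <- data.frame(unit = 1:(2*n),
                     group = sample(rep(c("A","B"), each = n)),  # randomized assignment
                     response = NA)               # filled in during data collection
write.csv(design, "two_sample_design.csv", row.names = FALSE)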

1.2.3 Collect Data

(Talk about adding the responses to the CSV, then read the data back in with the InDesign function.)
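
Assuming the responses were entered into the response column of the exported file, a minimal stand-in for the InDesign reader is:

dat <- read.csv("two_sample_design.csv")  # file produced in the design step
str(dat)                                  # verify group and response were read correctly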

1.2.4 Preliminary Plots

(Comment on how these plots might be used before testing.)
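
As a sketch, assuming the data frame dat from the previous step, these preliminary plots compare the two groups' centers, spreads, and individual observations:

boxplot(response ~ group, data = dat,
        main = "Response by Group")             # compare centers and spreads
stripchart(response ~ group, data = dat,
           vertical = TRUE, method = "jitter")  # view individual observations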

1.2.5 Statistical Test

The primary objective of the two-sample t-test is to test whether the means of two populations differ. We state the null and alternative hypotheses as follows:

\[ H_0 : \mu_1 = \mu_2 \]

\[ H_a : \mu_1 \neq \mu_2 \]

Linear Effects Equation

\[ Y_{ij}=\mu +\tau_{i}+\epsilon_{ij} \]

where μ is the overall mean, τ is the treatment effect, and ε is the random error.

Linear Effects Equation after VST

\[ Y_{ij}^{\lambda} =\mu +\tau_{i}+\epsilon_{ij} \]

where ε is now N(0, σ²).
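
Under the hypotheses above, the test itself can be run with t.test. A minimal sketch, using simulated vectors y1 and y2 as stand-ins for real group data (names and values are illustrative):

set.seed(1)
y1 <- rnorm(10, mean = 5, sd = 1)  # simulated responses for group 1
y2 <- rnorm(10, mean = 6, sd = 1)  # simulated responses for group 2
t.test(y1, y2, var.equal = TRUE)   # pooled two-sample t-test (constant variance)
t.test(y1, y2)                     # Welch t-test (default; adjusts df for unequal variance)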

1.2.6 Residuals

We use the Residuals vs Fitted plot to check for constant variance. If we observe a constant spread of the residuals across the groups, then we can assume constant variance. The residuals plot in the example below shows a varying spread between the groups, so we do not have constant variance and a VST is required.
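
A minimal sketch of producing this plot, assuming a model fit with aov as in the worked example below:

model <- aov(response ~ group, data = dat)  # hypothetical fit from the earlier steps
plot(model, which = 1)                      # Residuals vs Fitted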

1.2.7 Inference

After applying the VST, we hope that the residuals plot shows constant variance, so that we meet the conditions to conduct a two-sample t-test on the transformed data.
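
A sketch of re-checking the conditions after transforming, assuming an illustrative lambda of 0.5 chosen from a Box-Cox plot:

dat$response_t <- dat$response^0.5             # illustrative lambda = 0.5
model_t <- aov(response_t ~ group, data = dat)
plot(model_t, which = 1)                       # spread should now be roughly constant
t.test(response_t ~ group, data = dat, var.equal = TRUE)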

1.2.8 Multiple Comparisons

To verify our result, we may follow up with a non-parametric approach using a Kruskal-Wallis test.

1.3 Example

(Review all relevant steps above. When reading in data, read it from your own GitHub account.)

2 Example

[This is where we state the problem, and repeat the steps we mentioned above]

We fit an ANOVA model and use the Residuals vs Fitted plot to assess constant variance and normality.

#Question b
library(tidyr)
library(dplyr)
data1<-c(0.34,0.12,1.23,0.70,1.75,0.12)
data2<-c(0.91,2.94,2.14,2.36,2.86,4.55)
data3<-c(6.31,8.37,9.75,6.09,9.82,7.24)
data4<-c(17.15,11.82,10.97,17.20,14.35,16.82)
data<-data.frame(data1,data2,data3,data4)
datapivot<-pivot_longer(data,c(data1,data2,data3,data4))  # stack columns into long (name, value) format
aov.model<-aov(value~name,data=datapivot)                 # one-way ANOVA of value by group
summary(aov.model)
##             Df Sum Sq Mean Sq F value Pr(>F)    
## name         3  708.7   236.2   76.29  4e-11 ***
## Residuals   20   61.9     3.1                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(aov.model, which = 1)  # Residuals vs Fitted plot

Conclusion:

Looking at the Residuals vs Fitted plot, we cannot confirm a constant spread of the residuals across the different fitted values, which indicates unstable variance.

For validation, we may also use boxplots to visually check the variance, as shown below. The spread of the boxes varies too much across the estimation methods, indicating unstable variance.

#visual way
dataplot<-c(data1,data2,data3,data4)        # all observations in one vector
x<-c(rep(1,6),rep(2,6),rep(3,6),rep(4,6))   # group label for each observation
meanx<-c(rep(mean(data1),6),rep(mean(data2),6),rep(mean(data3),6),rep(mean(data4),6))  # group means, for reference
boxplot(dataplot~x,xlab="Estimation Method",ylab="Observation",main="Boxplot of Observations")

Performing the Kruskal-Wallis test:

#Question c
kruskal.test(value~name,data=datapivot)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  value by name
## Kruskal-Wallis chi-squared = 21.156, df = 3, p-value = 9.771e-05

With a p-value of 9.771e-05 at the 0.05 level of significance, we reject the null hypothesis.

#Question d
library(MASS)
boxcox(dataplot~x)   # profile log-likelihood for the Box-Cox lambda

Based on the above plot, we can say the value of lambda is approximately 0.5, and 1 is not in the confidence interval, so a transformation is warranted.

lambda<-0.5                  # only if 1 is not in the CI on lambda
dataplot<-dataplot^(lambda)  # if lambda is not zero
#dataplot<-log(dataplot)     # if lambda is equal to zero
boxcox(dataplot~x)           # re-check lambda on the transformed data

The above plot now shows a value of lambda close to 1, indicating the transformation was effective.

boxplot(dataplot~x,xlab="Estimation Method",ylab="Transformed Observation",main="Boxplot of Transformed Observations")