1 Introduction
2 Preparation
- 2.1 Environment
- 2.2 Dataset
3 Statistical Analysis
- 3.1 Two sample t-test
- 3.2 One sample t-test
  - 3.2.1 Assumpted Merlin as population
  - 3.2.2 Assumpted Golden as population
4 Conclusion
5 Reference

1 Introduction

In this study, we can practice the knowledge about the t-test from assumption identified to hypothesis established.

2 Preparation

2.1 Environment

if(!require("pacman")) install.packages("pacman")
pacman::p_load(dplyr, kableExtra, skimr, ggplot2, GGally, ggsci, car)
theme = theme_bw() +
  theme(plot.title = element_text(face = "bold", size = (15)),
        plot.subtitle = element_text(size = (10)),
        axis.title = element_text(size = (10))) +
  theme(axis.text.x = element_text(angle = 0), legend.position = "none")

Let us set up the working environment and be ready for the analysis.

2.2 Dataset

df = read.csv("DATA.csv",
              stringsAsFactors = T)
df$LandID = as.factor(df$LandID)
df %>% 
  kbl(align = "c",
      caption = "Sugarbeets Yield on Herbicide Environment Dataset") %>% 
  kable_classic("hover", "bordered", "striped")

Sugarbeets Yield on Herbicide Environment Dataset
LandID	Variety	Yield
1	Merlin	76.2
1	Merlin	49.6
1	Merlin	61.6
1	Merlin	49.8
1	Merlin	37.0
2	Merlin	44.8
2	Merlin	28.7
2	Merlin	49.6
2	Merlin	38.6
2	Merlin	38.1
3	Golden	47.4
3	Golden	19.8
3	Golden	37.4
3	Golden	34.2
3	Golden	34.0
4	Golden	47.0
4	Golden	24.9
4	Golden	24.6
4	Golden	32.8
4	Golden	46.4

Two kinds of sugar beet plant on five different lands separately. Both have been producing relatively similar yields historically. As sprayed lands with the herbicide last month, both of the sugar beets are still yielding but some differences have been noticed.

skim_without_charts(df)

Data summary

Name	df
Number of rows	20
Number of columns	3
_______________________
Column type frequency:
factor	2
numeric	1
________________________
Group variables	None

Variable type: factor

skim_variable	n_missing	complete_rate	ordered	n_unique	top_counts
LandID	0	1	FALSE	4	1: 5, 2: 5, 3: 5, 4: 5
Variety	0	1	FALSE	2	Gol: 10, Mer: 10

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100
Yield	0	1	41.12	13.27	19.8	33.7	38.35	47.95	76.2

Here is the summary of the dataset.

ggpairs(df,
        columns = c("Yield", "Variety", "LandID"),
        aes(color = Variety,
            alpha = 0.5)) +
  theme +
  theme(axis.text.x = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks.x = element_blank(),
        axis.ticks.y = element_blank()) +
  scale_fill_locuszoom() +
  labs(caption = "RD: Merlin\nOG: Golden")

The ensemble plot can tell the distribution and structure in yield with different groups. It appears that the statistics of Merlin variety is doing better than the Golden variety. We will apply statistical tests to prove the statistical difference or not.

3 Statistical Analysis

3.1 Two sample t-test

a = c("Student’s two sample t-test",
      "Welch’s t-test",
      "Paired t-test",
      "Wilcoxon Matched Pairs",
      "Mann Whitney test")
b = c("v", "v", "v", "x", "x")
c = c("x", "x", "v", "v", "x")
d = c("v", "x", "\\-", "\\-", "\\-")

data.frame(a, b, c, d) %>% 
  rename("Two sample t-tests" = "a",
         "Normally distributed?" = "b",
         "Paired?" = "c",
         "Equal variance?" = "d") %>% 
  kbl(align = "l",
      caption = "v: Yes, x: No, -: Not required") %>% 
  kable_classic("hover", "bordered", "striped")

v: Yes, x: No, -: Not required
Two sample t-tests	Normally distributed?	Paired?	Equal variance?
Student’s two sample t-test	v	x	v
Welch’s t-test	v	x	x
Paired t-test	v	v	-
Wilcoxon Matched Pairs	x	v	-
Mann Whitney test	x	x	-

The table summarizes the three conditions to be fulfilled before deciding which t-test to use.

Are both variables normally distributed?
Are the samples in a paired trial? Commonly known as before/after sample objects experiencing two conditions.
Do they have equal variance?

We can apply additional tests for 1 and 3. Whereas for 2, we can only decide based on the experimental design for sampling.

3.1.1 Shapiro-wilk test of normality

Null hypothesis (H0): normally distributed
Alternative hypothesis (H1): not normally distributed

by(df$Yield, df$Variety, shapiro.test)

## df$Variety: Golden
## 
##  Shapiro-Wilk normality test
## 
## data:  dd[x, ]
## W = 0.91458, p-value = 0.314
## 
## ------------------------------------------------------------ 
## df$Variety: Merlin
## 
##  Shapiro-Wilk normality test
## 
## data:  dd[x, ]
## W = 0.92652, p-value = 0.4145

P-value of both Golden and Merlin sugar beet varieties show that the results are not significant and thus we can not reject H0 and accept H0, in which both yields of Golden and Merlin are normally distributed.

As long as one variable is not normally distributed, the result of this text is considered not normally distributed.

Thus, normally distributed is checked as v.

3.1.2 Deciding are the samples paired

Paired sampling is before/after the experiment. Data are measured in one object. This is not the sampling or experiment way of this dataset collecting.

Thus, data paired is checked as x.

3.1.3 Bartlett’s test for homogeneity of variance

Null hypothesis (H0): variances are equal
Alternative hypothesis (H1): variances are not equal

bartlett.test(data = df, Yield ~ Variety)

## 
##  Bartlett test of homogeneity of variances
## 
## data:  Yield by Variety
## Bartlett's K-squared = 0.86549, df = 1, p-value = 0.3522

Bartlett’s test is used for testing the homogeneity of variances. It is adapted for normally distributed data.

leveneTest(data = df, Yield ~ Variety)

## Levene's Test for Homogeneity of Variance (center = median)
##       Df F value Pr(>F)
## group  1  0.4946 0.4909
##       18

Levene’s test is a more robust alternative to Bartlett’s test when the distributions of the data are non-normal.

In this case, we will take Bartlett’s test result due to normally distributed data. P-value as greater than 0.05, we can not reject H0 and accept H0, which assumes variances are equal.

Thus, equal variance is checked as v.

3.1.4 Student’s two sample t-test

Normally distributed? - v
Data paired? - x
Equal variance? - v

It is clearly that the assumption of student’s two sample t-test is fulfilled, and this will be used to computed to test the statistical differences between its data.

t.test(df$Yield ~ df$Variety, var.equal = T)

## 
##  Two Sample t-test
## 
## data:  df$Yield by df$Variety
## t = -2.355, df = 18, p-value = 0.03007
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -23.746171  -1.353829
## sample estimates:
## mean in group Golden mean in group Merlin 
##                34.85                47.40

It is the student’s two sample t-test from R. It shows that the mean value of Merlin is higher than Golden in a significant difference as p-value < 0.05. We can reject H0 and accept H1, which there is a significant difference between Merlin and Golden.

In detail, this is a two tail two sample student’s t-test. As two tail, the hypothesis is that means is/not equal.

If having a more specific purpose for the test, we can set up one tail test, which the hypothesis is that means is/not greater/lesser.

t.test(df$Yield ~ df$Variety, var.equal = T,
       alternative = "less")

## 
##  Two Sample t-test
## 
## data:  df$Yield by df$Variety
## t = -2.355, df = 18, p-value = 0.01504
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
##      -Inf -3.30888
## sample estimates:
## mean in group Golden mean in group Merlin 
##                34.85                47.40

So, in this way, the difference in means (Golden - Merlin) is statistically difference less than 0 as p-value < 0.05. The p-value is only half of the two tail test.

3.2 One sample t-test

3.2.1 Assumpted Merlin as population

mer = df %>% filter(Variety == "Merlin") %>% select(Yield) %>% colMeans() %>% unname()
gol = df %>% filter(Variety == "Golden") %>% select(Yield)
t.test(gol, mu = mer, var.equal = T,
       alternative = "less")

## 
##  One Sample t-test
## 
## data:  gol
## t = -4.0114, df = 9, p-value = 0.001529
## alternative hypothesis: true mean is less than 47.4
## 95 percent confidence interval:
##      -Inf 40.58512
## sample estimates:
## mean of x 
##     34.85

In this example, we know that Golden yield in our sample is statistically lower than our assumpted population parameter mean (47.4) due to p-value < 0.05.

3.2.2 Assumpted Golden as population

mer = df %>% filter(Variety == "Merlin") %>% select(Yield)
gol = df %>% filter(Variety == "Golden") %>% select(Yield) %>% colMeans() %>% unname()
t.test(mer, mu = gol, var.equal = T,
       alternative = "greater")

## 
##  One Sample t-test
## 
## data:  mer
## t = 2.909, df = 9, p-value = 0.008669
## alternative hypothesis: true mean is greater than 34.85
## 95 percent confidence interval:
##  39.4917     Inf
## sample estimates:
## mean of x 
##      47.4

In this example, we know that Merlin yield in our sample is statistically higher than our assumpted population parameter mean (34.85) due to p-value < 0.05.

4 Conclusion

If Merlin production is the usual production rate of the farm, we said that production of both varieties was somewhat equal before herbicide application. Keeping all other environmental and human factors equal, we may say the reduction of Golden variety could be due to herbicide application with the support of the statistical results.

5 Reference

Statistical Analysis / 2021 / Kar Ng

T Test Analysis

Chun-Li Hou

July 05, 2021