1 Introduction

In this study, we can practice the knowledge about the t-test from assumption identified to hypothesis established.

2 Preparation

2.1 Environment

if(!require("pacman")) install.packages("pacman")
pacman::p_load(dplyr, kableExtra, skimr, ggplot2, GGally, ggsci, car)
theme = theme_bw() +
  theme(plot.title = element_text(face = "bold", size = (15)),
        plot.subtitle = element_text(size = (10)),
        axis.title = element_text(size = (10))) +
  theme(axis.text.x = element_text(angle = 0), legend.position = "none")

Let us set up the working environment and be ready for the analysis.

2.2 Dataset

df = read.csv("DATA.csv",
              stringsAsFactors = T)
df$LandID = as.factor(df$LandID)
df %>% 
  kbl(align = "c",
      caption = "Sugarbeets Yield on Herbicide Environment Dataset") %>% 
  kable_classic("hover", "bordered", "striped")
Sugarbeets Yield on Herbicide Environment Dataset
LandID Variety Yield
1 Merlin 76.2
1 Merlin 49.6
1 Merlin 61.6
1 Merlin 49.8
1 Merlin 37.0
2 Merlin 44.8
2 Merlin 28.7
2 Merlin 49.6
2 Merlin 38.6
2 Merlin 38.1
3 Golden 47.4
3 Golden 19.8
3 Golden 37.4
3 Golden 34.2
3 Golden 34.0
4 Golden 47.0
4 Golden 24.9
4 Golden 24.6
4 Golden 32.8
4 Golden 46.4

Two kinds of sugar beet plant on five different lands separately. Both have been producing relatively similar yields historically. As sprayed lands with the herbicide last month, both of the sugar beets are still yielding but some differences have been noticed.

skim_without_charts(df)
Data summary
Name df
Number of rows 20
Number of columns 3
_______________________
Column type frequency:
factor 2
numeric 1
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
LandID 0 1 FALSE 4 1: 5, 2: 5, 3: 5, 4: 5
Variety 0 1 FALSE 2 Gol: 10, Mer: 10

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100
Yield 0 1 41.12 13.27 19.8 33.7 38.35 47.95 76.2

Here is the summary of the dataset.

ggpairs(df,
        columns = c("Yield", "Variety", "LandID"),
        aes(color = Variety,
            alpha = 0.5)) +
  theme +
  theme(axis.text.x = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks.x = element_blank(),
        axis.ticks.y = element_blank()) +
  scale_fill_locuszoom() +
  labs(caption = "RD: Merlin\nOG: Golden")

The ensemble plot can tell the distribution and structure in yield with different groups. It appears that the statistics of Merlin variety is doing better than the Golden variety. We will apply statistical tests to prove the statistical difference or not.

3 Statistical Analysis

3.1 Two sample t-test

a = c("Student’s two sample t-test",
      "Welch’s t-test",
      "Paired t-test",
      "Wilcoxon Matched Pairs",
      "Mann Whitney test")
b = c("v", "v", "v", "x", "x")
c = c("x", "x", "v", "v", "x")
d = c("v", "x", "\\-", "\\-", "\\-")

data.frame(a, b, c, d) %>% 
  rename("Two sample t-tests" = "a",
         "Normally distributed?" = "b",
         "Paired?" = "c",
         "Equal variance?" = "d") %>% 
  kbl(align = "l",
      caption = "v: Yes, x: No, -: Not required") %>% 
  kable_classic("hover", "bordered", "striped")
v: Yes, x: No, -: Not required
Two sample t-tests Normally distributed? Paired? Equal variance?
Student’s two sample t-test v x v
Welch’s t-test v x x
Paired t-test v v -
Wilcoxon Matched Pairs x v -
Mann Whitney test x x -

The table summarizes the three conditions to be fulfilled before deciding which t-test to use.

  1. Are both variables normally distributed?
  2. Are the samples in a paired trial? Commonly known as before/after sample objects experiencing two conditions.
  3. Do they have equal variance?

We can apply additional tests for 1 and 3. Whereas for 2, we can only decide based on the experimental design for sampling.

3.1.1 Shapiro-wilk test of normality

  • Null hypothesis (H0): normally distributed
  • Alternative hypothesis (H1): not normally distributed
by(df$Yield, df$Variety, shapiro.test)
## df$Variety: Golden
## 
##  Shapiro-Wilk normality test
## 
## data:  dd[x, ]
## W = 0.91458, p-value = 0.314
## 
## ------------------------------------------------------------ 
## df$Variety: Merlin
## 
##  Shapiro-Wilk normality test
## 
## data:  dd[x, ]
## W = 0.92652, p-value = 0.4145

P-value of both Golden and Merlin sugar beet varieties show that the results are not significant and thus we can not reject H0 and accept H0, in which both yields of Golden and Merlin are normally distributed.

As long as one variable is not normally distributed, the result of this text is considered not normally distributed.

Thus, normally distributed is checked as v.

3.1.2 Deciding are the samples paired

Paired sampling is before/after the experiment. Data are measured in one object. This is not the sampling or experiment way of this dataset collecting.

Thus, data paired is checked as x.

3.1.3 Bartlett’s test for homogeneity of variance

  • Null hypothesis (H0): variances are equal
  • Alternative hypothesis (H1): variances are not equal
bartlett.test(data = df, Yield ~ Variety)
## 
##  Bartlett test of homogeneity of variances
## 
## data:  Yield by Variety
## Bartlett's K-squared = 0.86549, df = 1, p-value = 0.3522

Bartlett’s test is used for testing the homogeneity of variances. It is adapted for normally distributed data.

leveneTest(data = df, Yield ~ Variety)
## Levene's Test for Homogeneity of Variance (center = median)
##       Df F value Pr(>F)
## group  1  0.4946 0.4909
##       18

Levene’s test is a more robust alternative to Bartlett’s test when the distributions of the data are non-normal.

In this case, we will take Bartlett’s test result due to normally distributed data. P-value as greater than 0.05, we can not reject H0 and accept H0, which assumes variances are equal.

Thus, equal variance is checked as v.

3.1.4 Student’s two sample t-test

  1. Normally distributed? - v
  2. Data paired? - x
  3. Equal variance? - v

It is clearly that the assumption of student’s two sample t-test is fulfilled, and this will be used to computed to test the statistical differences between its data.

t.test(df$Yield ~ df$Variety, var.equal = T)
## 
##  Two Sample t-test
## 
## data:  df$Yield by df$Variety
## t = -2.355, df = 18, p-value = 0.03007
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -23.746171  -1.353829
## sample estimates:
## mean in group Golden mean in group Merlin 
##                34.85                47.40

It is the student’s two sample t-test from R. It shows that the mean value of Merlin is higher than Golden in a significant difference as p-value < 0.05. We can reject H0 and accept H1, which there is a significant difference between Merlin and Golden.

In detail, this is a two tail two sample student’s t-test. As two tail, the hypothesis is that means is/not equal.

If having a more specific purpose for the test, we can set up one tail test, which the hypothesis is that means is/not greater/lesser.

t.test(df$Yield ~ df$Variety, var.equal = T,
       alternative = "less")
## 
##  Two Sample t-test
## 
## data:  df$Yield by df$Variety
## t = -2.355, df = 18, p-value = 0.01504
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
##      -Inf -3.30888
## sample estimates:
## mean in group Golden mean in group Merlin 
##                34.85                47.40

So, in this way, the difference in means (Golden - Merlin) is statistically difference less than 0 as p-value < 0.05. The p-value is only half of the two tail test.

3.2 One sample t-test

3.2.1 Assumpted Merlin as population

mer = df %>% filter(Variety == "Merlin") %>% select(Yield) %>% colMeans() %>% unname()
gol = df %>% filter(Variety == "Golden") %>% select(Yield)
t.test(gol, mu = mer, var.equal = T,
       alternative = "less")
## 
##  One Sample t-test
## 
## data:  gol
## t = -4.0114, df = 9, p-value = 0.001529
## alternative hypothesis: true mean is less than 47.4
## 95 percent confidence interval:
##      -Inf 40.58512
## sample estimates:
## mean of x 
##     34.85

In this example, we know that Golden yield in our sample is statistically lower than our assumpted population parameter mean (47.4) due to p-value < 0.05.

3.2.2 Assumpted Golden as population

mer = df %>% filter(Variety == "Merlin") %>% select(Yield)
gol = df %>% filter(Variety == "Golden") %>% select(Yield) %>% colMeans() %>% unname()
t.test(mer, mu = gol, var.equal = T,
       alternative = "greater")
## 
##  One Sample t-test
## 
## data:  mer
## t = 2.909, df = 9, p-value = 0.008669
## alternative hypothesis: true mean is greater than 34.85
## 95 percent confidence interval:
##  39.4917     Inf
## sample estimates:
## mean of x 
##      47.4

In this example, we know that Merlin yield in our sample is statistically higher than our assumpted population parameter mean (34.85) due to p-value < 0.05.

4 Conclusion

If Merlin production is the usual production rate of the farm, we said that production of both varieties was somewhat equal before herbicide application. Keeping all other environmental and human factors equal, we may say the reduction of Golden variety could be due to herbicide application with the support of the statistical results.

5 Reference