In this study, we can practice the knowledge about the t-test from assumption identified to hypothesis established.
if(!require("pacman")) install.packages("pacman")
pacman::p_load(dplyr, kableExtra, skimr, ggplot2, GGally, ggsci, car)
theme = theme_bw() +
theme(plot.title = element_text(face = "bold", size = (15)),
plot.subtitle = element_text(size = (10)),
axis.title = element_text(size = (10))) +
theme(axis.text.x = element_text(angle = 0), legend.position = "none")
Let us set up the working environment and be ready for the analysis.
df = read.csv("DATA.csv",
stringsAsFactors = T)
df$LandID = as.factor(df$LandID)
df %>%
kbl(align = "c",
caption = "Sugarbeets Yield on Herbicide Environment Dataset") %>%
kable_classic("hover", "bordered", "striped")
LandID | Variety | Yield |
---|---|---|
1 | Merlin | 76.2 |
1 | Merlin | 49.6 |
1 | Merlin | 61.6 |
1 | Merlin | 49.8 |
1 | Merlin | 37.0 |
2 | Merlin | 44.8 |
2 | Merlin | 28.7 |
2 | Merlin | 49.6 |
2 | Merlin | 38.6 |
2 | Merlin | 38.1 |
3 | Golden | 47.4 |
3 | Golden | 19.8 |
3 | Golden | 37.4 |
3 | Golden | 34.2 |
3 | Golden | 34.0 |
4 | Golden | 47.0 |
4 | Golden | 24.9 |
4 | Golden | 24.6 |
4 | Golden | 32.8 |
4 | Golden | 46.4 |
Two kinds of sugar beet plant on five different lands separately. Both have been producing relatively similar yields historically. As sprayed lands with the herbicide last month, both of the sugar beets are still yielding but some differences have been noticed.
Name | df |
Number of rows | 20 |
Number of columns | 3 |
_______________________ | |
Column type frequency: | |
factor | 2 |
numeric | 1 |
________________________ | |
Group variables | None |
Variable type: factor
skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|
LandID | 0 | 1 | FALSE | 4 | 1: 5, 2: 5, 3: 5, 4: 5 |
Variety | 0 | 1 | FALSE | 2 | Gol: 10, Mer: 10 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
---|---|---|---|---|---|---|---|---|---|
Yield | 0 | 1 | 41.12 | 13.27 | 19.8 | 33.7 | 38.35 | 47.95 | 76.2 |
Here is the summary of the dataset.
ggpairs(df,
columns = c("Yield", "Variety", "LandID"),
aes(color = Variety,
alpha = 0.5)) +
theme +
theme(axis.text.x = element_blank(),
axis.text.y = element_blank(),
axis.ticks.x = element_blank(),
axis.ticks.y = element_blank()) +
scale_fill_locuszoom() +
labs(caption = "RD: Merlin\nOG: Golden")
The ensemble plot can tell the distribution and structure in yield with different groups. It appears that the statistics of Merlin variety is doing better than the Golden variety. We will apply statistical tests to prove the statistical difference or not.
a = c("Student’s two sample t-test",
"Welch’s t-test",
"Paired t-test",
"Wilcoxon Matched Pairs",
"Mann Whitney test")
b = c("v", "v", "v", "x", "x")
c = c("x", "x", "v", "v", "x")
d = c("v", "x", "\\-", "\\-", "\\-")
data.frame(a, b, c, d) %>%
rename("Two sample t-tests" = "a",
"Normally distributed?" = "b",
"Paired?" = "c",
"Equal variance?" = "d") %>%
kbl(align = "l",
caption = "v: Yes, x: No, -: Not required") %>%
kable_classic("hover", "bordered", "striped")
Two sample t-tests | Normally distributed? | Paired? | Equal variance? |
---|---|---|---|
Student’s two sample t-test | v | x | v |
Welch’s t-test | v | x | x |
Paired t-test | v | v | - |
Wilcoxon Matched Pairs | x | v | - |
Mann Whitney test | x | x | - |
The table summarizes the three conditions to be fulfilled before deciding which t-test to use.
We can apply additional tests for 1 and 3. Whereas for 2, we can only decide based on the experimental design for sampling.
## df$Variety: Golden
##
## Shapiro-Wilk normality test
##
## data: dd[x, ]
## W = 0.91458, p-value = 0.314
##
## ------------------------------------------------------------
## df$Variety: Merlin
##
## Shapiro-Wilk normality test
##
## data: dd[x, ]
## W = 0.92652, p-value = 0.4145
P-value of both Golden and Merlin sugar beet varieties show that the results are not significant and thus we can not reject H0 and accept H0, in which both yields of Golden and Merlin are normally distributed.
As long as one variable is not normally distributed, the result of this text is considered not normally distributed.
Thus, normally distributed is checked as v.
Paired sampling is before/after the experiment. Data are measured in one object. This is not the sampling or experiment way of this dataset collecting.
Thus, data paired is checked as x.
##
## Bartlett test of homogeneity of variances
##
## data: Yield by Variety
## Bartlett's K-squared = 0.86549, df = 1, p-value = 0.3522
Bartlett’s test is used for testing the homogeneity of variances. It is adapted for normally distributed data.
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 1 0.4946 0.4909
## 18
Levene’s test is a more robust alternative to Bartlett’s test when the distributions of the data are non-normal.
In this case, we will take Bartlett’s test result due to normally distributed data. P-value as greater than 0.05, we can not reject H0 and accept H0, which assumes variances are equal.
Thus, equal variance is checked as v.
It is clearly that the assumption of student’s two sample t-test is fulfilled, and this will be used to computed to test the statistical differences between its data.
##
## Two Sample t-test
##
## data: df$Yield by df$Variety
## t = -2.355, df = 18, p-value = 0.03007
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -23.746171 -1.353829
## sample estimates:
## mean in group Golden mean in group Merlin
## 34.85 47.40
It is the student’s two sample t-test from R. It shows that the mean value of Merlin is higher than Golden in a significant difference as p-value < 0.05. We can reject H0 and accept H1, which there is a significant difference between Merlin and Golden.
In detail, this is a two tail two sample student’s t-test. As two tail, the hypothesis is that means is/not equal.
If having a more specific purpose for the test, we can set up one tail test, which the hypothesis is that means is/not greater/lesser.
##
## Two Sample t-test
##
## data: df$Yield by df$Variety
## t = -2.355, df = 18, p-value = 0.01504
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
## -Inf -3.30888
## sample estimates:
## mean in group Golden mean in group Merlin
## 34.85 47.40
So, in this way, the difference in means (Golden - Merlin) is statistically difference less than 0 as p-value < 0.05. The p-value is only half of the two tail test.
mer = df %>% filter(Variety == "Merlin") %>% select(Yield) %>% colMeans() %>% unname()
gol = df %>% filter(Variety == "Golden") %>% select(Yield)
t.test(gol, mu = mer, var.equal = T,
alternative = "less")
##
## One Sample t-test
##
## data: gol
## t = -4.0114, df = 9, p-value = 0.001529
## alternative hypothesis: true mean is less than 47.4
## 95 percent confidence interval:
## -Inf 40.58512
## sample estimates:
## mean of x
## 34.85
In this example, we know that Golden yield in our sample is statistically lower than our assumpted population parameter mean (47.4) due to p-value < 0.05.
mer = df %>% filter(Variety == "Merlin") %>% select(Yield)
gol = df %>% filter(Variety == "Golden") %>% select(Yield) %>% colMeans() %>% unname()
t.test(mer, mu = gol, var.equal = T,
alternative = "greater")
##
## One Sample t-test
##
## data: mer
## t = 2.909, df = 9, p-value = 0.008669
## alternative hypothesis: true mean is greater than 34.85
## 95 percent confidence interval:
## 39.4917 Inf
## sample estimates:
## mean of x
## 47.4
In this example, we know that Merlin yield in our sample is statistically higher than our assumpted population parameter mean (34.85) due to p-value < 0.05.
If Merlin production is the usual production rate of the farm, we said that production of both varieties was somewhat equal before herbicide application. Keeping all other environmental and human factors equal, we may say the reduction of Golden variety could be due to herbicide application with the support of the statistical results.