Reading time: 16 minutes
This project uses a dataset collected from a cucumber genotype experiment. The experiment was carried out using an agriculture research system called “Latin Square Design” and the experiment was repeated in 2 locations. This type of experiment is commonly called multi-environment Latin Square Design. The main interest of this experiment is to statistically compare 4 different cucumber genotypes (or varieties).
Analysis in this projects were split into 3 sections. First section (section 7.1) imagined that the experiment was only conducted in the first location, the second section (section 7.2) imagined that the experiment was only conducted in the second location, and the last section (section 7.3) is an overall statistical analysis that analysed the overall dataset with the effect of locations (technically known as “environments”) being taken into account. Please note that the first two sections were just additional sections/works to demonstrate my analytically skill in analysising common Latin Square designed research system.
Final statistical result was achieved by a multi-environment Latin Square model that took environment aspects and other random effects into consideration when computing the statistical output. Result revealed that cucumber genotype “Dasher” was the best performing cucumber variety, followed by “Guardian”, “Sprint”, and “Poinsett” genotype. “Dasher” genotype was significantly higher than genotypes “Sprint” and “Poinsett” in yield, but it was not significantly higher than “Guardian” genotype.
Highlights
R packages loaded in this project include tidyverse packages (ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr, and forcats), skimr, lubridate, kableEtra, agridat, MASS, DescTools, DescTools, lme4, qqplotr, lsmeans, multcomp, and desplot.
library(tidyverse)
library(skimr)
library(lubridate)
library(kableExtra)
library(agridat)
library(MASS)
library(DescTools)
library(lme4)
library(lmerTest)
library(qqplotr)
library(lsmeans)
library(multcomp)
library(desplot)
library(DT)
This is a personal project uses a public dataset that has dual agricultural systems. The dataset was collected from an agriculture research system named “Latin Square” and the system was repeated in two locations. The experiment was carried out at Clemson university in South Carolina, and Tifton in Georgia in 1985.
The involvement of more than one research location is often referred as multi-environment, and the experiment is often called a multi-environment research system. The environment of the two locations might be different to a certain extent and would have noticeable or unnoticeable impact on the experiment. Identifying that it is a multi-environment system will help the researcher to take environment aspects into account during statistical analysis, and to quantify the impacts from these environments. This will help to achieve a more accurate statistical conclusion.
Latin Square Design on the other hand is a common technique applied to control known and unknown variations in the field. For example of known variations, Latin Square Design is commonly used when there are more than 2 sources of environmental variations. Latin Square Design splits the plot of Latin Square system in a way that all treatments will have a maximised chance experiencing these variations. A typical example would be that the (1) soil condition from one side is more favor to plant growth (such as higher in moisture) than in another side of the allocated research field, and (2) the experimental field is adjacent to the road. Position towards the sun also can be another variation in the field. To account for these variations that would potentially impact the experiment to an extent, a Latin Square design is adopted. Latin Square design can also be applied even if there is no known source of variation. Although there is no known source of variation discovered by the researchers’ initial assessment, a Latin Square can still be adopted as a system that ensuring each treatment is fairly split in the field to minimise impact from these hidden sources of variation.
When analysing a Latin Square system, if the Latin Square design have plots arranged based on known source of variations, a two-way analysis of variance can be adopted. If Latin Square design is adopted in a field without known source of variation, an one-way analysis of variance could be adopted. Actually, the data structure of a Latin Square designed experiment has “row” and “column” recorded to identify each plot in the field. Therefore, one can still adopt a two-way analysis of variance in a design that has no known source of variation to statistically find out are there differences between rows and columns. A mixed effect model can also be used for a Latin Square system that takes rows and columns as random effects. A major disadvantage of Latin Square is that it can only handle no more than 4 treatments.
This project will analyse the Latin Square system in two ways, one is a typical Latin Square design analysis for two of the locations and pretending that they are two unrelated experiment. It is just to demonstrate how I would analyse the data if I am given a research dataset from a Latin Square designed system. The other way (Final way) is a overall multi-environment Latin Square Design analysis. It is the true way of analysing this dataset by analysing all data together with consideration of environmental aspects and other random effects.
Following is the quick summary of this analysis project.
The dataset can be assessed through a R package named “agridat” and imported using following code (click the right button). Following table is the result of successful data import.
data("bridges.cucumber")
datatable(bridges.cucumber)
However, in order to show that I use SQL programming language, I have downloaded the dataset and uploaded it to a database platform named BigQuery. I will perform data checking and cleaning on the database with SQL programming language, although I can perform all of these steps in R. BigQuery is a widely used online database that allow users to work with data with SQL programming language.
Following picture shows that I have checked the dataset with SQL codes, and all columns turn out to be perfect and has no missing data, typos, duplicates, or unwanted spaces within strings or numerical data that need to be trimmed.
The dataset obviously has been cleaned by the dataset creator. In this section, I am using the SQL again to change the name of variables, it is an optional step but it comes with an advantage that the column name can be more understandable for readers from other industries.
Following picture shows how I changed the column names from:
During this operation, the additional column that needs to be cleaned, “int64_field_0”, has also been removed.
SQL codes:
After completing this step, the dataset is downloaded from the database.
The dataset is then uploaded to R using following codes. Following table is an indication of successful data import.
cucum <- read_csv("cucum_sqlcleaned.csv")
datatable(cucum)
Following table describes the variables in this dataset.
Variable <- c("location",
"genotype",
"row",
"column",
"yield")
Description <- c("The two locations involved in this research are Clemson and Tifton.",
"Four cucumber genotypes are Dasher, Guardian, Sprint, Poinsett.",
"The row id of the observation.",
"The column id of the observation.",
"Weight of marketable fruit per plot, unit not given.")
data.frame(Variable, Description) %>%
kbl() %>%
kable_styling(bootstrap_options = c("hover", "striped", "bordered"))
Variable | Description |
---|---|
location | The two locations involved in this research are Clemson and Tifton. |
genotype | Four cucumber genotypes are Dasher, Guardian, Sprint, Poinsett. |
row | The row id of the observation. |
column | The column id of the observation. |
yield | Weight of marketable fruit per plot, unit not given. |
Following table summarises the data structure of the dataset.
This cucumber dataset has 32 rows and 5 columns. Among the 5 columns, “location” and “genotype” are classified as “character” variables, whereas “row”, “column”, and “yield” are classified as “numeric” variable.
skim_without_charts(cucum)
Name | cucum |
Number of rows | 32 |
Number of columns | 5 |
_______________________ | |
Column type frequency: | |
character | 2 |
numeric | 3 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
location | 0 | 1 | 6 | 7 | 0 | 2 | 0 |
genotype | 0 | 1 | 6 | 8 | 0 | 4 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
---|---|---|---|---|---|---|---|---|---|
row | 0 | 1 | 2.50 | 1.14 | 1.0 | 1.75 | 2.50 | 3.25 | 4.00 |
column | 0 | 1 | 2.50 | 1.14 | 1.0 | 1.75 | 2.50 | 3.25 | 4.00 |
yield | 0 | 1 | 35.74 | 12.16 | 11.5 | 28.62 | 36.64 | 43.50 | 61.48 |
There is no missing value in the dataset by examining the columns n_missing and complete_rate. The complete_rate of all variables is 1, meaning 100% completion, and there is no missing value. However, it is a little bit tricky that if missing value are filled up into the dataset by the data uploader as string information such as “NULL” or “Missing Value”, then they will not be detected by this exploration method. However, this possibility has been eliminated as I have examine this possility in the SQL section. A part from that I can also see that there are no “white space” to trim.
To be more efficient in data exploration, I will convert the variable type of “location”, “genotype”, “row”, and “column” from their current type to factor type.
Here is another way to quicky glimpse the dataset and their type.
glimpse(cucum)
## Rows: 32
## Columns: 5
## $ location <chr> "Clemson", "Clemson", "Clemson", "Clemson", "Clemson", "Clems~
## $ genotype <chr> "Dasher", "Dasher", "Dasher", "Dasher", "Guardian", "Guardian~
## $ row <dbl> 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1~
## $ column <dbl> 3, 4, 2, 1, 4, 2, 1, 3, 1, 3, 4, 2, 2, 1, 3, 4, 3, 4, 2, 1, 4~
## $ yield <dbl> 44.2000, 54.1000, 47.2000, 36.7000, 33.0000, 13.6000, 44.1000~
“chr” in R stands for character variable type, “dbl” stands for double which is typically used for numerical variables in R, and the data type I am converting the “location”, “genotype”, “row”, and “column” into is called “factor”, or abbreviated as “fct”.
Following code complete the conversion (click the right button).
cucum <- cucum %>%
mutate(location = as.factor(location),
genotype = as.factor(genotype),
row = as.factor(row),
column = as.factor(column)
)
Glimpse the dataset again, the conversion has been successful.
glimpse(cucum)
## Rows: 32
## Columns: 5
## $ location <fct> Clemson, Clemson, Clemson, Clemson, Clemson, Clemson, Clemson~
## $ genotype <fct> Dasher, Dasher, Dasher, Dasher, Guardian, Guardian, Guardian,~
## $ row <fct> 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1~
## $ column <fct> 3, 4, 2, 1, 4, 2, 1, 3, 1, 3, 4, 2, 2, 1, 3, 4, 3, 4, 2, 1, 4~
## $ yield <dbl> 44.2000, 54.1000, 47.2000, 36.7000, 33.0000, 13.6000, 44.1000~
Following is the summary of the dataset.
summary(cucum)
## location genotype row column yield
## Clemson:16 Dasher :8 1:8 1:8 Min. :11.50
## Tifton :16 Guardian:8 2:8 2:8 1st Qu.:28.62
## Poinsett:8 3:8 3:8 Median :36.64
## Sprint :8 4:8 4:8 Mean :35.74
## 3rd Qu.:43.50
## Max. :61.48
A quick insight from the overall yield, I may stand a higher chance to be able to see significant difference in my later statistical analysis base on how far apart the minimum and maximum are from each other, as well as the data may be normally distributed base on how close are the median and mean to each other. These are just current assumptions, all will be tested during statistical analysis.
It will be interesting to plot out the typical layout of a Latin Square design. Please note it is a simulation by the software and not represent the true design from the experiment of this dataset by William Bridges 1989. Plot dimensions are not given.
There are 2 locations and 4 genotypes (or varieties) of cucumber in the dataset. A Latin Square layout would be like:
desplot(data = cucum,
form = yield ~ row*column | location,
text = genotype,
cex = 1,
strip.cex = 1,
col.regions = "white",
col.text = "orange",
main = "Figure 1: Cucumber - Multi-Environment Latin Square Design",
out2 = genotype,
out2.gpar = list(col = "grey", lwd = 0.5, lty = 2)
)
Insights:
Following desplot shows that 4 plots located in the bottom-left section of Clemson Latin Square is not favor for plant growth.
desplot(data = cucum,
form = yield ~ row * column | location,
text = yield,
cex = 1,
strip.cex = 1,
show.key = F,
main = "Figure 2: Cucumber - Multi-Environment Latin Square Design")
Insights:
Following figure visualise the yield of each cucumber genotype in the dataset in both location.
ggplot(cucum, aes(x = genotype, y = yield, colour = genotype, fill = genotype)) +
geom_boxplot(alpha = 0.5) +
stat_boxplot(geom = "errorbar") +
geom_point() +
stat_summary(fun.y = mean, shape = 4, geom = "point", size = 5, colour = "blue") +
facet_wrap(~location) +
theme_classic() +
theme(legend.position = "NA",
strip.text = element_text(size = 12),
strip.background = element_rect(fill = "grey90"),
plot.title = element_text(face = "bold", size = 13, vjust = 2)) +
labs(x = "Genotype",
y = "Yield (Weight)",
title = "Figure 3. Yield of different Cucumber Genotypes in Clemson and Tifton",
subtitle = "X in the boxplots represents: Average")
Insights:
My statistical analysis is partitioned into 3 sections:
The statistical analysis of this project can actually be completed by the third section alone. Feel free to skip to section 7.3.
However, I am completing the first 2 Latin Square design in their respective environment is just to demonstrate how I would analyse a non-multi-environment Latin square system.
This section imagines that the experiment only launched in Clemson.
Following code builds the model, the model is comprised of genotype, row and column. I use a two-ways ANOVA because it would be my interest to see how are yields statistically different from each other within rows and columns. Rows and columns are my variables of interest now especially after observing the problematic bottom-right corner of Clemson in figure 2.
# filter to get Clemson only
cucum_clemson <- cucum %>%
filter(location == "Clemson")
# Model for Clemson
latinmodel_Clem <- lm(yield ~ genotype + row + column,
data = cucum_clemson)
Normality test
Residuals generated from the model of the dataset are normally distributed, supported by,
cucum_clemson$resid <- resid(latinmodel_Clem)
ggplot(cucum_clemson, aes(sample = resid)) +
stat_qq_band() +
stat_qq_point() +
stat_qq_line()
shapiro.test(resid(latinmodel_Clem))
##
## Shapiro-Wilk normality test
##
## data: resid(latinmodel_Clem)
## W = 0.95458, p-value = 0.5655
Variance test
Levene tests are carried out for each of the fixed effect in the model to test for variance homogeneity. Results show that all variances in each of the fixed effect variable (genotype, row, and column) are homogeneous.
LeveneTest(yield ~ genotype, cucum) # equal variance
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 3 0.0868 0.9667
## 28
LeveneTest(yield ~ row, cucum) # equal variance
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 3 0.62 0.6079
## 28
LeveneTest(yield ~ column, cucum) # equal variance
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 3 0.932 0.4383
## 28
Base on the assumption tests, a two-way anova is carried out. Result shows that only yields among genotypes are significantly different from each other with a P-value of close to 0.01.
anova(latinmodel_Clem)
## Analysis of Variance Table
##
## Response: yield
## Df Sum Sq Mean Sq F value Pr(>F)
## genotype 3 1316.80 438.93 9.3683 0.01110 *
## row 3 528.35 176.12 3.7589 0.07872 .
## column 3 411.16 137.05 2.9252 0.12197
## Residuals 6 281.12 46.85
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Following post-hoc test with P-value adjusted by Tukey method (alpha = 0.05) shows that genotype Dasher has the highest mean yield, and is significantly different from Sprint and Poinsett. There is no significant difference among Guardian, Sprint, and Poinsett Cucumber genotype.
ls_clem <- lsmeans(latinmodel_Clem, ~genotype)
lscld_clem <- cld(ls_clem, Letters = LETTERS, reversed = T)
lscld_clem
## genotype lsmean SE df lower.CL upper.CL .group
## Dasher 45.5 3.42 6 37.2 53.9 A
## Guardian 31.6 3.42 6 23.3 40.0 AB
## Sprint 25.9 3.42 6 17.6 34.3 B
## Poinsett 21.4 3.42 6 13.1 29.8 B
##
## Results are averaged over the levels of: row, column
## Confidence level used: 0.95
## P value adjustment: tukey method for comparing a family of 4 estimates
## significance level used: alpha = 0.05
A statistical graph that visualise the lower and upper confidence level with the significance letters will be helpful.
ggplot(lscld_clem, aes(x = reorder(genotype, -lsmean), y = lsmean, colour = genotype)) +
geom_point(size = 4) +
geom_errorbar(aes(ymax = upper.CL, ymin = lower.CL)) +
geom_text(aes(label = .group), vjust = -5) +
theme_classic() +
labs(x = "Cucumber genotype",
y = "Yield (Weight in certain unit)",
title = "Cl Plot: Clemson",
subtitle = "Letters that overlap each other are not significant different") +
scale_y_continuous(lim = c(0, 60), breaks = seq(0, 60, 10))
This section imagines that the experiment only launched in Tifton.
Following code builds the model. A one-way ANOVA will be sufficient, as there is only one treatment now, which is the cucumber genotype and are completely randomised within the plot of Tifton in the style of Latin Square Design (Figure 2). Additionally, there is no any problematic trend in the plots within the Latin Square design of Tifton that would indicate differences may arise from row to row, or column to column.
# Filter out for a df that has only Tifton
cucum_tifton <- cucum %>%
filter(location == "Tifton")
# build the model
latinmodel_tif <- lm(yield ~ genotype, data = cucum_tifton)
Normality test
Residuals generated from the model of the dataset are normally distributed, supported by,
cucum_tifton$resid <- resid(latinmodel_tif)
ggplot(cucum_tifton, aes(sample = resid)) +
stat_qq_band() +
stat_qq_line() +
stat_qq_point()
shapiro.test(resid(latinmodel_tif))
##
## Shapiro-Wilk normality test
##
## data: resid(latinmodel_tif)
## W = 0.98136, p-value = 0.9734
Variance test
Levene’s tests are carried out for the fixed effect in the model to test for variance homogeneity. The test result shows that the variances among the genotypes of cucumber are equal, with a P-value of 0.9667 (> P-val 0.05).
LeveneTest(yield ~ genotype, cucum) # equal variance
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 3 0.0868 0.9667
## 28
A single ANOVA table shows that yield among the genotypes of cucumber is significantly different from each other, with a P-value of 0.04588 (higher than P-value of 0.05).
anova(latinmodel_tif)
## Analysis of Variance Table
##
## Response: yield
## Df Sum Sq Mean Sq F value Pr(>F)
## genotype 3 642.2 214.068 3.5301 0.04855 *
## Residuals 12 727.7 60.642
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Dasher is the cucumber genotype that has the highest yield and is significantly higher than Poinsett genotype. However, it is not significantly different from Sprint and Guardian genotypes. Sprint and Guardian cucumber genotypes are not significantly higher than Poinsett.
ls_tif <- lsmeans(latinmodel_tif, ~genotype)
lscld_tif <- cld(ls_tif, Letters = LETTERS, reversed = T)
lscld_tif
## genotype lsmean SE df lower.CL upper.CL .group
## Dasher 50.5 3.89 12 42.0 59.0 A
## Sprint 39.2 3.89 12 30.7 47.7 AB
## Guardian 38.7 3.89 12 30.2 47.2 AB
## Poinsett 33.0 3.89 12 24.5 41.5 B
##
## Confidence level used: 0.95
## P value adjustment: tukey method for comparing a family of 4 estimates
## significance level used: alpha = 0.05
A confidence interval plot (CI plot) to visualise the post-hoc result of Tifton.
ggplot(lscld_tif, aes(x = reorder(genotype, -lsmean), y = lsmean, colour = genotype)) +
geom_point(size = 4) +
geom_errorbar(aes(ymax = upper.CL, ymin = lower.CL)) +
geom_text(aes(label = .group), vjust = -6) +
theme_classic() +
labs(x = "Cucumber genotype",
y = "Yield (Weight in certain unit)",
title = "Cl Plot: Tifton",
subtitle = "Letters that overlap each other are not significant different") +
scale_y_continuous(lim = c(20, 70), breaks = seq(20, 70, 10))
This table summarises the results from section 7.1 and 7.2 that imagining the cucumber research were studied only in respective location.
Again, as mentioned earlier, the section 7.1 and 7.2 are just additional works to demonstrate how I would analyse individual Latin Square design. The entire statistical analysis can be accomplished in this section 7.3. This section applies multi-environment analysis and the results might be differ from the above table.
A full random model is built using following codes. In this model, I include excessive possible random effects in the dataset into the model, which include the genotype, location, and the interaction between location with genotype, row, and column, as well as interaction between genotype of row and column.
FRM <- lmer(yield ~
(1|genotype) +
(1|location) +
(1|genotype:location) +
(1|row:location) +
(1|column:location) +
(1|genotype:row) +
(1|genotype:column),
data = cucum
)
summary(FRM)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: yield ~ (1 | genotype) + (1 | location) + (1 | genotype:location) +
## (1 | row:location) + (1 | column:location) + (1 | genotype:row) +
## (1 | genotype:column)
## Data: cucum
##
## REML criterion at convergence: 227.1
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -1.72288 -0.57428 -0.01889 0.42454 1.87095
##
## Random effects:
## Groups Name Variance Std.Dev.
## genotype:column (Intercept) 5.470e-06 0.002339
## genotype:row (Intercept) 3.752e-06 0.001937
## column:location (Intercept) 1.817e+01 4.262939
## row:location (Intercept) 3.170e+01 5.630708
## genotype:location (Intercept) 0.000e+00 0.000000
## genotype (Intercept) 7.404e+01 8.604662
## location (Intercept) 2.801e+01 5.292454
## Residual 3.113e+01 5.579125
## Number of obs: 32, groups:
## genotype:column, 16; genotype:row, 16; column:location, 8; row:location, 8; genotype:location, 8; genotype, 4; location, 2
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 35.743 6.303 2.737 5.671 0.0139 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## optimizer (nloptwrap) convergence code: 0 (OK)
## boundary (singular) fit: see ?isSingular
The result of this model shows how well each possible group in the model in explaining the variances in the model. The comparison is made by proportion of variances by the overall variances in the group using acquired variances from the above statistical result, the result of the comparion is shown below.
FRM_var <- as.data.frame(VarCorr(FRM))
FRM_var %>%
mutate(pct_vcov = paste0(round(vcov / sum(vcov)*100,2), "%"),
pct_vcov = format(pct_vcov, scientific = F)) %>%
arrange(desc(vcov)) %>%
dplyr::select(grp, pct_vcov) %>%
kbl(align = "c") %>%
kable_classic("hover")
grp | pct_vcov |
---|---|
genotype | 40.45% |
row:location | 17.32% |
Residual | 17% |
location | 15.3% |
column:location | 9.93% |
genotype:column | 0% |
genotype:row | 0% |
genotype:location | 0% |
Insights:
This section calculate repeatability of cucumber genotype. following information are extracted from the above statistical result (section 7.3.1):
# Var_X1/(Var_X1 + (Var_X1:Location/nLoc) + Var_Res/nLoc*nRep))
74.02/(74.02 + (0.00/2) + 31.13/(1*4))
## [1] 0.9048623
Estimated cucumber repeatability is 90.4%. This high level of repeatability indicating that the genotypes are highly type performing, and the environment involved in the experiment could be very stable. This high level of repeatability also saying that any relevant, similar study in the future, I would get the same result. However, this repeatability can be affected by management and environmental conditions that the future similar study has.
Building the Mixed Effect Model
MEM <- lmer(yield ~
genotype +
(1|location) +
(1|genotype:location) +
(1|row:location) +
(1|column:location),
data = cucum)
Assumption tests
Assumptions of residual normality and variance homogeneity are met from following tests.
shapiro.test(resid(MEM))
##
## Shapiro-Wilk normality test
##
## data: resid(MEM)
## W = 0.98206, p-value = 0.8569
cucum$resid <- resid(MEM)
ggplot(cucum, aes(sample = resid)) +
stat_qq_band() +
stat_qq_line() +
stat_qq_point()
LeveneTest(yield ~ genotype, cucum)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 3 0.0868 0.9667
## 28
Omnibus test
With the results of assumption test above, ANOVA is selected to summarise the model. Results show that there is significant difference in yield among cucumber genotype (variety). It allows the next post-hoc test to be carried out to check how are each genotype different from each other.
anova(MEM)
## Type III Analysis of Variance Table with Satterthwaite's method
## Sum Sq Mean Sq NumDF DenDF F value Pr(>F)
## genotype 1869.8 623.28 3 15 20.023 1.673e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Post-Hoc test
Post-hoc test that using the P-value adjusted by Tukey method with a significance level of alpha = 0.05 has a result that “Dasher” cucumber genotype is the most performing variety and is significantly higher in yield than “Sprint” and “Poinsett”, but is not significantly different from “Guardian” genotype.
ls_MEM <- lsmeans(MEM, ~genotype)
lscld_MEM <- cld(ls_MEM, Letters = LETTERS, reversed = T)
lscld_MEM
## genotype lsmean SE df lower.CL upper.CL .group
## Dasher 48.0 4.91 1.29 10.48 85.6 A
## Guardian 35.2 4.91 1.29 -2.36 72.7 AB
## Sprint 32.6 4.91 1.29 -4.97 70.1 B
## Poinsett 27.2 4.91 1.29 -10.31 64.8 B
##
## Degrees-of-freedom method: kenward-roger
## Confidence level used: 0.95
## P value adjustment: tukey method for comparing a family of 4 estimates
## significance level used: alpha = 0.05
This is a table summarises the final statistical result from section 7.1, 7.2, and 7.3. The unit of mean was not given by the original dataset but it can be assumed to be a unit of weight of marketable fruit per plot.
Section 7.1 imagines that this experiment was carried out in Clemson only. The section only uses Clemson data. The result shows that Dasher is the best performing cucumber genotype (or variety). It is significantly higher in yield than Sprint and Poinsett.
Section 7.2 imagines that this experiment was carried out in Tifton only. The section only uses Tifton data. The result shows that Dasher is the best performing cucumber genotype (or variety). It is significantly higher in yield than Poinsett only.
Section 7.3 applies multi-environment Latin Square model that takes environment aspects and other random effects into consideration when computing the statistical results. The model uses data from both location. The result shows that Dasher is the best performing cucumber geneotype, followed by Guardian, Sprint, the Poinsett. Dasher is significantly higher than Sprint and Poinsett in yield, but not significantly higher than Guardian genotype.
Although it is not significant, but all cucumber genotypes had better yield in Tifton than in Clemson.
Thank you for reading
This is a personal project created and designed for skill demonstration and non-commercial use only. All photos in this project, such as those that applied as the thumbnail are only for demonstration, they are not related to the dataset, the location where the data were collected, and the results of this analysis.
Ausmus S 2007, “Photograph of cucumber vine with fruits, flowers and leaves visible”, viewed 13 July 2021, https://en.wikipedia.org/wiki/Cucumber#/media/File:ARS_cucumber.jpg
William Bridges (1989). Analysis of a plant breeding experiment with heterogeneous variances using mixed model equations. Applications of mixed models in agriculture and related disciplines, S. Coop. Ser. Bull, 45–51.