Factorial Completely Randomized Design is an experimental design in which the treatment is formed by a combination of the levels of several factors. Factorial treatment structures are useful for examining the effects of two or more factors on a response y, whether or not interaction exists. The treatment design of the multi-factor experiment was differentiated based on the level of importance and restrictions on randomization of the levels of each factor making up the treatment.
factorial_design <- read.csv("factorial.csv")
attach(factorial_design)
head(factorial_design,10)
Pestcide_A Variety_B Response
1 a_1 b_1 49
2 a_2 b_1 50
3 a_3 b_1 43
4 a_4 b_1 53
5 a_1 b_1 39
6 a_2 b_1 55
7 a_3 b_1 38
8 a_4 b_1 48
9 a_1 b_2 55
10 a_2 b_2 67
model_aov <- aov(Response ~ Pestcide_A*Variety_B, data = factorial_design)
summary(model_aov)
Df Sum Sq Mean Sq F value Pr(>F)
Pestcide_A 3 2227 742.5 17.556 0.00011 ***
Variety_B 2 3996 1998.0 47.244 2.05e-06 ***
Pestcide_A:Variety_B 6 457 76.2 1.801 0.18168
Residuals 12 508 42.3
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
knitr::include_graphics("factorial.png")
The first test of significance must be to test for an interaction between factors A and B, because if the interaction is significant then the main effects may have no interpretation
Interaction AB:The computed value of F=1.801 doesn’t exceed the tabulated value of 3.00 for alpha=0.05, df1=6, and df2=12 in the F tables. Hence, we have insufficient evidence to indicate an interaction between pesticide levels and variety of trees levels. Because the interaction is not significant, we can next test the main effects of the two factors.
Factor A:The computed value of F=17.556 does exceed the tabulated value of 3.49 for alpha=0.05, df1=3, and df2=12 in the F tables. Hence, we have sufficient evidence to indicate a significant difference in the mean yields among the four pesticide levels (Factor A).
Factor B:The computed value of F=47.244 does exceed the tabulated value of 3.89 for alpha=0.05, df1=2, and df2=12 in the F tables. Hence, we have sufficient evidence to indicate a significant difference in the mean yields among the three varieties of citrus trees. (Factor B).
knitr::include_graphics("split.png")
The main purpose of this tutorial is to demonstrate my experimentation and analytically skills using R programming language. This project is about analyzing an oat dataset collected from a Split-plot research system. It is a common but more advanced research method for a factorial agricultural experiment that works on multiple independent variables. The experiment was carried out in a field. The experimentation field was spitted into block, plot, and subplot.
There are 3 independent variables in the dataset with various levels. They are 6 blocks, 3 oat varieties, and 4 nutrient levels. Responding variable was plant yield in kilogram per hectare. Exploratory data analysis (EDA) was carried out to explore general trends of the dataset, then followed by statistical analyses. This project uses a mixed-effect model to analyse significant differences among oat varieties, nutrient levels (manurial nitrogen) and their interactions. Statistical assumptions were tests using Q-Q plot, Shapiro-Wilk test, and Levene’s test. Results from the test shown normality among residuals and variances among treatment groups were the same. ANOVA and Tukey were selected as the omnibus and post-hoc test.
Results reveal that there is no significant yield difference among oat varieties. However, there is significant yield difference between nutrient levels. There is no significant interaction effect between oat varieties and nutrient levels. If a decision has to be made, my top 3 best combinations selected from the dataset will be the three varieties, Marvellous, Golden rain, and Victory, with 50 kilogram per hectare of manurial nitrogen application. It is the second highest nutrient level from the experiment. I am not picking the highest nutrient level that had the best yield because the yield though was higher but is not significantly higher than my top 3 recommendations. These recommendations may also more beneficial for farm economics and surrounding environment.
knitr::include_graphics("highlight.png")
R packages loaded in this project include tidyverse packages (ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr, and forcats), skimr, lubridate, kableExtra, agridat, MASS, DescTools, DescTools, lme4, qqplotr, lsmeans, and multcomp.
library(systemfonts)
library(tidyverse)
library(skimr)
library(lubridate)
library(kableExtra)
library(agridat)
library(MASS)
library(DescTools)
library(lme4)
library(lmerTest)
library(qqplotr)
library(lsmeans)
library(multcomp)
This personal project uses a public dataset from an oat field experiment, the data was available for download from a R package called MASS. The objective of this project is to demonstrate how I successfully analyzed the data using statistical methods to find out the optimum nutrient levels on different variety of oats, with incorporation of the effects of blocks and plots.
The experiment was carried out in a field with a famous agricultural experimentation system called split plot. This method splits the entire allocated field into different blocks and further subdividing into smaller plots and even smaller subplots. In the dataset, there are 3 oat varieties and 4 levels of manurial treatment (nutrient level).
The experiment was,
A sketch to visualize the arrangement of blocks. The arrangement of each block is often subjected to the direction of gradient of identified heterogeneous environment.
knitr::include_graphics("split2.png")
I am expecting all treatments were randomised within plot. It is a usual way to minimize the impact of heterogeneous environmental conditions for more accurate results. For example, if the left corner has more soil moisture than the right corner, then having a randomised plots would avoid particular treatment be assigned to the heterogeneous corner and generating unfairn data.
knitr::include_graphics("split3.png")
Fundamentals of split-plot and RCBD are similar. RCBD (Randomized Complete Block Design) stops when the split achieves plots, whereas split plot further splits the plots into subplots. They are widely-used experimental designs in agriculture research to minimize heterogeneous field conditions. These conditions can be sporadic soil water availability, inherent nutrient contents, soil chemistry, or other environmental conditions in the research field. These effects on the crops can be ranged from small to large. Usually, when a heterogeneous condition is identified, blocks will be arranged in a gradient against the identified area. It is to ensure the heterogeneous area has all treatments in it. Therefore, it can reduce biases against particular treatments and reduce overall experimental errors because all treatments have now equal chance receiving the heterogeneous soil conditions.
In general, one can assume which variable in a split-plot experiment is the most important variable for the researchers by looking at where the variable is being plotted. Usually, the variable that assigned to the subplot is the variable that interest the researchers the most, which is “nutrient level on oat” in this case. The second interest of this split plot experiment is the variable assigned to the plot, which is the “varieties” of oat in this case.
To fully examine the yield of oats due to varieties and nutrient levels in a split plots. I will need to statistically analyse and compare the effects of varieties, nutrient levels, their interaction, and the effects of plots and subplots.
Following code import the data and the table indicates successful import of the dataset.
data("oats")
head(oats,10)
B V N Y
1 I Victory 0.0cwt 111
2 I Victory 0.2cwt 130
3 I Victory 0.4cwt 157
4 I Victory 0.6cwt 174
5 I Golden.rain 0.0cwt 117
6 I Golden.rain 0.2cwt 114
7 I Golden.rain 0.4cwt 161
8 I Golden.rain 0.6cwt 141
9 I Marvellous 0.0cwt 105
10 I Marvellous 0.2cwt 140
Following table describes what is each column of the dataset comprised of.
Variables <- c("B", "V", "N", "Y")
Description <- c("Blocks, levels I, II, III, IV, V, VI",
"Varieties, 3 levels",
"Nitrogen (manurial) treatment, levels 0.0cwt, 0.2cwt, 0.4cwt, 0.6cwt, showing the application in unit of cwt/acre",
"Yields in 1/4lbs per sub-plot, each of area 1/80 acre.")
data.frame(Variables, Description) %>%
kbl() %>%
kable_styling(bootstrap_options = c("bordered", "stripped", "hover"))
| Variables | Description |
|---|---|
| B | Blocks, levels I, II, III, IV, V, VI |
| V | Varieties, 3 levels |
| N | Nitrogen (manurial) treatment, levels 0.0cwt, 0.2cwt, 0.4cwt, 0.6cwt, showing the application in unit of cwt/acre |
| Y | Yields in 1/4lbs per sub-plot, each of area 1/80 acre. |
The dataset has 74 rows of observations and 4 columns of variables. There are 3 variables, B, V, and N, categorised as factor, and the Y is categorised as numerical data.
skim_without_charts(oats)
| Name | oats |
| Number of rows | 72 |
| Number of columns | 4 |
| _______________________ | |
| Column type frequency: | |
| factor | 3 |
| numeric | 1 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| B | 0 | 1 | FALSE | 6 | I: 12, II: 12, III: 12, IV: 12 |
| V | 0 | 1 | FALSE | 3 | Gol: 24, Mar: 24, Vic: 24 |
| N | 0 | 1 | FALSE | 4 | 0.0: 18, 0.2: 18, 0.4: 18, 0.6: 18 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
|---|---|---|---|---|---|---|---|---|---|
| Y | 0 | 1 | 103.97 | 27.06 | 53 | 86 | 102.5 | 121.25 | 174 |
The data set is complete and having no missing data. It can be examined through the results of the columns n_missing and complete_rate. The two columns associate with missing value in each column of the dataset. All variables (B, V, N and Y) have 0 in the n_missing, and 1 in the complete_rate throughout their columns.
Following is the summary of the dataset.
summary(oats)
B V N Y
I :12 Golden.rain:24 0.0cwt:18 Min. : 53.0
II :12 Marvellous :24 0.2cwt:18 1st Qu.: 86.0
III:12 Victory :24 0.4cwt:18 Median :102.5
IV :12 0.6cwt:18 Mean :104.0
V :12 3rd Qu.:121.2
VI :12 Max. :174.0
The dataset is obviously has been cleaned by the up-loader of the dataset. The dataset has been transformed to a long format already, all variables have been assigned to their desired types, having no missing values and unnecessary spaces to trim.
However, variable names of B, V, N, and Y can be changed to a more complete, simple, intuitive terms such as block, variety, nutrient, and yield, to make them more sensible and understandable for readers who are not familiar with the dataset. Following codes complete the changes (click the right button), and the table reveals the outcome of the codes.
oat <- oats %>% # note: I change the name of oats to oat, to preserve original dataset.
rename("block" = "B",
"variety" = "V",
"nutrient" = "N",
"yield" = "Y")
head(oat,10)
block variety nutrient yield
1 I Victory 0.0cwt 111
2 I Victory 0.2cwt 130
3 I Victory 0.4cwt 157
4 I Victory 0.6cwt 174
5 I Golden.rain 0.0cwt 117
6 I Golden.rain 0.2cwt 114
7 I Golden.rain 0.4cwt 161
8 I Golden.rain 0.6cwt 141
9 I Marvellous 0.0cwt 105
10 I Marvellous 0.2cwt 140
The changes have been successfully completed. Now convert the units of nutrient and yield The unit of nutrient is cwt/arce, and the unit of yield is lbs/acre. Personally, I am more comfortable with the unit kg/ha or ton/ha. The conversion is optional but I will carry out the conversion in this section, with:
oat2 <- oat %>%
mutate(nutrient = str_remove(nutrient, "cwt"),
nutrient = as.numeric(nutrient),
nutrient = round(nutrient*50.8023/0.404686),
nutrient = as.factor(nutrient),
yield = round(yield * 0.453592/0.404686))
head(oat2,10)
block variety nutrient yield
1 I Victory 0 124
2 I Victory 25 146
3 I Victory 50 176
4 I Victory 75 195
5 I Golden.rain 0 131
6 I Golden.rain 25 128
7 I Golden.rain 50 180
8 I Golden.rain 75 158
9 I Marvellous 0 118
10 I Marvellous 25 157
attach(oat2)
Now, the units of nutrient and yield have been converted into kg/ha.
After cleaning, the very first step is always visualizing the data because it is the time one can have a chance to have a general understanding of the data set, what are the obvious trends, how data points are spread out, and how each levels of variables are visually different from each other.
There is no obvious yield difference among oat varieties. If a comparison is needed to be made, the Marvellous has the highest average yield, followed by Golden rain, then Victory. However, the differences are not much.
df6.1 <- oat2 %>%
dplyr::select(variety, yield) %>% # MASS package making select of dplyr error, having dplyr:: solve the issue.
group_by(variety) %>%
mutate(count = n(),
xlab = as.factor(paste0(variety, "\n", "(n = ", count, ")")))
ggplot(df6.1, aes(x = reorder(xlab, -yield), y = yield, fill = variety, colour = variety)) +
geom_boxplot(alpha = 0.1, outlier.shape = NA) +
stat_boxplot(geom = "errorbar") +
geom_jitter(width = 0.2) +
stat_summary(fun.y = mean, geom = "point", size = 7, shape = 4, stroke = 2) +
labs(x = "Oat Variety",
y = "Yield, kg/ha",
title = "Figure 1. There is no obvious yield different between Oat Varietes",
subtitle = " 'x' in the boxplot represents mean.") +
theme_bw() +
theme(legend.position = "none",
axis.title.x = element_text(margin = margin(10, 0, 0, 0)),
plot.title = element_text(face = "bold", vjust = 1))
The yield of oat increased with increasing nutrient content. A positive relationship is observed. However, a subtle insight should be noted which is the rate of yield increment from 50 kg/ha of nitrogen to 75 kg/ha is not as much as from 0 kg/ha to 25 kg/ha and from 25 kg/ha to 50 kg/ha.
df6.2 <- oat2 %>%
dplyr::select(nutrient, yield) %>% # MASS package making select of dplyr error, having dplyr:: solve the issue.
group_by(nutrient) %>%
mutate(count = n(),
xlab = as.factor(paste0(nutrient, "\n", "(n = ", count, ")")))
ggplot(df6.2, aes(x = reorder(xlab, yield), y = yield, fill = nutrient, colour = nutrient)) +
geom_boxplot(alpha = 0.1, outlier.shape = NA) +
stat_boxplot(geom = "errorbar") +
geom_jitter(width = 0.2) +
stat_summary(fun.y = mean, geom = "point", size = 7, shape = 4, stroke = 2) +
labs(x = "Nutrient levels, kg/ha",
y = "Yield, kg/ha",
title = "Figure 2. Oat yields increased with increasing Nutrient Content",
subtitle = " 'x' in the boxplot represents mean.") +
theme_bw() +
theme(legend.position = "none",
axis.title.x = element_text(margin = margin(10, 0, 0, 0)),
plot.title = element_text(face = "bold", vjust = 1))
An important question for farmer is that whether the smaller increment of yield at the 2 highest nutrient levels outweigh the cost of the additional nutrient added.
This section is important to observe yields among blocks. In a most basic experimental design, blocks are often treated as replicates and each replicates should be free from environmental noises, meaning the results between each block should be similar.
However, perhaps Block 1 has been identified to have a condition that favors plant growth. This would be a reason why split-plot is selected compared to a CRD (completely randomized design) to account for the heterogeneous condition.
df6.3 <- oat2 %>%
dplyr::select(block, yield) %>% # MASS package making select of dplyr error, having dplyr:: solve the issue.
group_by(block) %>%
mutate(count = n(),
xlab = as.factor(paste0(block, "\n", "(n = ", count, ")")))
ggplot(df6.3, aes(x = reorder(xlab, -yield), y = yield, fill = block, colour = block)) +
geom_boxplot(alpha = 0.1, outlier.shape = NA) +
stat_boxplot(geom = "errorbar") +
geom_jitter(width = 0.2) +
stat_summary(fun.y = mean, geom = "point", size = 7, shape = 4, stroke = 2) +
labs(x = "Block",
y = "Yield, kg/ha",
title = "Figure 3. Yield in Block I has the Highest Yield than Other Blocks",
subtitle = " 'x' in the boxplot represents mean.") +
theme_bw() +
theme(legend.position = "none",
axis.title.x = element_text(margin = margin(10, 0, 0, 0)),
plot.title = element_text(face = "bold", vjust = 1))
Again, the type of statistical analysis from the split-block design allows one to account for the errors (noises) between blocks. Additionally, difference among plots will also be accounted in the statistical analysis section, which further increases the accuracy of the result.
Following graph summarizes the effects of oat varieties and nutrient contents in the soil on yield.
df6.4 <- oat2 %>%
dplyr::select(variety, nutrient, yield) # MASS package making select of dplyr error, having dplyr:: solve the issue.
ggplot(df6.4, aes(x = variety, y = yield, fill = nutrient)) +
geom_boxplot(alpha = 0.8, outlier.shape = NA) +
stat_boxplot(geom = "errorbar") +
geom_point(position = position_jitterdodge(), width = 0.1, size = 2, shape = 21, aes(fill = nutrient)) +
stat_summary(fun.y = mean, geom = "point", size = 5, shape = 4, position = position_jitterdodge(0), colour = "blue", stroke = 1.5) +
labs(x = "Block",
y = "Yield, kg/ha",
title = "Figure 4. The Effect of Variety + Nutrient on Oat Yield.",
subtitle = " 'x' in the boxplot represents mean.",
fill = "Nutrient level, kg/ha: ") +
theme_bw() +
theme(legend.position = "bottom",
axis.title.x = element_text(margin = margin(10, 0, 0, 0)),
plot.title = element_text(face = "bold", vjust = 1)) +
scale_fill_manual(values = c("yellow", "green1", "green2", "green4"))
Following figure 5 has too many variables and making it less effective in story telling. I hope my addition of light to dark green to highlight different nutrient levels will aid you a little.
Insights:
df6.5 <- oat2 %>%
group_by(block, variety) %>%
mutate(count = n(),
xlab = as.factor(paste0(variety, "\n", "(n = ", count, ")")))
ggplot(df6.5, aes(x = variety, y = yield, fill = nutrient)) +
geom_bar(stat = "identity", position = "dodge", colour = "black") +
facet_wrap(~block) +
labs(x = "Oat Variety",
y = "Yield, kg/ha",
fill = "Nutrient level, kg/ha: ",
title = "Figure 5. Oat Yield of Different Treatments in Different Blocks") +
theme_bw() +
theme(legend.position = "bottom",
axis.title.x = element_text(margin = margin(10, 0, 0, 0)),
plot.title = element_text(face = "bold", vjust = 1)) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5))+
scale_fill_manual(values = c("yellow", "green1", "green2", "green4"))
The statistical analysis for split plot design requires a model that takes both fixed and random effects into account. Fixed effect include the variety, nutrient, and the interaction of these two variables. Random effects include the blocks and the plots. Assuming some residuals may be the effect of blocks and plots due to their inherent differences despite the effects might be insignificant and the researchers have done their best in choosing a research field. The model will still incorporate and consider errors due to these random effects.
Prior to all processes, my usual first step is assumptions checking, it includes the testing of the normality of residuals, and data variances among all possible combination of treatments. This steps are important to decide what omnibus and post-hoc tests.
model <- lmer(yield ~ variety + nutrient + variety:nutrient + (1|block) + (1|block:variety),
data = oat2)
summary(model)
Linear mixed model fit by REML. t-tests use Satterthwaite's method [
lmerModLmerTest]
Formula: yield ~ variety + nutrient + variety:nutrient + (1 | block) +
(1 | block:variety)
Data: oat2
REML criterion at convergence: 542.8
Scaled residuals:
Min 1Q Median 3Q Max
-1.81365 -0.55118 0.02205 0.64379 1.57713
Random effects:
Groups Name Variance Std.Dev.
block:variety (Intercept) 133.3 11.54
block (Intercept) 268.9 16.40
Residual 223.1 14.94
Number of obs: 72, groups: block:variety, 18; block, 6
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 89.6667 10.2086 16.1263 8.783 1.52e-07 ***
varietyMarvellous 7.6667 10.8993 30.2658 0.703 0.4872
varietyVictory -9.8333 10.8993 30.2658 -0.902 0.3741
nutrient25 20.6667 8.6237 45.0000 2.397 0.0208 *
nutrient50 38.6667 8.6237 45.0000 4.484 5.02e-05 ***
nutrient75 50.3333 8.6237 45.0000 5.837 5.45e-07 ***
varietyMarvellous:nutrient25 3.6667 12.1957 45.0000 0.301 0.7651
varietyVictory:nutrient25 0.1667 12.1957 45.0000 0.014 0.9892
varietyMarvellous:nutrient50 -4.6667 12.1957 45.0000 -0.383 0.7038
varietyVictory:nutrient50 5.8333 12.1957 45.0000 0.478 0.6347
varietyMarvellous:nutrient75 -5.5000 12.1957 45.0000 -0.451 0.6542
varietyVictory:nutrient75 2.6667 12.1957 45.0000 0.219 0.8279
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Correlation of Fixed Effects:
(Intr) vrtyMr vrtyVc ntrn25 ntrn50 ntrn75 vrM:25 vrV:25 vrM:50
vartyMrvlls -0.534
varityVctry -0.534 0.500
nutrient25 -0.422 0.396 0.396
nutrient50 -0.422 0.396 0.396 0.500
nutrient75 -0.422 0.396 0.396 0.500 0.500
vrtyMrvl:25 0.299 -0.559 -0.280 -0.707 -0.354 -0.354
vrtyVctr:25 0.299 -0.280 -0.559 -0.707 -0.354 -0.354 0.500
vrtyMrvl:50 0.299 -0.559 -0.280 -0.354 -0.707 -0.354 0.500 0.250
vrtyVctr:50 0.299 -0.280 -0.559 -0.354 -0.707 -0.354 0.250 0.500 0.500
vrtyMrvl:75 0.299 -0.559 -0.280 -0.354 -0.354 -0.707 0.500 0.250 0.500
vrtyVctr:75 0.299 -0.280 -0.559 -0.354 -0.354 -0.707 0.250 0.500 0.250
vrV:50 vrM:75
vartyMrvlls
varityVctry
nutrient25
nutrient50
nutrient75
vrtyMrvl:25
vrtyVctr:25
vrtyMrvl:50
vrtyVctr:50
vrtyMrvl:75 0.250
vrtyVctr:75 0.500 0.500
Residues generated from the mixed effect model of the dataset are normally distributed, supported by the following Q-Q plot that most of the points are lying close to the line and are mostly within the 95% confidence shaded area around the line.
oat2$resid <- resid(model)
ggplot(oat2, aes(sample = resid)) +
stat_qq_band() +
stat_qq_line() +
stat_qq_point() +
labs(title = "Q-Q plot",
subtitle = "model = lmer(yield ~ variety + nutrient + variety:nutrient + (1|block) + (1|block:variety),
data = oat2)") +
theme(plot.title = element_text(face = "bold"))
The normality is also supported by Shapiro-Wilk test that test on the residuals of the model and having a p-value of 0.2811, which is higher than 0.05, and fail to reject the null hypothesis that the residuals are normally distributed.
shapiro.test(resid(model))
Shapiro-Wilk normality test
data: resid(model)
W = 0.97928, p-value = 0.2811
Applying Levene’s test to test the spread of variances among levels within two of the fixed effect variables - “variety” and “nutrient”. Levene’s tests show that variances among the treatment groups of variety are equal to each other, with P-value higher than 0.05. The variances among the treatment groups of nutrient level also equal to each other (P-value > 0.05).
LeveneTest(oat2$yield ~ oat2$variety)
Levene's Test for Homogeneity of Variance (center = median)
Df F value Pr(>F)
group 2 0.4424 0.6443
69
LeveneTest(oat2$yield ~ oat2$block)
Levene's Test for Homogeneity of Variance (center = median)
Df F value Pr(>F)
group 5 0.3629 0.8721
66
LeveneTest(oat2$yield ~ oat2$nutrient)
Levene's Test for Homogeneity of Variance (center = median)
Df F value Pr(>F)
group 3 0.0501 0.9851
68
Based on the results of assumption tests in the previous section, anova is selected as the omnibus test to find out if there are differences between groups.
print(anova(model))
Type III Analysis of Variance Table with Satterthwaite's method
Sum Sq Mean Sq NumDF DenDF F value Pr(>F)
variety 668.4 334.2 2 10 1.4979 0.2698
nutrient 25195.2 8398.4 3 45 37.6436 2.502e-12 ***
variety:nutrient 418.0 69.7 6 45 0.3122 0.9273
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Computed ANOVA table shows that the F-value of nutrient treatment is very high which indicating that the variances between groups is significantly higher than variances within group and resulting a highly significant P-value of way lesser than 0.05. There are no significant difference between oat varieties, as well as interaction between oat varieties and nutrient levels.
Since there is significant result from the omnibus test, post-hoc test is allowed to be proceeded. The column “.group” shows a summary of the statistical results with letters. Letters that overlapped each other are not significant different or otherwise (P = 0.05).
ls_n <- lsmeans(model, ~nutrient)
lscld_n <- cld(ls_n, Letters = LETTERS, reversed = T)
lscld_n
nutrient lsmean SE df lower.CL upper.CL .group
75 138.3 8.04 6.8 119.2 157 A
50 128.0 8.04 6.8 108.9 147 A
25 110.9 8.04 6.8 91.8 130 B
0 88.9 8.04 6.8 69.8 108 C
Results are averaged over the levels of: variety
Degrees-of-freedom method: kenward-roger
Confidence level used: 0.95
P value adjustment: tukey method for comparing a family of 4 estimates
significance level used: alpha = 0.05
NOTE: If two or more means share the same grouping symbol,
then we cannot show them to be different.
But we also did not show them to be the same.
A 95% confidence interval plot is also a very helpful visualization.
ggplot(lscld_n, aes(x = nutrient, y = lsmean, colour = nutrient)) +
geom_errorbar(aes(ymin = lower.CL, ymax = upper.CL), size = 2) +
geom_point(size = 6) +
geom_text(aes(label = .group), vjust = -6, stroke = 1.5) +
scale_y_continuous(lim = c(60, 180), breaks = seq(60, 180, 20)) +
theme_classic() +
theme(legend.position = "none",
plot.title = element_text(face = "bold")) +
labs(title = "CI Plot: Nutrient levels",
x = "Nutrient level, kg/ha",
y = "Plant yield, kg/ha")
Insights: * Nutrients levels at 50 kg/ha and 75 kg/ha have significant higher plant yield than 0 kg/ha and 25 kg/ha. * However, two of the 50 kg/ha and 75 kg/ha nutrients level are not significantly different from each other. * This plot does not separate out the data from 3 varieties of oats.
A final overall post-hoc test will be computed to obverse overall differences.
Following shows the statistical results between the treatment groups of variety and nutrient. Letters that overlap each other indicating not significant different (P = 0.05).
ls_vn <- lsmeans(model, ~ variety * nutrient)
lscld_vn <- cld(ls_vn, Letters = LETTERS, reversed = T)
lscld_vn
variety nutrient lsmean SE df lower.CL upper.CL .group
Marvellous 75 142.2 10.2 16.1 120.5 164 A
Golden.rain 75 140.0 10.2 16.1 118.4 162 A
Victory 75 132.8 10.2 16.1 111.2 154 A C
Marvellous 50 131.3 10.2 16.1 109.7 153 AB
Golden.rain 50 128.3 10.2 16.1 106.7 150 ABCD
Victory 50 124.3 10.2 16.1 102.7 146 ABCDE
Marvellous 25 121.7 10.2 16.1 100.0 143 ABCDE
Golden.rain 25 110.3 10.2 16.1 88.7 132 ABCDEF
Victory 25 100.7 10.2 16.1 79.0 122 B DEF
Marvellous 0 97.3 10.2 16.1 75.7 119 CDEF
Golden.rain 0 89.7 10.2 16.1 68.0 111 EF
Victory 0 79.8 10.2 16.1 58.2 101 F
Degrees-of-freedom method: kenward-roger
Confidence level used: 0.95
P value adjustment: tukey method for comparing a family of 12 estimates
significance level used: alpha = 0.05
NOTE: If two or more means share the same grouping symbol,
then we cannot show them to be different.
But we also did not show them to be the same.
Insights:
ggplot(lscld_vn, aes(x = nutrient, y = lsmean, colour = variety)) +
geom_path(aes(group = variety), size = 1.5) +
geom_point(size = 4) +
theme_classic() +
theme(legend.position = "none") +
labs(title = "Interaction between Nutrient Levels and Variety",
x = "Nutrient level, kg/ha",
y = "Plant yield (average), kg/ha") +
geom_text(data = lscld_vn %>% filter(nutrient == "75"),
aes(label = variety),
hjust = -0.2)
Base on the data from this dataset, result shows that there is no significant yield difference among Marvellous, Golden rain and Victory, with P-value of higher than 0.05. However, there is significant yield difference between nutrient levels applied in the experiment (P-value < 0.05). It implies that as long as any higher rate of given nutrient level in the experiment is added, any oat variety that used in the experiment can outcompete another variety used in the experiment, though the result might be insignificant.
It should be noted that although the highest nutrient level applied in the experiment (75 kg/ha) has a higher yield than the second highest nutrient level (50 kg/ha), the difference was not significant. The suitability of adopting the highest nutrient level in this experiment should rely on the farmer’s context, economical-feasibility, and local environmental concerns. On the other aspect, there is no significant interaction effect between oat varieties and nutrient levels applied in the experiment. Base on the variety and nutrient levels applied in the experiment, there was no evidence from the statistical result showing that one oat variety may response to the nutrient significantly than another (P-value > 0.05).
If a decision has to be made to choose the best combination of oat variety and nutrient level base on the result of this project, my own top 3 best combinations are the three varieties, Marvellous, Golden rain, and Victory, with 50 kg per ha of manurial nitrogen application. Marvellous will be my most preferred oat variety, then followed by Golden rain, and lastly the victory oat variety. I am avoiding the maximum yield with the highest rate of nitrogen application as the result of yield gained was proven not significantly (P-value > 0.05) higher from my top 3 recommendations, I am also aiming for a better farm economics and environmental outcome.
This is a personal project created and designed for skill demonstration and non-commercial use only. All photos in this project, such as those that applied as the thumbnail are just for demonstration only, they are not related to the dataset, the location where the data were collected, and the results of this analysis.
McDonald G 2021, Raised beds – design, layout, construction and renovation, view 10 July 2021, https://www.agric.wa.gov.au/waterlogging/raised-beds-%E2%80%93-design-layout-construction-and-renovation
Sendelbach 2005, Avena sativa L (oat), viewed 05 July 2021, https://en.wikipedia.org/wiki/Oat#/media/File:Avena_sativa_L.jpg
smartcopying n.d., Quick Guide to Creative Commons, viewed 04 July 2021, https://smartcopying.edu.au/quick-guide-to-creative-commons/
Venables, W. N. and Ripley, B. D. (2002) Modern Applied Statistics with S. Fourth edition. Springer.
Wright 2021, Package ‘agridat’, CRAN, viewed 04 July 2021, http://kwstat.github.io/agridat/
Yates, F. (1935) Complex experiments, Journal of the Royal Statistical Society Suppl. 2, 181–247.
Yates, F. (1970) Experimental design: Selected papers of Frank Yates, C.B.E, F.R.S. London: Griffin.
data <- read.csv("split plot design.csv")
head(data,10)
Replication Nitrogen Varieties Yield..kgs.
1 1 Nitrogen (0kgs/ha) s1 4430
2 1 Nitrogen (0kgs/ha) s2 3944
3 1 Nitrogen (0kgs/ha) s3 3464
4 1 Nitrogen (0kgs/ha) s4 4426
5 1 Nitrogen (30kgs/ha) s1 5418
6 1 Nitrogen (30kgs/ha) s2 6562
7 1 Nitrogen (30kgs/ha) s3 4768
8 1 Nitrogen (30kgs/ha) s4 5192
9 1 Nitrogen (60kgs/ha) s1 6076
10 1 Nitrogen (60kgs/ha) s2 6008
data$Nitrogen <- as.factor(data$Nitrogen)
data$Varieties <- as.factor(data$Varieties)
data$Replication <- as.factor(data$Replication)
data$Yield..kgs. <- as.numeric((data$Yield..kgs.))
head(data,10)
Replication Nitrogen Varieties Yield..kgs.
1 1 Nitrogen (0kgs/ha) s1 4430
2 1 Nitrogen (0kgs/ha) s2 3944
3 1 Nitrogen (0kgs/ha) s3 3464
4 1 Nitrogen (0kgs/ha) s4 4426
5 1 Nitrogen (30kgs/ha) s1 5418
6 1 Nitrogen (30kgs/ha) s2 6562
7 1 Nitrogen (30kgs/ha) s3 4768
8 1 Nitrogen (30kgs/ha) s4 5192
9 1 Nitrogen (60kgs/ha) s1 6076
10 1 Nitrogen (60kgs/ha) s2 6008
str(data)
'data.frame': 72 obs. of 4 variables:
$ Replication: Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
$ Nitrogen : Factor w/ 6 levels "Nitrogen (0kgs/ha)",..: 1 1 1 1 5 5 5 5 6 6 ...
$ Varieties : Factor w/ 4 levels "s1","s2","s3",..: 1 2 3 4 1 2 3 4 1 2 ...
$ Yield..kgs.: num 4430 3944 3464 4426 5418 ...
attach(data)
After cleaning, the very first step is always visualizing the data because it is the time one can have a chance to have a general understanding of the data set, what are the obvious trends, how data points are spread out, and how each levels of variables are visually different from each other.
There is no obvious yield difference among oat varieties. If a comparison is needed to be made, the Marvellous has the highest average yield, followed by Golden rain, then Victory. However, the differences are not much.
mydf <- data %>%
dplyr::select(Varieties, Yield..kgs.) %>% # MASS package making select of dplyr error, having dplyr:: solve the issue.
group_by(Varieties) %>%
mutate(count = n(),
xlab = as.factor(paste0(Varieties, "\n", "(n = ", count, ")")))
ggplot(mydf, aes(x = reorder(xlab, -Yield..kgs.), y = Yield..kgs., fill = Varieties, colour = Varieties)) +
geom_boxplot(alpha = 0.1, outlier.shape = NA) +
stat_boxplot(geom = "errorbar") +
geom_jitter(width = 0.2) +
stat_summary(fun.y = mean, geom = "point", size = 7, shape = 4, stroke = 2) +
labs(x = "Variety",
y = "Yield, kg/ha",
title = "Figure 1. There is no obvious yield difference between Varieties",
subtitle = " 'x' in the boxplot represents mean.") +
theme_bw() +
theme(legend.position = "none",
axis.title.x = element_text(margin = margin(10, 0, 0, 0)),
plot.title = element_text(face = "bold", vjust = 1))
mydf2 <- data %>%
dplyr::select(Nitrogen, Yield..kgs.) %>% # MASS package making select of dplyr error, having dplyr:: solve the issue.
group_by(Nitrogen) %>%
mutate(count = n(),
xlab = as.factor(paste0(Nitrogen, "\n", "(n = ", count, ")")))
ggplot(mydf2, aes(x = reorder(xlab, -Yield..kgs.), y = Yield..kgs., fill = Nitrogen, colour = Nitrogen)) +
geom_boxplot(alpha = 0.1, outlier.shape = NA) +
stat_boxplot(geom = "errorbar") +
geom_jitter(width = 0.2) +
stat_summary(fun.y = mean, geom = "point", size = 7, shape = 4, stroke = 2) +
labs(x = "Nitrogen",
y = "Yield, kg/ha",
title = "Figure 1. There is no obvious yield difference between Nitrogen Levels",
subtitle = " 'x' in the boxplot represents mean.") +
theme_bw() +
theme(legend.position = "none",
axis.title.x = element_text(margin = margin(10, 0, 0, 0)),
plot.title = element_text(face = "bold", vjust = 1))
mydf <- data %>%
dplyr::select(Replication, Yield..kgs.) %>% # MASS package making select of dplyr error, having dplyr:: solve the issue.
group_by(Replication) %>%
mutate(count = n(),
xlab = as.factor(paste0(Replication, "\n", "(n = ", count, ")")))
ggplot(mydf, aes(x = reorder(xlab, -Yield..kgs.), y = Yield..kgs., fill = Replication, colour = Replication)) +
geom_boxplot(alpha = 0.1, outlier.shape = NA) +
stat_boxplot(geom = "errorbar") +
geom_jitter(width = 0.2) +
stat_summary(fun.y = mean, geom = "point", size = 7, shape = 4, stroke = 2) +
labs(x = "Replication",
y = "Yield, kg/ha",
title = "Figure 1. There is no obvious yield difference between Replications",
subtitle = " 'x' in the boxplot represents mean.") +
theme_bw() +
theme(legend.position = "none",
axis.title.x = element_text(margin = margin(10, 0, 0, 0)),
plot.title = element_text(face = "bold", vjust = 1))
library(dplyr)
SUM<- group_by(data, Nitrogen) %>%
summarise(
count = n(),
mean = mean(Yield..kgs., na.rm = TRUE),
sd = sd(Yield..kgs., na.rm = TRUE),
median = median(Yield..kgs., na.rm = TRUE),
max = max(Yield..kgs., na.rm = TRUE),
min = min(Yield..kgs., na.rm = TRUE),
IQR = IQR(Yield..kgs., na.rm = TRUE)
)
SUM
# A tibble: 6 × 8
Nitrogen count mean sd median max min IQR
<fct> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Nitrogen (0kgs/ha) 12 4284. 671. 4428 5318 3142 1053
2 Nitrogen (120kgs/ha) 12 5915. 1362. 6416 7139 2774 900.
3 Nitrogen (150kgs/ha) 12 6068. 2024. 6628 7848 1414 1356.
4 Nitrogen (180kgs/ha) 12 5845. 2552. 6117 8832 1960 2981.
5 Nitrogen (30kgs/ha) 12 4912. 1011. 5001 6562 3142 1204
6 Nitrogen (60kgs/ha) 12 5771. 859. 6045 6704 4146 844
library(dplyr)
SUM2<- group_by(data, Varieties) %>%
summarise(
count = n(),
mean = mean(Yield..kgs., na.rm = TRUE),
sd = sd(Yield..kgs., na.rm = TRUE),
median = median(Yield..kgs., na.rm = TRUE),
max = max(Yield..kgs., na.rm = TRUE),
min = min(Yield..kgs., na.rm = TRUE),
IQR = IQR(Yield..kgs., na.rm = TRUE)
)
SUM2
# A tibble: 4 × 8
Varieties count mean sd median max min IQR
<fct> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 s1 18 5655. 1763. 5581 8818 1960 2194
2 s2 18 6184. 1414. 6491 8832 3660 1462.
3 s3 18 5626. 1287. 5936 7387 3142 1302
4 s4 18 4398. 1637. 4691 7122 1414 2601.
library(dplyr)
SUM<- group_by(data, Replication) %>%
summarise(
count = n(),
mean = mean(Yield..kgs., na.rm = TRUE),
sd = sd(Yield..kgs., na.rm = TRUE),
median = median(Yield..kgs., na.rm = TRUE),
max = max(Yield..kgs., na.rm = TRUE),
min = min(Yield..kgs., na.rm = TRUE),
IQR = IQR(Yield..kgs., na.rm = TRUE)
)
SUM
# A tibble: 3 × 8
Replication count mean sd median max min IQR
<fct> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 24 5385. 1751. 5693 8452 1414 2058
2 2 24 5881. 1396. 5869 8832 1960 1729
3 3 24 5132. 1735. 5158 8818 2014 2907
x <- aov(Yield..kgs. ~ Varieties:Replication:Nitrogen, data = data)
model.tables(x, "mean")
Tables of means
Grand mean
5465.861
Varieties:Replication:Nitrogen
, , Nitrogen = Nitrogen (0kgs/ha)
Replication
Varieties 1 2 3
s1 4430 4478 3850
s2 3944 5318 3660
s3 3464 4914 3142
s4 4426 4944 4836
, , Nitrogen = Nitrogen (120kgs/ha)
Replication
Varieties 1 2 3
s1 6462 5744 6580
s2 7139 7056 6564
s3 5792 6982 6370
s4 2774 5880 3639
, , Nitrogen = Nitrogen (150kgs/ha)
Replication
Varieties 1 2 3
s1 7290 5036 7552
s2 7682 7848 6576
s3 7080 6594 6320
s4 1414 6662 2766
, , Nitrogen = Nitrogen (180kgs/ha)
Replication
Varieties 1 2 3
s1 8452 1960 8818
s2 6228 8832 6006
s3 5594 7387 5480
s4 2248 7122 2014
, , Nitrogen = Nitrogen (30kgs/ha)
Replication
Varieties 1 2 3
s1 5418 4482 3850
s2 6562 5166 3660
s3 4768 5858 3142
s4 5192 6004 4836
, , Nitrogen = Nitrogen (60kgs/ha)
Replication
Varieties 1 2 3
s1 6076 4604 6704
s2 6008 6420 6642
s3 6244 6127 6014
s4 4546 5724 4146
library(agricolae)
model <- with(data, sp.plot(Replication, Nitrogen, Varieties, Yield..kgs.))
ANALYSIS SPLIT PLOT: Yield..kgs.
Class level information
Nitrogen : Nitrogen (0kgs/ha) Nitrogen (30kgs/ha) Nitrogen (60kgs/ha) Nitrogen (120kgs/ha) Nitrogen (150kgs/ha) Nitrogen (180kgs/ha)
Varieties : s1 s2 s3 s4
Replication : 1 2 3
Number of observations: 72
Analysis of Variance Table
Response: Yield..kgs.
Df Sum Sq Mean Sq F value Pr(>F)
Replication 2 6968351 3484175 NaN NaN
Nitrogen 5 30077112 6015422 9.4205 0.001525 **
Ea 10 6385433 638543 NaN NaN
Varieties 3 30893555 10297852 4.4019 0.009748 **
Nitrogen:Varieties 15 32967362 2197824 0.9395 0.532468
Eb 36 84218377 2339399 NaN NaN
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
cv(a) = 14.6 %, cv(b) = 28 %, Mean = 5465.861
summary(model)
Length Class Mode
ANOVA 5 anova list
gl.a 1 -none- numeric
gl.b 1 -none- numeric
Ea 1 -none- numeric
Eb 1 -none- numeric
head(data,4)
Replication Nitrogen Varieties Yield..kgs.
1 1 Nitrogen (0kgs/ha) s1 4430
2 1 Nitrogen (0kgs/ha) s2 3944
3 1 Nitrogen (0kgs/ha) s3 3464
4 1 Nitrogen (0kgs/ha) s4 4426
library(doebioresearch)
model <- splitplot(data[4],Replication, Nitrogen, Varieties,1)
model
$Yield..kgs.
$Yield..kgs.[[1]]
$Yield..kgs.[[1]][[1]]
Analysis of Variance Table
Response: dependent.var
Df Sum Sq Mean Sq F value Pr(>F)
block 2 6968351 3484175 5.4564 0.024999 *
main.plot 5 30077112 6015422 9.4205 0.001525 **
Ea 10 6385433 638543
sub.plot 3 30893555 10297852 4.4019 0.009748 **
main.plot:sub.plot 15 32967362 2197824 0.9395 0.532468
Eb 36 84218377 2339399
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
$Yield..kgs.[[1]][[2]]
[1] "CV(a): 14.62 , CV(b) : 27.983"
$Yield..kgs.[[1]][[3]]
[1] "R Square 0.56"
$Yield..kgs.[[1]][[4]]
Shapiro-Wilk normality test
data: model$residuals
W = 0.88825, p-value = 1.068e-05
$Yield..kgs.[[1]][[5]]
[1] "Normality assumption is violated"
$Yield..kgs.[[1]][[6]]
[1] "The means of one or more levels of main plot factor are not same, so go for multiple comparison test"
$Yield..kgs.[[1]][[7]]
$Yield..kgs.[[1]][[7]][[1]]
MSerror Df Mean CV t.value LSD
638543.3 10 5465.861 14.61964 2.228139 726.8785
$Yield..kgs.[[1]][[7]][[2]]
dependent.var groups
Nitrogen (150kgs/ha) 6068.333 a
Nitrogen (120kgs/ha) 5915.167 a
Nitrogen (180kgs/ha) 5845.083 a
Nitrogen (60kgs/ha) 5771.250 a
Nitrogen (30kgs/ha) 4911.500 b
Nitrogen (0kgs/ha) 4283.833 b
$Yield..kgs.[[1]][[8]]
[1] "The means of one or more levels of sub plot factor are not same, so go for multiple comparison test"
$Yield..kgs.[[1]][[9]]
$Yield..kgs.[[1]][[9]][[1]]
MSerror Df Mean CV t.value LSD
2339399 36 5465.861 27.98296 2.028094 1033.996
$Yield..kgs.[[1]][[9]][[2]]
dependent.var groups
s2 6183.944 a
s1 5654.778 a
s3 5626.222 a
s4 4398.500 b
$Yield..kgs.[[1]][[10]]
[1] "All the interaction level means are same so dont go for any multiple comparison test"
$Yield..kgs.[[1]][[11]]
$Yield..kgs.[[1]][[11]][[1]]
MSerror Df Mean CV t.value LSD
2339399 36 5465.861 27.98296 2.028094 2532.763
$Yield..kgs.[[1]][[11]][[2]]
dependent.var groups
Nitrogen (150kgs/ha):s2 7368.667 a
Nitrogen (180kgs/ha):s2 7022.000 ab
Nitrogen (120kgs/ha):s2 6919.667 ab
Nitrogen (150kgs/ha):s3 6664.667 abc
Nitrogen (150kgs/ha):s1 6626.000 abcd
Nitrogen (180kgs/ha):s1 6410.000 abcd
Nitrogen (120kgs/ha):s3 6381.333 abcd
Nitrogen (60kgs/ha):s2 6356.667 abcde
Nitrogen (120kgs/ha):s1 6262.000 abcdef
Nitrogen (180kgs/ha):s3 6153.667 abcdef
Nitrogen (60kgs/ha):s3 6128.333 abcdefg
Nitrogen (60kgs/ha):s1 5794.667 abcdefg
Nitrogen (30kgs/ha):s4 5344.000 abcdefg
Nitrogen (30kgs/ha):s2 5129.333 abcdefg
Nitrogen (60kgs/ha):s4 4805.333 bcdefg
Nitrogen (0kgs/ha):s4 4735.333 bcdefg
Nitrogen (30kgs/ha):s3 4589.333 bcdefg
Nitrogen (30kgs/ha):s1 4583.333 bcdefg
Nitrogen (0kgs/ha):s2 4307.333 cdefg
Nitrogen (0kgs/ha):s1 4252.667 cdefg
Nitrogen (120kgs/ha):s4 4097.667 defg
Nitrogen (0kgs/ha):s3 3840.000 efg
Nitrogen (180kgs/ha):s4 3794.667 fg
Nitrogen (150kgs/ha):s4 3614.000 g
Design and analysis of experiments / Douglas C. Montgomery. — Eighth edition. ISBN 978-1-118-14692-7
DoEOpt04 <- data.frame(x1 = c(-1,-1,1,1,0,0,0,0,0),
x2 = c(-1,1,-1,1,0,0,0,0,0),
Time = c(30,30,40,40,35,35,35,35,35),
Temp = c(150,160,150,160,155,155,155,155,155),
Y = c(39.3,40,40.9,41.5,40.3,40.5,40.7,40.2,40.6)
)
head(DoEOpt04,5)
x1 x2 Time Temp Y
1 -1 -1 30 150 39.3
2 -1 1 30 160 40.0
3 1 -1 40 150 40.9
4 1 1 40 160 41.5
5 0 0 35 155 40.3
View(DoEOpt04)
library(rsm)
str(DoEOpt04)
'data.frame': 9 obs. of 5 variables:
$ x1 : num -1 -1 1 1 0 0 0 0 0
$ x2 : num -1 1 -1 1 0 0 0 0 0
$ Time: num 30 30 40 40 35 35 35 35 35
$ Temp: num 150 160 150 160 155 155 155 155 155
$ Y : num 39.3 40 40.9 41.5 40.3 40.5 40.7 40.2 40.6
knitr::include_graphics("rsm.png")
DoEOpt04 <- as.coded.data(DoEOpt04,
x1 ~ (Time-35)/5,
x2 ~ (Temp-155)/5)
model <- rsm(Y ~ FO(x1,x2) + TWI(x1,x2), data = DoEOpt04)
summary(model)
Call:
rsm(formula = Y ~ FO(x1, x2) + TWI(x1, x2), data = DoEOpt04)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 40.444444 0.062311 649.0693 1.648e-13 ***
x1 0.775000 0.093467 8.2917 0.0004166 ***
x2 0.325000 0.093467 3.4772 0.0177127 *
x1:x2 -0.025000 0.093467 -0.2675 0.7997870
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Multiple R-squared: 0.9418, Adjusted R-squared: 0.9069
F-statistic: 26.97 on 3 and 5 DF, p-value: 0.00163
Analysis of Variance Table
Response: Y
Df Sum Sq Mean Sq F value Pr(>F)
FO(x1, x2) 2 2.82500 1.41250 40.4213 0.0008188
TWI(x1, x2) 1 0.00250 0.00250 0.0715 0.7997870
Residuals 5 0.17472 0.03494
Lack of fit 1 0.00272 0.00272 0.0633 0.8137408
Pure error 4 0.17200 0.04300
Stationary point of response surface:
x1 x2
13 31
Stationary point in original units:
Time Temp
100 310
Eigenanalysis:
eigen() decomposition
$values
[1] 0.0125 -0.0125
$vectors
[,1] [,2]
x1 -0.7071068 -0.7071068
x2 0.7071068 -0.7071068
knitr::include_graphics("rsm2.png")
The interaction term in the model above is not significant. In other words, the quadratic function in our model is not necessary. We can therefore estimated the new model without the quadratic function as shown below
knitr::include_graphics("rsm3.png")
model1 <- rsm(Y ~ FO(x1,x2), data = DoEOpt04)
summary(model1)
Call:
rsm(formula = Y ~ FO(x1, x2), data = DoEOpt04)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 40.444444 0.057288 705.9869 5.451e-16 ***
x1 0.775000 0.085932 9.0188 0.000104 ***
x2 0.325000 0.085932 3.7821 0.009158 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Multiple R-squared: 0.941, Adjusted R-squared: 0.9213
F-statistic: 47.82 on 2 and 6 DF, p-value: 0.0002057
Analysis of Variance Table
Response: Y
Df Sum Sq Mean Sq F value Pr(>F)
FO(x1, x2) 2 2.82500 1.41250 47.8213 0.0002057
Residuals 6 0.17722 0.02954
Lack of fit 2 0.00522 0.00261 0.0607 0.9419341
Pure error 4 0.17200 0.04300
Direction of steepest ascent (at radius 1):
x1 x2
0.9221944 0.3867267
Corresponding increment in original units:
Time Temp
4.610972 1.933633
knitr::include_graphics("rsm4.png")
The coefficient of the new model are significant with a higher R-square and a higher adjusted R-square as compare to the previous model. In other words, the new model is doing better than the previous one. Let’s now build the contour plot of the model above as a function of x1 and x2.
contour(model1, ~ x1+x2,
image = TRUE,
xlabs=c("Time (min)", "Temperature (ºC)"))
points(DoEOpt04$Time,DoEOpt04$Temp)
From the plot above, the yield increases as both temperature and time increases. However the yield seems to be far from the optimal. In other words, we do not see the region of the maximum yield. However we can see the path of the maximum increase in the yield. This is what we the direction of the steepest ascent. Consider the graph below
knitr::include_graphics("ascent.png")
In the section we are going to see how to calculate the direction of the steepest ascent and how to follow the direction of the steepest ascent to the vicinity of the maximum response.
In the contour plot, the direction of the steepest ascent is perpendicular to the contour lines as shown below;
knitr::include_graphics("ascent2.png")
This indicates the direction of the maximum increase in the response. The information in the RSM output tells the relative variation between the two factors. Now, in order to follow the path of the steepest ascent in this design, for each 4.6 min of increase in time, we should increase the temperature by 1.933 degrees.
knitr::include_graphics("ascent3.png")
The mathematical formular is as shown below
knitr::include_graphics("ascent4.png")
The graph above shows that if we increase time by 10 min, we will have to increase temperature by 4.1 degrees Celsius.
knitr::include_graphics("max.png")
The plot above shows that the condition around 85 minutes and the temperature of around 175 degrees Celcius is the optimal condition for the maximum increase in yield. In other words, these are the experimental space for the highest yields.Now, let us check these assumption by building an experimental design around this point. We can now use the 85 minutes and the temperature of 175 degrees Celsius as the central point. Remember the range of time and temperature for the experimental design can be chosen freely. Lets go with a range of 10 minutes and 10 degrees Celsius. When the reaction changes from 85 min to 90 minutes, temperature rise from 175 to 180 degrees Celsius. Consider the figure below
knitr::include_graphics("design.png")
From the figure above, the design must run as a single replicate in the factorial point and five replicate at the central point.
Design and analysis of experiments / Douglas C. Montgomery. — Eighth edition. ISBN 978-1-118-14692-7
(run this code to build the DoEOpt05 data set before following the analysis)
DoEOpt05 <- data.frame(x1 = c(-1,-1,1,1,0,0,0,0,0),
x2 = c(-1,1,-1,1,0,0,0,0,0),
Time = c(80,80,90,90,85,85,85,85,85),
Temp = c(170,180,170,180,175,175,175,175,175),
Y = c(76.5,77,78,79.5,79.9,80.3,80,79.7,79.8)
)
head(DoEOpt05,5)
x1 x2 Time Temp Y
1 -1 -1 80 170 76.5
2 -1 1 80 180 77.0
3 1 -1 90 170 78.0
4 1 1 90 180 79.5
5 0 0 85 175 79.9
View(DoEOpt05)
library(rsm)
str(DoEOpt05)
'data.frame': 9 obs. of 5 variables:
$ x1 : num -1 -1 1 1 0 0 0 0 0
$ x2 : num -1 1 -1 1 0 0 0 0 0
$ Time: num 80 80 90 90 85 85 85 85 85
$ Temp: num 170 180 170 180 175 175 175 175 175
$ Y : num 76.5 77 78 79.5 79.9 80.3 80 79.7 79.8
After checking the structure of the data, the next step is to assign the relationship between the coded and the natural variable using the command below. We will therefore assign the x1 and x2 as the coded data.
DoEOpt05 <- as.coded.data(DoEOpt05,
x1 ~ (Time-85)/5,
x2 ~ (Temp-175)/5)
str(DoEOpt05)
Classes 'coded.data' and 'data.frame': 9 obs. of 5 variables:
$ x1 : num -1 -1 1 1 0 0 0 0 0
$ x2 : num -1 1 -1 1 0 0 0 0 0
$ Time: num 80 80 90 90 85 85 85 85 85
$ Temp: num 170 180 170 180 175 175 175 175 175
$ Y : num 76.5 77 78 79.5 79.9 80.3 80 79.7 79.8
- attr(*, "codings")=List of 2
..$ x1:Class 'formula' language x1 ~ (Time - 85)/5
.. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
..$ x2:Class 'formula' language x2 ~ (Temp - 175)/5
.. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
- attr(*, "rsdes")=List of 2
..$ primary: chr [1:2] "x1" "x2"
..$ call : language as.coded.data(data = DoEOpt05, x1 ~ (Time - 85)/5, x2 ~ (Temp - 175)/5)
head(DoEOpt05,5)
Time Temp Time Temp Y
1 80 170 80 170 76.5
2 80 180 80 180 77.0
3 90 170 90 170 78.0
4 90 180 90 180 79.5
5 85 175 85 175 79.9
Data are stored in coded form using these coding formulas ...
x1 ~ (Time - 85)/5
x2 ~ (Temp - 175)/5
The next step is to run the regression model
model <- rsm(Y ~ FO(x1,x2) + TWI(x1,x2), data = DoEOpt05)
summary(model)
Call:
rsm(formula = Y ~ FO(x1, x2) + TWI(x1, x2), data = DoEOpt05)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 78.96667 0.49148 160.6702 1.772e-10 ***
x1 1.00000 0.73722 1.3564 0.2330
x2 0.50000 0.73722 0.6782 0.5277
x1:x2 0.25000 0.73722 0.3391 0.7483
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Multiple R-squared: 0.3257, Adjusted R-squared: -0.07891
F-statistic: 0.805 on 3 and 5 DF, p-value: 0.5426
Analysis of Variance Table
Response: Y
Df Sum Sq Mean Sq F value Pr(>F)
FO(x1, x2) 2 5.000 2.500 1.150 0.3882680
TWI(x1, x2) 1 0.250 0.250 0.115 0.7483056
Residuals 5 10.870 2.174
Lack of fit 1 10.658 10.658 201.094 0.0001436
Pure error 4 0.212 0.053
Stationary point of response surface:
x1 x2
-2 -4
Stationary point in original units:
Time Temp
75 155
Eigenanalysis:
eigen() decomposition
$values
[1] 0.125 -0.125
$vectors
[,1] [,2]
x1 0.7071068 -0.7071068
x2 0.7071068 0.7071068
contour(model, ~ x1+x2,
image = TRUE,
xlabs=c("Time (min)", "Temperature (ºC)"))
points(DoEOpt05$Time,DoEOpt05$Temp)
From the model above, the results are pretty disappointing. Neither x1, x2 nor their interaction were significant. Besides, R-square and adjusted R-square are extremely low. Besides, the p-value shows that the model is not significant, and lack of fit was significant. Let’s take a keen look at the results and try to understand what is happening.
knitr::include_graphics("data.png")
From the results below, the effect of time from 77.0 to 79.5 is 2.5 while from 76.5 to 78 is 1.5. On the other hand, the effect of temperature from 76.5 to 77.0 is 0.5 while the effect of temperature from 78 to 79.5 is 1.5. The error variation among replicate point is 0.6. In other words, the effect of time and temperature seem to be higher than the intrinsic error.
knitr::include_graphics("data2.png")
The linear model with interaction is not adequate to describe the behavior of the data. We can also visualize this as shown below
plot(DoEOpt05$Time, DoEOpt05$Y, xlab = "Time", ylab = "Yield (%)")
plot(DoEOpt05$Temp, DoEOpt05$Y, xlab = "Temperature", ylab = "Yield (%)")
WE can see the non-linear behavior when we plot the experimental results against the regression variables for both time and temperature. The behavior strictly follows the second order model. The quadratic model with interaction factor effect has six regression parameters, however, our design has five independent runs since the replicates are not independent runs. We need at least one independent run for each regression parameter. The 2^k design with central point can check model linearity but cannot estimates the regression parameters. The following section will see the experimental design that can fit the second order model.
One of the Key ideas behind an experimental design is to provide the results distribution of the data in the region of interest. The most used approach in the second order model is the central composite design. Starting by the power of k-factorial design with central point, the new point are obtained by rotating the factorial points by 45 degrees over x1 and x2. The new experimental points are called axial points and the cartesian codes are combination of (0, -α) or (0, +α). Consider the figure below.
knitr::include_graphics("alpha.png")
Using two axial points and the central point, we can draw a triangle with hypotenuse = 2 and legs = α. The length of the hypotenuse is the difference between the high and low levels of the codes of the variable in between the power or K-design, which is equal to 2units. Consider the figure below
knitr::include_graphics("alpha2.png")
Using the well known Pythagoras theorem, we can easily determine the alpha value as 1.212 as shown below.
knitr::include_graphics("alpha3.png")
This way, the axial points are the combinations of 1.141 and 0) and -1.414 and 0. Consider the figure below.
knitr::include_graphics("alpha4.png")
This particular design is a spherical design with all experimental points equal distance from the design center. The experimental matrix of the central composite design consist of the following. Consider the figure below.
knitr::include_graphics("alpha5.png")
Let us now go back to our last design and argument it with axial point. We have to test the factorial design with central point and the relationship between the codes and the natural variables.
knitr::include_graphics("alpha6.png")
We can use this relationship to calculate the axial values of time and temperature. The axial points for time and temperatures are as shown below.
knitr::include_graphics("alpha7.png")
Before analyzing the response, let us visualize the treatment distribution with x1=time and x2= temperature. We can have a graphical visualiuzation of the experimental points.
knitr::include_graphics("alpha8.png")
However, this plot is not necessary, but is always a good to check the experimental points distribution. Lets us now analyze the data.
Design and analysis of experiments / Douglas C. Montgomery. — Eighth edition. ISBN 978-1-118-14692-7
(run this code to build the DoEOpt05 data set before following the analysis)
DoEOpt06 <- data.frame(x1 = c(-1, -1, 1, 1, 0, 0, 0, 0, 0, 1.414, -1.414, 0, 0),
x2 = c(-1, 1, -1, 1, 0, 0, 0, 0, 0, 0, 0, 1.414, -1.414),
Time = c(80, 80, 90, 90, 85, 85, 85, 85, 85, 92.07, 77.93, 85, 85),
Temp = c(170, 180, 170, 180, 175, 175, 175, 175, 175, 175, 175, 182.07, 167.93),
Y = c(76.5, 77, 78, 79.5, 79.9, 80.3, 80, 79.7, 79.8, 78.4, 75.6, 78.5, 77)
)
head(DoEOpt06,5)
x1 x2 Time Temp Y
1 -1 -1 80 170 76.5
2 -1 1 80 180 77.0
3 1 -1 90 170 78.0
4 1 1 90 180 79.5
5 0 0 85 175 79.9
tail(DoEOpt06)
x1 x2 Time Temp Y
8 0.000 0.000 85.00 175.00 79.7
9 0.000 0.000 85.00 175.00 79.8
10 1.414 0.000 92.07 175.00 78.4
11 -1.414 0.000 77.93 175.00 75.6
12 0.000 1.414 85.00 182.07 78.5
13 0.000 -1.414 85.00 167.93 77.0
library(rsm)
str(DoEOpt06)
'data.frame': 13 obs. of 5 variables:
$ x1 : num -1 -1 1 1 0 ...
$ x2 : num -1 1 -1 1 0 0 0 0 0 0 ...
$ Time: num 80 80 90 90 85 ...
$ Temp: num 170 180 170 180 175 175 175 175 175 175 ...
$ Y : num 76.5 77 78 79.5 79.9 80.3 80 79.7 79.8 78.4 ...
DoEOpt06 <- as.coded.data(DoEOpt06,
x1 ~ (Time-85)/5,
x2 ~ (Temp-175)/5)
knitr::include_graphics("model.png")
model_Y <- rsm(Y ~ SO(x1,x2), data = DoEOpt06)
summary(model_Y)
Call:
rsm(formula = Y ~ SO(x1, x2), data = DoEOpt06)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 79.939955 0.119089 671.2644 < 2.2e-16 ***
x1 0.995050 0.094155 10.5682 1.484e-05 ***
x2 0.515203 0.094155 5.4719 0.000934 ***
x1:x2 0.250000 0.133145 1.8777 0.102519
x1^2 -1.376449 0.100984 -13.6303 2.693e-06 ***
x2^2 -1.001336 0.100984 -9.9158 2.262e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Multiple R-squared: 0.9827, Adjusted R-squared: 0.9704
F-statistic: 79.67 on 5 and 7 DF, p-value: 5.147e-06
Analysis of Variance Table
Response: Y
Df Sum Sq Mean Sq F value Pr(>F)
FO(x1, x2) 2 10.0430 5.0215 70.8143 2.267e-05
TWI(x1, x2) 1 0.2500 0.2500 3.5256 0.1025
PQ(x1, x2) 2 17.9537 8.9769 126.5944 3.194e-06
Residuals 7 0.4964 0.0709
Lack of fit 3 0.2844 0.0948 1.7885 0.2886
Pure error 4 0.2120 0.0530
Stationary point of response surface:
x1 x2
0.3892304 0.3058466
Stationary point in original units:
Time Temp
86.94615 176.52923
Eigenanalysis:
eigen() decomposition
$values
[1] -0.9634986 -1.4142867
$vectors
[,1] [,2]
x1 -0.2897174 -0.9571122
x2 -0.9571122 0.2897174
The results shows that both linear terms are significant. Besides, all the quadratic terms are significant. This second order model has a higher R-sqare and a very low p-value, pretty good. Additionally, lack of fit was not significant.
knitr::include_graphics("model2.png")
model_Y <- rsm(Y ~ FO(x1,x2) + TWI(x1,x2) + PQ(x1,x2), data = DoEOpt06)
summary(model_Y)
Call:
rsm(formula = Y ~ FO(x1, x2) + TWI(x1, x2) + PQ(x1, x2), data = DoEOpt06)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 79.939955 0.119089 671.2644 < 2.2e-16 ***
x1 0.995050 0.094155 10.5682 1.484e-05 ***
x2 0.515203 0.094155 5.4719 0.000934 ***
x1:x2 0.250000 0.133145 1.8777 0.102519
x1^2 -1.376449 0.100984 -13.6303 2.693e-06 ***
x2^2 -1.001336 0.100984 -9.9158 2.262e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Multiple R-squared: 0.9827, Adjusted R-squared: 0.9704
F-statistic: 79.67 on 5 and 7 DF, p-value: 5.147e-06
Analysis of Variance Table
Response: Y
Df Sum Sq Mean Sq F value Pr(>F)
FO(x1, x2) 2 10.0430 5.0215 70.8143 2.267e-05
TWI(x1, x2) 1 0.2500 0.2500 3.5256 0.1025
PQ(x1, x2) 2 17.9537 8.9769 126.5944 3.194e-06
Residuals 7 0.4964 0.0709
Lack of fit 3 0.2844 0.0948 1.7885 0.2886
Pure error 4 0.2120 0.0530
Stationary point of response surface:
x1 x2
0.3892304 0.3058466
Stationary point in original units:
Time Temp
86.94615 176.52923
Eigenanalysis:
eigen() decomposition
$values
[1] -0.9634986 -1.4142867
$vectors
[,1] [,2]
x1 -0.2897174 -0.9571122
x2 -0.9571122 0.2897174
knitr::include_graphics("model2.png")
Let us polish our model by removing the interaction term because it was significant. We do that by removing the TWI from the model.
model_Y <- rsm(Y ~ FO(x1,x2) + PQ(x1,x2), data = DoEOpt06)
summary(model_Y)
Call:
rsm(formula = Y ~ FO(x1, x2) + PQ(x1, x2), data = DoEOpt06)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 79.93995 0.13660 585.2155 < 2.2e-16 ***
x1 0.99505 0.10800 9.2135 1.559e-05 ***
x2 0.51520 0.10800 4.7704 0.001408 **
x1^2 -1.37645 0.11583 -11.8831 2.310e-06 ***
x2^2 -1.00134 0.11583 -8.6447 2.489e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Multiple R-squared: 0.974, Adjusted R-squared: 0.961
F-statistic: 75.02 on 4 and 8 DF, p-value: 2.226e-06
Analysis of Variance Table
Response: Y
Df Sum Sq Mean Sq F value Pr(>F)
FO(x1, x2) 2 10.0430 5.0215 53.8227 2.290e-05
PQ(x1, x2) 2 17.9537 8.9769 96.2186 2.538e-06
Residuals 8 0.7464 0.0933
Lack of fit 4 0.5344 0.1336 2.5206 0.1962
Pure error 4 0.2120 0.0530
Stationary point of response surface:
x1 x2
0.3614555 0.2572577
Stationary point in original units:
Time Temp
86.80728 176.28629
Eigenanalysis:
eigen() decomposition
$values
[1] -1.001336 -1.376449
$vectors
[,1] [,2]
x1 0 -1
x2 -1 0
In the new model, all the regression coefficients are significant. The model still has a higher R-square and a very lower p-value than the one from the previous model. Pretty good. Besides, lack of fit was not significant.
knitr::include_graphics("model3.png")
Lets us now focus on the visualization of the results. For the results visualization, we are going to build both the contour plot and the perspective for the yields and time (x1) and temperature (x2). Consider the contour plot below ### contour and perspective plots
par(mfrow = c(1, 1))
contour(model_Y, x1~x2, image = TRUE,
xlabs=c("Time (min)", "Temperature (ºC)"))
persp(model_Y, x1~x2, col = terrain.colors(50), contours = "colors",
zlab = "Yield (%)",
xlabs=c("Time (min)", "Temperature (ºC)"))
The contour and persp curve above are quite different from the ones we have analyzing from the previous linear models and interactions. Quadratic model can have have both maximum and minimum set of points. In this case, the plots shows that the reaction presents a set of condition, a specific combination of time and temperature that results in the maximum yields. The precise determination of the maximum point use the response surface equation. The maximum condition if it exists.
knitr::include_graphics("response2.png")
knitr::include_graphics("response.png")
The maximum condition if it exits, will be the set of x1 and x2 for which the partial derivative of delta y/ delta x1 = delta y/x2 = 0. Consider the equation if the figure below. This is called the stationary point.
knitr::include_graphics("condition.png")
Fortunately, with the new method, we do not need to derive the equation ourselves but R does for us. We can see the stationary point coordinates in both in coded and natural variables in the RSM output. In this case. The maximum yields occurs at 86.6 minutes and 176.6 degrees Celsius. Consider the figure below.
knitr::include_graphics("maxyield.png")
This experimental design has led us to the optimum condition for the maximum yield. The last question we should be asking ourselves is, “how much is the maximum predicted yield?” We can use the predicted function calculated. First, we are going to create a data frame called max with the coded coordinates of the states of stationary point and then use the predict function to calculate the response of the model for the max data frame. Let us run it. Consider the codes below
max <- data.frame(x1 = 0.361, x2 = 0.257)
predict(model_Y, max)
1
80.18606
From the results above, the predicted yield is 80.2%. Now, one interesting follow up is to run the experiment using the predicted optimized condition to verify the model’s prediction.