Reading time: 12 minutes
The main purpose of this project is to demonstrate my experimentation and analytical skills in R and SQL using a dataset that I collected from a maize experiment during my university program from the University of Queensland in 2015. This personal project improves and reforges the original project with more data cleaning, data manipulation, better visualisations, and statistical tests.
The maize study had a CRD system (completely randomised design) and was carried out in the glasshouse of the university. There were 6 treatments, and 9 replicates for each treatment. Each treatment has soil nutrient content different from each other. Plant weight was the responding variable. All maize crops were harvested, oven dried, and weighted during 3 harvest periods. 2 visualisations were synthesised to observe overall trend, Q-Q plot and Shapiro-Wilk test were used to test for residual normality, Levene’s test was applied to test for group variances, Kruskal-wallis and Dunn’s test were selected as the omnibus and post-hoc test.
Result reveals that nitrogen and phosphorus are the most limiting nutrients among the treatments in the experiment. Maize plants in soils that do not have nitrogen and phosphorus had the significantly lower growth compared to other treatments. Removing potassium and sulphur affects plant growth as well but the growth was still significantly higher than nitrogen and phosphorus limited soil. However, the growth of plants in soil that has potassium and sulphur removed was still significantly lower than the soils that have the balanced nutrient.
Highlights
R packages loaded in this project include tidyverse packages (ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr, and forcats), skimr, kabbleExtra, ggrepel, qqplotr, DescTools, and dunn.test.
library(tidyverse)
library(skimr)
library(kableExtra)
library(ggrepel)
library(qqplotr)
library(DescTools)
library(dunn.test)
Plants require a variety of nutrients for healthy growth and yield. The goal of this project is to investigate which of the 6 plant-essential soil nutrients affects plant growth the most. This experiment was actually mine, and has already been carried out in a glasshouse in the University of Queensland (UQ) Gatton campus in 2015, I am revisiting it and reforging it with relevant skills in R, SQL, and statistical analysis.
It was a potting experiment, maize was planted in various pots that have different nutrient contents. The glasshouse has a controlled, homogeneous environment, it was designed and built for experimentation purposes. There were 54 pots allocated for the experiment. There were 6 treatments in the experiment. Each treatment had 9 replicates. In the experiment, the position of each pot was completely randomised in the allocated bench in the glasshouse. It was a 75-days experiment, randomisation was carried out at least 3 times each week during watering.
Each of the treatments are:
Treatment <- c("T1",
"T2",
"T3",
"T4",
"T5",
"T6")
Description <- c("Optimal nutrient content",
"Optimal nutrient content x 2",
"nitrogen deficiency (-N) ",
"phosphorus deficiency (-P)",
"potassium deficiency (-K)",
"sulphur deficiency (-S)")
Detail <- c("Nutrients added are balanced",
"Doubling the nutrient content in T1",
"N is important for cell devision, photosynthesis, and act as a building block of amino acids for plants (William and Mattson (1980).)",
"P is important For plant photosynthesis, root formation, and growth (Hameeda et al. 2006)",
"K is an important element that improves nutrient intake of plants and to enhance disease resistance (Wu et al .2004)",
"S helps to develop essential enzymes and vitamins for plants")
data.frame(Treatment, Description, Detail) %>%
kbl() %>%
kable_classic("hover", "border")
| Treatment | Description | Detail |
|---|---|---|
| T1 | Optimal nutrient content | Nutrients added are balanced |
| T2 | Optimal nutrient content x 2 | Doubling the nutrient content in T1 |
| T3 | nitrogen deficiency (-N) | N is important for cell devision, photosynthesis, and act as a building block of amino acids for plants (William and Mattson (1980).) |
| T4 | phosphorus deficiency (-P) | P is important For plant photosynthesis, root formation, and growth (Hameeda et al. 2006) |
| T5 | potassium deficiency (-K) | K is an important element that improves nutrient intake of plants and to enhance disease resistance (Wu et al .2004) |
| T6 | sulphur deficiency (-S) | S helps to develop essential enzymes and vitamins for plants |
There were 5 maize plants planted in each pot. All plants were seeded in the same time. First harvest was 30 days after the day of seeding, second harvest happened 21 days after the first harvest, and the third harvest happened 21 days after the second harvest. Plants were weighted and recorded during the 3 harvest days. During each time of harvest, harvested plants were oven dried at 60°C for 48 hours and the dry weights were recorded.
It was a CRD system, controlling environmental variation is very crucial to the experiment result. Environmental noises had been taken care of by having the experiment carried out in the glasshouse that has the common environmental aspects controlled as well as under the cares of mine and my team. Maximum care had been given to each of this uncontrollable factors to reduces these extraneous noises, for example, the type of instruments used and observers’ fatigue. The environmental error should has been minimized to only uncontrollable one, include inherent generic variances between individual plant.
The maize dataset has been uploaded to BigQuery by me. BigQuery is an online database that store and allow users to work with stored datasets with SQL programming language.
In this section, I search the maize dataset on the BigQuery, clean it (though I can also complete the same cleaning in R), download it, and upload it onto R. If you are unfamiliar with BigQuery, you can just follow my description.
Following is the BigQuery database showing the maize dataset.
The maize dataset has columns “Harvest”, “Treatments”, “rep”, “Dwt_g”, “string_field_4”, and “int64_field_5”. Several cleaning tasks I have identified, include:
From <- c("T1", "T2", "T3", "T4", "T5", "T6")
To <- c("C (recall: balanced nutrients added)",
"2C (recall: double the nutrients added to C)",
"N_removed",
"O_removed",
"K_removed",
"S_removed")
data.frame(From, To) %>%
kbl() %>%
kable_styling(bootstrap_options = c("hover", "bordered", "stripped"))
| From | To |
|---|---|
| T1 | C (recall: balanced nutrients added) |
| T2 | 2C (recall: double the nutrients added to C) |
| T3 | N_removed |
| T4 | O_removed |
| T5 | K_removed |
| T6 | S_removed |
Following SQL code complete all the cleaning tasks at once.
After cleaning, the dataset is downloaded to my relevant local file.
Following code uploaded the SQL-cleaned maize dataset onto R (click the right button). Following table is the result of successful data import.
maize <- read_csv("maize_SQL_cleaned.csv")
maize
The dataset has 162 rows of data and 4 columns of variables. There are 2 character variables, the “harv” (harvest) and the “trt” (treatment), and 2 numerical variables, the “rep” (replication) and “Dwt_g” (dry weight, gram).
skim_without_charts(maize)
| Name | maize |
| Number of rows | 162 |
| Number of columns | 4 |
| _______________________ | |
| Column type frequency: | |
| character | 2 |
| numeric | 2 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| harv | 0 | 1 | 5 | 6 | 0 | 3 | 0 |
| trt | 0 | 1 | 1 | 9 | 0 | 6 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
|---|---|---|---|---|---|---|---|---|---|
| rep | 0 | 1 | 5.00 | 2.59 | 1.0 | 3.0 | 5.0 | 7.00 | 9 |
| Dwt_g | 0 | 1 | 5.17 | 10.22 | 0.1 | 0.2 | 0.6 | 5.35 | 69 |
Insights:
The dataset has been cleaned, however, 2 of the character variables need to be changed to factor type because they are categorical data that can be used to categorise and grouping the data during analysis.
It is an important feature for data analysis using programming language. Additionally, change the type from character to factor can help R to process the data faster, though it is a small dataset to observe the improved processing speed.
Following code complete the conversion.
maize <- maize %>%
mutate_if(is_character, factor)
Glimpse the results:
glimpse(maize)
## Rows: 162
## Columns: 4
## $ harv <fct> first, first, first, first, first, first, first, first, first, f~
## $ trt <fct> C, C, C, C, C, C, C, C, C, C2, C2, C2, C2, C2, C2, C2, C2, C2, N~
## $ rep <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 2, 3, 4~
## $ Dwt_g <dbl> 0.2, 0.1, 0.2, 0.1, 0.3, 0.1, 0.4, 0.5, 0.1, 0.2, 0.1, 0.1, 0.5,~
The variables “harv” and “trt” have been successfully converted from character (chr) to factors (fct).
The “dlb” of variables “rep” and “Dwt_g” stands for “double”, which is a R data type used to label numerical data that is either integer or having decimal places. So, the label of “dbl” of both of the numerical variables is correct.
Summary of the maize dataset shows a quick snapshot of the dataset. The variable “harv” has 54 samples size in each of its level, “trt” has 27 sample size in each of its level, and the dry weight gain (“Dwt_g”) of plants can be ranged from a minimum of 0.1 g to 69 g, with a overall mean of 5.167 g, and a very low overall median of merely 0.6 g. The 0.1 g should belong to the data in the first harvest, and the maximum of 69 g should be in the data recorded during final harvest.
summary(maize)
## harv trt rep Dwt_g
## first :54 C :27 Min. :1 Min. : 0.100
## second:54 C2 :27 1st Qu.:3 1st Qu.: 0.200
## third :54 K_removed:27 Median :5 Median : 0.600
## N_removed:27 Mean :5 Mean : 5.167
## P_removed:27 3rd Qu.:7 3rd Qu.: 5.350
## S_removed:27 Max. :9 Max. :69.000
Figure 1 shows that:
ggplot(maize, aes(x = harv, y = Dwt_g, fill = trt, colour = trt)) +
geom_boxplot(alpha = 0.5, outlier.shape = NA) +
stat_boxplot(geom = "errorbar") +
geom_jitter(position = position_jitterdodge()) +
stat_summary(fun.y = mean, geom = "point", size = 4, shape = 4, position = position_jitterdodge(0), color = "blue") +
theme_bw() +
labs(x = "Harvest",
y = "Plant Dry Weight, g",
title = "Figure 1: Plant Weight gained over 3 harvest periods",
subtitle = "Interval between harvest: 21 days"
)
Figure 2 shows insight that plant growth can be heavily limited by the deficiency of nutrient N (Nitrogen), P (Phosphorus), and K (Potassium).
df2 <- maize %>%
group_by(harv, trt) %>%
summarise(average = mean(Dwt_g))
ggplot(df2, aes(x = harv, y = average, group = trt, colour = trt)) +
geom_path(size = 1.5) +
geom_point(size = 3) +
labs(x = "Harvest",
y = "Average Weight Gained, g",
title = "Figure 2: Plant weight gain of different treatments over 3 harvest periods") +
geom_label_repel(data = subset(df2, harv == "third"),
aes(label = trt),
hjust = -0.1) +
theme_bw() +
theme(legend.position = "none")
During the first harvest, the plants were 30 days old. Lets see are there differences between treatments.
Assumptions testing
# set up df
df7.1 <- maize %>%
filter(harv == "first")
model1 <- aov(Dwt_g ~ trt, data = df7.1)
df7.1$resid <- resid(model1)
ggplot(df7.1, aes(sample = resid)) +
stat_qq_band() +
stat_qq_line() +
stat_qq_point() +
ggtitle("Q-Q plot of residues of First Harvest")
shapiro.test(model1$residuals)
##
## Shapiro-Wilk normality test
##
## data: model1$residuals
## W = 0.80213, p-value = 4.577e-07
LeveneTest(Dwt_g ~ trt, data = df7.1)
Omnibus test
kruskal.test(df7.1$Dwt_g, df7.1$trt)
##
## Kruskal-Wallis rank sum test
##
## data: df7.1$Dwt_g and df7.1$trt
## Kruskal-Wallis chi-squared = 8.9877, df = 5, p-value = 0.1096
Summary
Although the statistical result shows that there is no statistical difference between each treatment, however a boxplot shows there there is insignificant higher plant weights in C, C2 and S_removed. Plants in these treatments seem to start growing differently and better off from the other 3 groups.
ggplot(df7.1, aes(x = trt, y = Dwt_g)) +
geom_boxplot() +
labs(title = "First Harvest Dry Weight",
x = "treatment",
y = "Dry weight, g") +
theme(plot.title = element_text(face = "bold"))
During the second harvest, the plants were 51 days old.
Assumptions testing
df7.2 <- maize %>%
filter(harv == "second")
model2 <- aov(Dwt_g ~ trt, data = df7.2)
df7.2$resid <- resid(model2)
ggplot(df7.2, aes(sample = resid)) +
stat_qq_band() +
stat_qq_line() +
stat_qq_point() +
ggtitle("Q-Q plot of residues of Second Harvest")
shapiro.test(model2$residual)
##
## Shapiro-Wilk normality test
##
## data: model2$residual
## W = 0.89546, p-value = 0.0001984
LeveneTest(Dwt_g ~ trt, data = df7.2)
Omnibus test
kruskal.test(df7.2$Dwt_g, df7.2$trt)
##
## Kruskal-Wallis rank sum test
##
## data: df7.2$Dwt_g and df7.2$trt
## Kruskal-Wallis chi-squared = 30.246, df = 5, p-value = 1.319e-05
Post-hoc test
Since the omnibus test is saying there is significant difference between treatment grous, Dunn’s test is selected based on the results of assumptions test as the appropriate post-hoc analysis method.
DunnTest(df7.2$Dwt_g ~ df7.2$trt)
##
## Dunn's test of multiple comparisons using rank sums : holm
##
## mean.rank.diff pval
## C2-C -1.444444 1.00000
## K_removed-C -10.333333 1.00000
## N_removed-C -31.944444 0.00024 ***
## P_removed-C -25.055556 0.00922 **
## S_removed-C -9.222222 1.00000
## K_removed-C2 -8.888889 1.00000
## N_removed-C2 -30.500000 0.00053 ***
## P_removed-C2 -23.611111 0.01703 *
## S_removed-C2 -7.777778 1.00000
## N_removed-K_removed -21.611111 0.03496 *
## P_removed-K_removed -14.722222 0.37319
## S_removed-K_removed 1.111111 1.00000
## P_removed-N_removed 6.888889 1.00000
## S_removed-N_removed 22.722222 0.02350 *
## S_removed-P_removed 15.833333 0.29147
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Insights from the statistical test show that plant weight gained in:
The differences can be easily visualized in following graph. Noted: The letters show statistical differences were manually synthesized by hand using the above result. It is a common practice in statistics world, and the letters that overlap each other alphabetically show no statistical difference between each other or otherwise.
df7.2 <- df7.2 %>%
mutate(stat.diff = fct_collapse(trt,
"A" = c("C", "C2"),
"AB" = c("S_removed", "K_removed"),
"C" = "N_removed",
"BC" = "P_removed"
))
ggplot(df7.2, aes(x = trt, y = Dwt_g)) +
geom_boxplot() +
labs(title = "Second Harvest Dry Weight",
x = "treatment",
y = "Dry weight, g",
subtitle = "Overlap alphabet = no statistical different") +
theme(legend.position = "none",
plot.title = element_text(face = "bold")) +
geom_text(data = df7.2 %>% group_by(trt) %>% mutate(Dwt_g = max(Dwt_g)),
aes(label = stat.diff, colour = stat.diff),
vjust = -.5,
size = 5) +
scale_y_continuous(lim = c(0, 15))
Treatment C and C2 have plant dry weights that are significantly higher than N_removed and P_removed. Though Treatment C and C2 not significantly higher than K_removed and S_removed treatment, but are visually higher than these two treatment groups.
Statistical differences between treatments had started to appear in second harvest (51 days old) compared to the results in first harvest (30 days old). I am expecting the differences become larger in third harvest.
Assumptions testing
df7.3 <- maize %>%
filter(harv == "third")
model3 <- aov(Dwt_g ~ trt, data = df7.3)
df7.3$resid <- model3$residuals
ggplot(df7.3, aes(sample = resid)) +
stat_qq_band() +
stat_qq_line() +
stat_qq_point() +
ggtitle("Q-Q plot of residues of Third Harvest")
shapiro.test(resid(model3))
##
## Shapiro-Wilk normality test
##
## data: resid(model3)
## W = 0.84854, p-value = 7.244e-06
LeveneTest(df7.3$Dwt_g, df7.3$trt)
Omnibus
kruskal.test(df7.3$Dwt_g ~ df7.3$trt)
##
## Kruskal-Wallis rank sum test
##
## data: df7.3$Dwt_g by df7.3$trt
## Kruskal-Wallis chi-squared = 45.409, df = 5, p-value = 1.198e-08
Post-hoc test
Since the omnibus test is saying there is significant difference between group, Dunn’s test is selected based on the assumptions test results as the appropriate post-hoc analysis method.
DunnTest(df7.3$Dwt_g ~ df7.3$trt)
##
## Dunn's test of multiple comparisons using rank sums : holm
##
## mean.rank.diff pval
## C2-C -3.888889 0.96956
## K_removed-C -20.222222 0.06377 .
## N_removed-C -32.277778 0.00017 ***
## P_removed-C -40.055556 9.8e-07 ***
## S_removed-C -12.555556 0.54204
## K_removed-C2 -16.333333 0.19307
## N_removed-C2 -28.388889 0.00154 **
## P_removed-C2 -36.166667 1.5e-05 ***
## S_removed-C2 -8.666667 0.96956
## N_removed-K_removed -12.055556 0.54204
## P_removed-K_removed -19.833333 0.06720 .
## S_removed-K_removed 7.666667 0.96956
## P_removed-N_removed -7.777778 0.96956
## S_removed-N_removed 19.722222 0.06720 .
## S_removed-P_removed 27.500000 0.00228 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The differences can be easily visualized in following graph. Again, the letters show statistical differences were manually synthesized by hand using the above result. It is a common practice, and the letters that overlap each other alphabetically indicate no statistical difference or otherwise.
df7.3 <- df7.3 %>%
mutate(stat_diff = fct_collapse(trt,
"A" = "C",
"AB" = c("C2", "S_removed"),
"BC" = "K_removed",
"CD" = "N_removed",
"D" = "P_removed"))
ggplot(df7.3, aes(x = trt, y = Dwt_g)) +
geom_boxplot() +
geom_text(data = df7.3 %>% group_by(trt) %>% mutate(Dwt_g = max(Dwt_g)),
aes(label = stat_diff, colour = stat_diff),
vjust = -.5,
size = 5) +
labs(title = "Third Harvest Dry Weight",
x = "treatment",
y = "Dry weight, g",
subtitle = "Overlap alphabet = no statistical different") +
theme(legend.position = "none",
plot.title = element_text(face = "bold")) +
scale_y_continuous(lim = c(0, 80))
Statistical differences between treatments in the third harvest (72 days old) remain the same as second harvest (51 days old).
However, if looking closely, the boxplot median of C has become slightly between than C2, it is not affected by the outlier near 70g. Alphabetically, it has become “A” now as compared the “AB” of C2. Treatment C is moving slightly ahead. Remember, both treatments C and C2 were both “A” during the second harvest. In the third harvest, the growth of maize in N_removed and P_removed pots have become more limiting.
All measured data have been checked for residual normality with Q-Q plot and Shapiro-Wilk test in R. Variances among data were checked by Levene’s test. Data were analysed by appropriate non-parametric omnibus and post-hoc tests using Kruskal-Wallis test and Dunn’s test.
Treatment <- c("T1 - C",
"T2 - C2",
"T3 - K_removed",
"T4 - N_removed",
"T5 - P_removed",
"T6- S_removed")
First_harvest <- c(rep("-", 6))
Second_harvest <- c("A", "A", "AB", "C", "BC", "AB")
Third_harvest <- c("A", "AB", "BC", "CD", "D", "AB")
data.frame(Treatment, First_harvest, Second_harvest, Third_harvest) %>%
kbl(align = "c",
caption = "Letters with the same letter or overlapped are not significantly different (Dunn’s test, P = 0.05). All treatments had no statistical difference in the first harvest.") %>%
kable_classic("hover")
| Treatment | First_harvest | Second_harvest | Third_harvest |
|---|---|---|---|
| T1 - C |
|
A | A |
| T2 - C2 |
|
A | AB |
| T3 - K_removed |
|
AB | BC |
| T4 - N_removed |
|
C | CD |
| T5 - P_removed |
|
BC | D |
| T6- S_removed |
|
AB | AB |
Note that the results from Dunn test is the comparison of all groups with the control groups to test are there significantly differences among treatment groups. The statistical differences between group can be slightly varied from the above results. However, the results have been observed appropriate based on the visualisation of the boxplot.
Results show that Treatment 1 and 2 that have the balanced nutrient and doubled concentration of balanced nutrient added had the highest yield.
Treatment 3 and treatment 6 that have the soil element K (potassium) and S (sulphur) removed ranked the second group in plant growth during the second and third harvest. Base on the data, if longer days of cultivation is allowed, we may start to see statistical differences between these 2 treatments with treatment 1 and treatment 2.
Lastly, maize with N (nitrogen) and P (phosphorus) nutrients removed could not grow, with a statistically proven results. The 2 elements are limiting nutrients and both required by the plants to grow.
Doubling the nutrient content in the treatment 2 do not significantly boost plant growth compared to treatment 1 that has the balanced nutrient content added. Excessive fertilization may lead to undesirable economical and environmental outcomes.
This is a personal project created and designed for skill demonstration and non-commercial use only.
Dave 2015, Female inflorescence, with young silk, viewed 08 July 2021, https://en.wikipedia.org/wiki/Maize#/media/File:Cornsilk_7091.jpg
Hameeda, B, Harini, G, Rupela, O, Wani, S & Reddy, G 2008, ‘Growth promotion of maize by phosphate-solubilizing bacteria isolated from composts and macrofauna’, Microbiological Research, vol. 163, no. 2, pp.234-242.
William, J & Mattson, Jr 1980, ‘Herbivory in relation to plant nitrogen content’, Annual Review of Ecology and Systematics, vol. 11, pp. 119-161.
Wu, S, Cao, Z, Li, Z, Cheung, K & Wong, M 2005, ‘Effects of biofertilizer containing N-fixer, P and K solubilizers and AM fungi on maize growth: a greenhouse trial’, Geoderma, vol. 125, no. 1-2, pp.155-166.