Tutorial 3: Plant Growth Data

Hide all messages and warnings for the RMarkdown file.

knitr::opts_chunk$set(message = FALSE, warning = FALSE)

Describe and Summarise the Plant Growth Dataset

Load the dataset.

The data is available directly in RStudio.

{data(PlantGrowth)}

PlantGrowth is results from an experiment to compare yields (as measured by dried weight of plants) obtained under a control(ctrl) and two different treatment conditions (trt1 and trt2).

Load required packages to analyse the data.

library(dplyr)
library(tidyverse)
library(ggplot2)

Summary of Plant Growth

Each treatment group has 10 data points, and the summary is calculated using the weights of plants from all 3 treatment groups.

summary(PlantGrowth)

##      weight       group   
##  Min.   :3.590   ctrl:10  
##  1st Qu.:4.550   trt1:10  
##  Median :5.155   trt2:10  
##  Mean   :5.073            
##  3rd Qu.:5.530            
##  Max.   :6.310

Description of Plant Growth

The ‘head’ syntax shows the first 6 rows of data from PlantGrowth. This allows us to see that the data is formatted in 2 columns titled ‘weight’ and ‘group’.

head(PlantGrowth)

##   weight group
## 1   4.17  ctrl
## 2   5.58  ctrl
## 3   5.18  ctrl
## 4   6.11  ctrl
## 5   4.50  ctrl
## 6   4.61  ctrl

str shows us the structure of the data. It is in a 30x2 dataframe with ‘Weight’ as numerical data and ‘Group’ as Factor with 3 categories.

str(PlantGrowth)

## 'data.frame':    30 obs. of  2 variables:
##  $ weight: num  4.17 5.58 5.18 6.11 4.5 4.61 5.17 4.53 5.33 5.14 ...
##  $ group : Factor w/ 3 levels "ctrl","trt1",..: 1 1 1 1 1 1 1 1 1 1 ...

QQplot of PlantGrowth to check the normality of the data. It seems like it is mostly normal with the points following the reference line of a normal distribution.

qqnorm(PlantGrowth$weight, main="QQ plot of Plant Growth")
qqline(PlantGrowth$weight, col="red")

Do different treatments affect plant weight?

A graph is plotted using ggplot to compare the three different treatments.

PlantGrowth %>% ggplot() +
                # Set aesthetics (x, y, fill)
                geom_histogram(aes(x = weight,
                                   fill = group),
                               alpha = 0.5, bins = 15, color = "black") +
                # Split the graph by species
                facet_grid(rows = vars(group)) +
                # Format axis labels
                labs(x = "Weight (g)", y = "Count",
                     title = "Weight of Plants by Treatment Group") +
                # Add colours manually
                scale_fill_manual(values = c("darkorange",
                                             "purple",
                                             "cyan4")) +
                # Set plot theme
                theme_minimal() +
                # Remove legend
                theme(legend.position = "none")

From the histogram, we can see the spread of data across the different weights. The control has an even spread throughout the different weights. Treatment 1 mostly has lower plant weights with the data on the left and two outliers, and treatment 2 has much higher plant weights with the data on the right of the graph.

PlantGrowth %>% ggplot() +
                geom_boxplot(aes(x = group, y = weight,
                               colour = group),
                           size = 1, outlier.color = "red") +
                labs(x = "Treatment Group", y = "Weight (g)",
                     title = "Boxplot of Plants by Treatment Group") +
                scale_colour_manual(values = c("darkorange",
                                             "purple",
                                             "cyan4")) +
                theme_minimal() +
                theme(legend.position = "none")

The boxplot also allows us to confirm the outliers; there are 2 outliers for treatment group 1, and the box for treatment 2 is much higher compared to treatment 1 and the control. Hence, different treatments does affect plant weight.

Which treatment has the highest average growth?

Create a dataframe with mean and confidence intervals

The mean, sum of squared deviations (ssd), standard error (se), lower and upper confidence intervals are calculated per treatment type and placed into a dataframe in order to be retrieved later on.

# Using dplyr functions
plantgrowth_summary <- 
  PlantGrowth %>% 
     group_by(group) %>%
     summarise(count = n(),
               mean = mean(weight, na.rm = TRUE),
               ssd = sd(weight, na.rm = TRUE)) %>%
     mutate(se = ssd / sqrt(count),
            lower_ci = mean - qt(1 - (0.05 / 2), count - 1) * se,
            upper_ci = mean + qt(1 - (0.05 / 2), count - 1) * se)

# Nice table output
knitr::kable(plantgrowth_summary)

group	count	mean	ssd	se	lower_ci	upper_ci
ctrl	10	5.032	0.5830914	0.1843897	4.614882	5.449118
trt1	10	4.661	0.7936757	0.2509823	4.093239	5.228761
trt2	10	5.526	0.4425733	0.1399540	5.209402	5.842598

Plot the graph

The summary dataframe created above is now used to find out which treatment has the highest average growth and plotted into a graph for visual comparison.

plantgrowth_summary %>% ggplot() +
                geom_point(aes(x = group, y = mean,
                               colour = group),
                           size = 3) +
                geom_errorbar(aes(x = group,
                                  ymin = lower_ci,
                                  ymax = upper_ci,
                                  colour = group),
                              width = 0.2, linewidth = 1) +
                labs(x = "Treatment Group", y = "Weight (g)",
                     title = "Mean Weight of Plants by Treatment Group") +
                scale_colour_manual(values = c("darkorange",
                                             "purple",
                                             "cyan4")) +
                theme_minimal() +
                theme(legend.position = "none")

At a glance, treatment 2 has the highest average growth, followed by the control and finally treatment 1. The lines extending from the data point are the confidence intervals, whereby there is a 95% confidence level that the true population mean weight falls within the range shown.