Advanced Data Wrangling

Data Wrangling is one of the key aspects of Data analysis. The tidyverse package is a very useful tool for data pre-processing.

Pivoting

Pivoting is usually needed to re-shape data for ease of data entry, data comparison and data summary.

pivot_longer()

pivot_longer() makes data sets longer than the original data set by increasing the number of rows and decreasing the number of columns.

library(tidyverse)
dat <- read_csv("australia.soybean.csv")

## Rows: 464 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): env, loc, gen
## dbl (7): year, yield, height, lodging, size, protein, oil
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

dat

## # A tibble: 464 × 10
##    env   loc    year gen   yield height lodging  size protein   oil
##    <chr> <chr> <dbl> <chr> <dbl>  <dbl>   <dbl> <dbl>   <dbl> <dbl>
##  1 L70   Lawes  1970 G01    2.39  1.44     4.25  8.45    36.7  20.9
##  2 L70   Lawes  1970 G02    2.28  1.45     4.25  9.95    37.6  20.7
##  3 L70   Lawes  1970 G03    2.57  1.46     3.75 10.8     37.8  21.3
##  4 L70   Lawes  1970 G04    2.88  1.26     3.5  10.0     38.4  22.0
##  5 L70   Lawes  1970 G05    2.39  1.34     3.5  11       37.5  22.1
##  6 L70   Lawes  1970 G06    2.41  1.36     4    11.8     38.2  21.2
##  7 L70   Lawes  1970 G07    2.70  1.3      3    11.8     37.4  21.7
##  8 L70   Lawes  1970 G08    2.46  0.955    3.25 10       35.2  21.1
##  9 L70   Lawes  1970 G09    2.57  1.03     3    11.2     35.9  21.5
## 10 L70   Lawes  1970 G10    2.98  1.16     3.75 10.8     39.7  20.4
## # ℹ 454 more rows

dat_longer <- dat %>% 
  pivot_longer(yield:lodging, names_to = "trait", values_to = "values")
dat_longer

## # A tibble: 1,392 × 9
##    env   loc    year gen    size protein   oil trait   values
##    <chr> <chr> <dbl> <chr> <dbl>   <dbl> <dbl> <chr>    <dbl>
##  1 L70   Lawes  1970 G01    8.45    36.7  20.9 yield     2.39
##  2 L70   Lawes  1970 G01    8.45    36.7  20.9 height    1.44
##  3 L70   Lawes  1970 G01    8.45    36.7  20.9 lodging   4.25
##  4 L70   Lawes  1970 G02    9.95    37.6  20.7 yield     2.28
##  5 L70   Lawes  1970 G02    9.95    37.6  20.7 height    1.45
##  6 L70   Lawes  1970 G02    9.95    37.6  20.7 lodging   4.25
##  7 L70   Lawes  1970 G03   10.8     37.8  21.3 yield     2.57
##  8 L70   Lawes  1970 G03   10.8     37.8  21.3 height    1.46
##  9 L70   Lawes  1970 G03   10.8     37.8  21.3 lodging   3.75
## 10 L70   Lawes  1970 G04   10.0     38.4  22.0 yield     2.88
## # ℹ 1,382 more rows

pivot_wider()

pivot_wider() makes a data set wider by increasing the number of columns and decreasing the number of rows.

data <- read_csv("data/Longer.csv")

## Rows: 696 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): env, loc, gen, trait
## dbl (2): year, values
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

data

## # A tibble: 696 × 6
##    env   loc    year gen   trait   values
##    <chr> <chr> <dbl> <chr> <chr>    <dbl>
##  1 L70   Lawes  1970 G01   yield     2.39
##  2 L70   Lawes  1970 G01   height    1.44
##  3 L70   Lawes  1970 G01   lodging   4.25
##  4 L70   Lawes  1970 G02   yield     2.28
##  5 L70   Lawes  1970 G02   height    1.45
##  6 L70   Lawes  1970 G02   lodging   4.25
##  7 L70   Lawes  1970 G03   yield     2.57
##  8 L70   Lawes  1970 G03   height    1.46
##  9 L70   Lawes  1970 G03   lodging   3.75
## 10 L70   Lawes  1970 G04   yield     2.88
## # ℹ 686 more rows

data.longer <- data %>% 
  pivot_wider(names_from = "trait", values_from = "values")
data.longer

## # A tibble: 232 × 7
##    env   loc    year gen   yield height lodging
##    <chr> <chr> <dbl> <chr> <dbl>  <dbl>   <dbl>
##  1 L70   Lawes  1970 G01    2.39  1.44     4.25
##  2 L70   Lawes  1970 G02    2.28  1.45     4.25
##  3 L70   Lawes  1970 G03    2.57  1.46     3.75
##  4 L70   Lawes  1970 G04    2.88  1.26     3.5 
##  5 L70   Lawes  1970 G05    2.39  1.34     3.5 
##  6 L70   Lawes  1970 G06    2.41  1.36     4   
##  7 L70   Lawes  1970 G07    2.70  1.3      3   
##  8 L70   Lawes  1970 G08    2.46  0.955    3.25
##  9 L70   Lawes  1970 G09    2.57  1.03     3   
## 10 L70   Lawes  1970 G10    2.98  1.16     3.75
## # ℹ 222 more rows

Combining Tables

It is sometimes necessary to join data sets together when they are from different sources but one table is needed for analysis.
The left_join(), right_join(), inner_join() and full_join() functions are used to carry out combining of tables.

Practical Example

library(readxl)
df1 <- read_excel("data/df.combine.xlsx", sheet = "df1")
df2 <- read_excel("data/df.combine.xlsx", sheet= "df2")
df_left <-  left_join(df1,df2)

## Joining with `by = join_by(gen)`

df_left

## # A tibble: 15 × 6
##    gen   yield height  size protein   oil
##    <chr> <dbl>  <dbl> <dbl>   <dbl> <dbl>
##  1 G01    2.79  0.97   8.35    39.4  20.9
##  2 G02    2.62  0.785  9.75    40.6  20.7
##  3 G03    2.48  0.955 11.4     37.4  21.3
##  4 G04    2.92  0.76  11.1     36.9  22.0
##  5 G05    2.14  0.76  11.9     38.7  22.1
##  6 G06    2.86  0.65  12.2     38.1  21.2
##  7 G07    2.97  0.8   11.2     37.4  21.7
##  8 G08    1.65  0.9   11.8     36.6  21.1
##  9 G09    3.1   0.825 12.2     37.1  21.5
## 10 G10    3.00  0.785 12.0     37.8  20.4
## 11 G11    1.90  0.875  7.4     43.4  NA  
## 12 G12    1.64  1.16   7.45    44.4  NA  
## 13 G13    1.97  1.08   8.25    42.2  NA  
## 14 G14    2.45  0.865  7.65    38.4  NA  
## 15 G15    2.63  0.675  9.05    40.7  NA

df_right <- right_join(df1,df2)

## Joining with `by = join_by(gen)`

df_right

## # A tibble: 15 × 6
##    gen   yield height  size protein   oil
##    <chr> <dbl>  <dbl> <dbl>   <dbl> <dbl>
##  1 G01    2.79  0.97   8.35    39.4  20.9
##  2 G02    2.62  0.785  9.75    40.6  20.7
##  3 G03    2.48  0.955 11.4     37.4  21.3
##  4 G04    2.92  0.76  11.1     36.9  22.0
##  5 G05    2.14  0.76  11.9     38.7  22.1
##  6 G06    2.86  0.65  12.2     38.1  21.2
##  7 G07    2.97  0.8   11.2     37.4  21.7
##  8 G08    1.65  0.9   11.8     36.6  21.1
##  9 G09    3.1   0.825 12.2     37.1  21.5
## 10 G10    3.00  0.785 12.0     37.8  20.4
## 11 G16   NA    NA     NA       NA    19.1
## 12 G17   NA    NA     NA       NA    18.7
## 13 G18   NA    NA     NA       NA    19.2
## 14 G19   NA    NA     NA       NA    20.8
## 15 G20   NA    NA     NA       NA    20.7

df_inner <- inner_join(df1,df2)

## Joining with `by = join_by(gen)`

df_inner

## # A tibble: 10 × 6
##    gen   yield height  size protein   oil
##    <chr> <dbl>  <dbl> <dbl>   <dbl> <dbl>
##  1 G01    2.79  0.97   8.35    39.4  20.9
##  2 G02    2.62  0.785  9.75    40.6  20.7
##  3 G03    2.48  0.955 11.4     37.4  21.3
##  4 G04    2.92  0.76  11.1     36.9  22.0
##  5 G05    2.14  0.76  11.9     38.7  22.1
##  6 G06    2.86  0.65  12.2     38.1  21.2
##  7 G07    2.97  0.8   11.2     37.4  21.7
##  8 G08    1.65  0.9   11.8     36.6  21.1
##  9 G09    3.1   0.825 12.2     37.1  21.5
## 10 G10    3.00  0.785 12.0     37.8  20.4

df_full <- full_join(df1, df2)

## Joining with `by = join_by(gen)`

df_full

## # A tibble: 20 × 6
##    gen   yield height  size protein   oil
##    <chr> <dbl>  <dbl> <dbl>   <dbl> <dbl>
##  1 G01    2.79  0.97   8.35    39.4  20.9
##  2 G02    2.62  0.785  9.75    40.6  20.7
##  3 G03    2.48  0.955 11.4     37.4  21.3
##  4 G04    2.92  0.76  11.1     36.9  22.0
##  5 G05    2.14  0.76  11.9     38.7  22.1
##  6 G06    2.86  0.65  12.2     38.1  21.2
##  7 G07    2.97  0.8   11.2     37.4  21.7
##  8 G08    1.65  0.9   11.8     36.6  21.1
##  9 G09    3.1   0.825 12.2     37.1  21.5
## 10 G10    3.00  0.785 12.0     37.8  20.4
## 11 G11    1.90  0.875  7.4     43.4  NA  
## 12 G12    1.64  1.16   7.45    44.4  NA  
## 13 G13    1.97  1.08   8.25    42.2  NA  
## 14 G14    2.45  0.865  7.65    38.4  NA  
## 15 G15    2.63  0.675  9.05    40.7  NA  
## 16 G16   NA    NA     NA       NA    19.1
## 17 G17   NA    NA     NA       NA    18.7
## 18 G18   NA    NA     NA       NA    19.2
## 19 G19   NA    NA     NA       NA    20.8
## 20 G20   NA    NA     NA       NA    20.7

Creating Groups

In data pre-processing it is sometimes necessary to create new variables needed for analysis.

dat <- dat %>% 
  mutate(
    lodging_grp= case_when(
      lodging >= 1 & lodging <= 1.49 ~ "No lodging",
      lodging >= 1.5 & lodging <= 2.49 ~ "Mild Lodging",
      lodging >= 2.5 & lodging <= 3.49 ~ "Moderate Lodging",
      lodging >= 3.5 ~ "Heavy Lodging"
    )
    )

dat_new <- dat %>% 
  mutate(
    lodging_grp2 = case_when(
      lodging_grp %in% c("Heavy Lodging", "Moderate Lodging") ~"Lodging",
      lodging_grp %in% c("No Lodging", "Mild Lodging")~ "No Lodging",
      TRUE ~ "NA")
  )
dat_new

## # A tibble: 464 × 12
##    env   loc    year gen   yield height lodging  size protein   oil lodging_grp 
##    <chr> <chr> <dbl> <chr> <dbl>  <dbl>   <dbl> <dbl>   <dbl> <dbl> <chr>       
##  1 L70   Lawes  1970 G01    2.39  1.44     4.25  8.45    36.7  20.9 Heavy Lodgi…
##  2 L70   Lawes  1970 G02    2.28  1.45     4.25  9.95    37.6  20.7 Heavy Lodgi…
##  3 L70   Lawes  1970 G03    2.57  1.46     3.75 10.8     37.8  21.3 Heavy Lodgi…
##  4 L70   Lawes  1970 G04    2.88  1.26     3.5  10.0     38.4  22.0 Heavy Lodgi…
##  5 L70   Lawes  1970 G05    2.39  1.34     3.5  11       37.5  22.1 Heavy Lodgi…
##  6 L70   Lawes  1970 G06    2.41  1.36     4    11.8     38.2  21.2 Heavy Lodgi…
##  7 L70   Lawes  1970 G07    2.70  1.3      3    11.8     37.4  21.7 Moderate Lo…
##  8 L70   Lawes  1970 G08    2.46  0.955    3.25 10       35.2  21.1 Moderate Lo…
##  9 L70   Lawes  1970 G09    2.57  1.03     3    11.2     35.9  21.5 Moderate Lo…
## 10 L70   Lawes  1970 G10    2.98  1.16     3.75 10.8     39.7  20.4 Heavy Lodgi…
## # ℹ 454 more rows
## # ℹ 1 more variable: lodging_grp2 <chr>

Combining and Subsetting Strings

Some data sets may require splitting values or combining multiple cells before analysis, unite(), separate() and str_sub() functions are used to create new variables and columns.

BadData <- read_csv("data/BadData.csv")

## Rows: 360 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Gen
## dbl (6): Year, Trait1, Trait2, Trait3, Trait4, Trait5
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

BadData

## # A tibble: 360 × 7
##    Gen       Year Trait1 Trait2 Trait3 Trait4 Trait5
##    <chr>    <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
##  1 IBG10-1   2020     12     12     58     63     85
##  2 IBG10-2   2020      9      9     62     67     96
##  3 IBG10-3   2020     12     12     56     63     96
##  4 IBG11-1   2020     10     10     56     60     95
##  5 IBG11-2   2020      9      9     56     60    102
##  6 IBG11-3   2020     10     10     57     60     97
##  7 IBG12-1   2020     10     10     50     59     84
##  8 IBG12-2   2020     11     11     53     56     84
##  9 IBG12-3   2020     12     12     57     60     92
## 10 IBG126-1  2020     11     10     51     58     84
## # ℹ 350 more rows

BadData %<>% 
  separate(Gen,sep = "-", into = c("LocGen","Rep")) %>% 
  mutate(
    Location= str_sub(LocGen, 1,2),
    Genotype = str_sub(LocGen,3,5)
  ) %>% 
  select(Location,Genotype, Rep,Year, Trait1,Trait2,Trait3,Trait4,Trait5)


BadData %<>% unite(Location, Year, col= "Env", sep = "-") 
BadData

## # A tibble: 360 × 8
##    Env     Genotype Rep   Trait1 Trait2 Trait3 Trait4 Trait5
##    <chr>   <chr>    <chr>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
##  1 IB-2020 G10      1         12     12     58     63     85
##  2 IB-2020 G10      2          9      9     62     67     96
##  3 IB-2020 G10      3         12     12     56     63     96
##  4 IB-2020 G11      1         10     10     56     60     95
##  5 IB-2020 G11      2          9      9     56     60    102
##  6 IB-2020 G11      3         10     10     57     60     97
##  7 IB-2020 G12      1         10     10     50     59     84
##  8 IB-2020 G12      2         11     11     53     56     84
##  9 IB-2020 G12      3         12     12     57     60     92
## 10 IB-2020 G12      1         11     10     51     58     84
## # ℹ 350 more rows

Practical Example

Import Example-02.csv to R and save to an object named example02

example02 <- read_csv("data/Example-02.csv")

## Rows: 464 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): env, loc, gen
## dbl (7): year, yield, height, lodging, size, protein, oil
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

example02

## # A tibble: 464 × 10
##    env   loc    year gen   yield height lodging  size protein   oil
##    <chr> <chr> <dbl> <chr> <dbl>  <dbl>   <dbl> <dbl>   <dbl> <dbl>
##  1 L70   Lawes  1970 G01    2.39  1.44     4.25  8.45    36.7  20.9
##  2 L70   Lawes  1970 G02    2.28  1.45     4.25  9.95    37.6  20.7
##  3 L70   Lawes  1970 G03    2.57  1.46     3.75 10.8     37.8  21.3
##  4 L70   Lawes  1970 G04    2.88  1.26     3.5  10.0     38.4  22.0
##  5 L70   Lawes  1970 G05    2.39  1.34     3.5  11       37.5  22.1
##  6 L70   Lawes  1970 G06    2.41  1.36     4    11.8     38.2  21.2
##  7 L70   Lawes  1970 G07    2.70  1.3      3    11.8     37.4  21.7
##  8 L70   Lawes  1970 G08    2.46  0.955    3.25 10       35.2  21.1
##  9 L70   Lawes  1970 G09    2.57  1.03     3    11.2     35.9  21.5
## 10 L70   Lawes  1970 G10    2.98  1.16     3.75 10.8     39.7  20.4
## # ℹ 454 more rows

Create a new variable called ENV by combining the first three letters in loc and the last two digit in year.

example1 <- example02 %>% 
  mutate(loc3 = str_sub(loc, 1,3),
         year2= str_sub(year,3,4)
  )
example1

## # A tibble: 464 × 12
##    env   loc    year gen   yield height lodging  size protein   oil loc3  year2
##    <chr> <chr> <dbl> <chr> <dbl>  <dbl>   <dbl> <dbl>   <dbl> <dbl> <chr> <chr>
##  1 L70   Lawes  1970 G01    2.39  1.44     4.25  8.45    36.7  20.9 Law   70   
##  2 L70   Lawes  1970 G02    2.28  1.45     4.25  9.95    37.6  20.7 Law   70   
##  3 L70   Lawes  1970 G03    2.57  1.46     3.75 10.8     37.8  21.3 Law   70   
##  4 L70   Lawes  1970 G04    2.88  1.26     3.5  10.0     38.4  22.0 Law   70   
##  5 L70   Lawes  1970 G05    2.39  1.34     3.5  11       37.5  22.1 Law   70   
##  6 L70   Lawes  1970 G06    2.41  1.36     4    11.8     38.2  21.2 Law   70   
##  7 L70   Lawes  1970 G07    2.70  1.3      3    11.8     37.4  21.7 Law   70   
##  8 L70   Lawes  1970 G08    2.46  0.955    3.25 10       35.2  21.1 Law   70   
##  9 L70   Lawes  1970 G09    2.57  1.03     3    11.2     35.9  21.5 Law   70   
## 10 L70   Lawes  1970 G10    2.98  1.16     3.75 10.8     39.7  20.4 Law   70   
## # ℹ 454 more rows

example1 %<>% unite(loc3, year2, col= "ENV", sep = "")
example1

## # A tibble: 464 × 11
##    env   loc    year gen   yield height lodging  size protein   oil ENV  
##    <chr> <chr> <dbl> <chr> <dbl>  <dbl>   <dbl> <dbl>   <dbl> <dbl> <chr>
##  1 L70   Lawes  1970 G01    2.39  1.44     4.25  8.45    36.7  20.9 Law70
##  2 L70   Lawes  1970 G02    2.28  1.45     4.25  9.95    37.6  20.7 Law70
##  3 L70   Lawes  1970 G03    2.57  1.46     3.75 10.8     37.8  21.3 Law70
##  4 L70   Lawes  1970 G04    2.88  1.26     3.5  10.0     38.4  22.0 Law70
##  5 L70   Lawes  1970 G05    2.39  1.34     3.5  11       37.5  22.1 Law70
##  6 L70   Lawes  1970 G06    2.41  1.36     4    11.8     38.2  21.2 Law70
##  7 L70   Lawes  1970 G07    2.70  1.3      3    11.8     37.4  21.7 Law70
##  8 L70   Lawes  1970 G08    2.46  0.955    3.25 10       35.2  21.1 Law70
##  9 L70   Lawes  1970 G09    2.57  1.03     3    11.2     35.9  21.5 Law70
## 10 L70   Lawes  1970 G10    2.98  1.16     3.75 10.8     39.7  20.4 Law70
## # ℹ 454 more rows

Categorize the yield values of at least 3 as “High” and other values as “Low” into a new variable called yield_grp.

example1 %<>% 
  mutate(yield_grp = case_when(
    yield <= 2.99 ~ "Low",
    yield >= 3 ~ "High"
  )
  )
example1

## # A tibble: 464 × 12
##    env   loc    year gen   yield height lodging  size protein   oil ENV  
##    <chr> <chr> <dbl> <chr> <dbl>  <dbl>   <dbl> <dbl>   <dbl> <dbl> <chr>
##  1 L70   Lawes  1970 G01    2.39  1.44     4.25  8.45    36.7  20.9 Law70
##  2 L70   Lawes  1970 G02    2.28  1.45     4.25  9.95    37.6  20.7 Law70
##  3 L70   Lawes  1970 G03    2.57  1.46     3.75 10.8     37.8  21.3 Law70
##  4 L70   Lawes  1970 G04    2.88  1.26     3.5  10.0     38.4  22.0 Law70
##  5 L70   Lawes  1970 G05    2.39  1.34     3.5  11       37.5  22.1 Law70
##  6 L70   Lawes  1970 G06    2.41  1.36     4    11.8     38.2  21.2 Law70
##  7 L70   Lawes  1970 G07    2.70  1.3      3    11.8     37.4  21.7 Law70
##  8 L70   Lawes  1970 G08    2.46  0.955    3.25 10       35.2  21.1 Law70
##  9 L70   Lawes  1970 G09    2.57  1.03     3    11.2     35.9  21.5 Law70
## 10 L70   Lawes  1970 G10    2.98  1.16     3.75 10.8     39.7  20.4 Law70
## # ℹ 454 more rows
## # ℹ 1 more variable: yield_grp <chr>

Select ENV, gen, yield, yield_grp, height, and oil and transform the data set using appropriate variables as traits.

example3 <-  example1 %>% 
dplyr::rename( "Trait1"= "yield", 
        "Trait2"= "yield_grp" ,
         "Trait3"= "height" , 
           "Trait4" ="oil"  ) %>% 
  select(ENV, gen, Trait1, Trait2, Trait3, Trait4,)
  
 example3

## # A tibble: 464 × 6
##    ENV   gen   Trait1 Trait2 Trait3 Trait4
##    <chr> <chr>  <dbl> <chr>   <dbl>  <dbl>
##  1 Law70 G01     2.39 Low     1.44    20.9
##  2 Law70 G02     2.28 Low     1.45    20.7
##  3 Law70 G03     2.57 Low     1.46    21.3
##  4 Law70 G04     2.88 Low     1.26    22.0
##  5 Law70 G05     2.39 Low     1.34    22.1
##  6 Law70 G06     2.41 Low     1.36    21.2
##  7 Law70 G07     2.70 Low     1.3     21.7
##  8 Law70 G08     2.46 Low     0.955   21.1
##  9 Law70 G09     2.57 Low     1.03    21.5
## 10 Law70 G10     2.98 Low     1.16    20.4
## # ℹ 454 more rows

Design of Experiments

Experimental design is a systematic method used to determine the relationship between factors affecting a process and the output of that process. It is the process of preparing research in advance, which is then carried out in controlled fashion, so precision is maximized and specific conclusions can be drawn.

It is important to :

Identify as much sources of variation as possible.
Study and eliminate as much of the natural variation.
Prevent unremoved variation from confusing or biasing the effects being tested.
Detect cause and effect with minimal amount of experimental error.

When designing experiments, the goal is to make the treatment group and control group as homogeneous as possible. The researcher can use techniques such as blocking, proper plot techniques and data analysis to control experimental errors.
Packages used are agricolae, agricolaeplotr and desplot.

Features of an Experimental Design

Variables: dependent(response), independent (treatment factor) and extraneous variables are all present in an experimental design.
Setting: location of the experiment. The type of environment utilized in research is determined by the nature of the experiment.
Control: Except for the independent variable, the researcher must control all relevant variable.

Important Principles of Experimental Designs.

Randomization: This eliminates bias by evenly distributing variability.
Replication: this is achieved using standard error of the sample mean or confidence intervals. It is used to asses variability
Blocking: this is used to control sources of variation in experimental results by making treatment groups homogeneous.

Blocking

Choosing the Right Experiment

There are multiple things to consider before picking the experimental design for analysis of a data set. For example:

An experiment can be one or multiple treatments factor(factorial).
Replicated or unreplicated checks.
Replicated treatments.
Resources available to carry out the experiments.

Randomized Complete Block Design

Randomization in RCBD occurs after the units have been assigned to blocks.

Features of RCBD

Because the experimental units within a block must be homogeneous, it is best suited for a small number of treatments.
The effect of each treatment on the response must be approximately the same from block to block.
It is easy to construct
It is useful in comparing treatment means in the presence of a single extraneous source of variability.

Practical Example

An experimenter has fifteen genotypes to be tested with 3 replicates.

library(tidyverse)
library(agricolae)
library(desplot)
trt <- LETTERS[1:15]
outdesign <- design.rcbd(trt, 3,serie = 2,seed = 547)
book <- outdesign$book
print(outdesign$sketch)

##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14]
## [1,] "J"  "D"  "E"  "K"  "A"  "C"  "B"  "N"  "F"  "I"   "H"   "G"   "O"   "M"  
## [2,] "B"  "G"  "M"  "A"  "E"  "L"  "K"  "O"  "J"  "N"   "C"   "H"   "D"   "I"  
## [3,] "C"  "I"  "O"  "D"  "F"  "H"  "L"  "A"  "E"  "N"   "G"   "B"   "M"   "J"  
##      [,15]
## [1,] "L"  
## [2,] "F"  
## [3,] "K"

Visualization of the plot using desplot

outdesign$bookRowCol <- outdesign$book %>% 
  mutate(Row= block %>% as.integer) %>% 
  group_by(Row) %>% 
  mutate( Col= plots %>% str_sub(2,3) %>% as.integer() )               
#Plot field layout using desplot
desplot(trt ~ Row + Col, flip = TRUE,
        data= outdesign$bookRowCol,
        out1 = block,
        text = trt, cex = 1, shorten = "no",
        main = "Randomized Complete Block Design",
      show.key = FALSE, 
     x = "Blocks" )

Plot Using agricolaeplotr

outdesign %>% agricolaeplotr::plot_rcdb(reverse_y = TRUE, reverse_x = TRUE, factor_name = "trt", y = "block", labels = "trt") + guides(fill= FALSE)

Splitplot Designs

These are multifactor experiments. The field is divided into whole plots and further divided into subplots. Whole plots serves as the experimental unit for a particular treatment, subplots serves as the experimental unit for another treatment.

The levels of all the factors are not randomly assigned and reset after each experiment.
There is loss of precision in the whole plot treatment comparison and statistical complexity.
There is restriction on random assignment.

Example

t1<-c("A","B","C","D")
t2<-c(1,2,3)
Splitplot <-design.split(t1,t2,r=3,serie=2,seed=45,kinds ="Super-Duper")
Split_book<- Splitplot$book

Visualization with desplot

Splitplot$bookRowCol <- Split_book %>% 
  mutate(block = paste0("Block",block),
         Col = plots %>% str_sub(2,3) %>%  as.integer,
         Row = splots %>% as.integer)

#Plot Field layout
desplot( block ~ Col + Row | block, flip = TRUE,
         out1 = Row, out1.gpar = list(col= "black",lty =1),
         out2 = Col, out2.gpar =  list(col = "black",lty = 1),
         text = t2, cex = 1, shorten = "no", col = t1,
         data = Splitplot$bookRowCol,
         main = "Split Plot Design",
         show.key = T, key.cex = 0.5)

Visualization with agricolaeplotr

Splitplot %>% 
  agricolaeplotr::plot_split_rcbd(factor_name_1 = "t1", factor_name_2 = "t2", subplots = TRUE, labels = "t1")

Incomplete Block Designs

Sometimes it is necessary to form small blocks with fewer plots than the number of genotypes due to heterogeneous field conditions or too many genotypes. Examples of these designs are Lattice and Latin Square Design, Augmented and partially replicated designs.

Latin Square Design

Number of treatments must be a perfect square. \(t=k^2\) where t is number of treatment and k is number of units per block.
Number of blocks per replicate is equal to plots per block. \(s=k\) and both the square root of t.

trtL <- letters[1:5] #trt creates a perfect square
Latin <- design.lsd(trtL, serie = 2, 784)
Latin_book <- Latin$book
print(Latin$sketch)

##      [,1] [,2] [,3] [,4] [,5]
## [1,] "a"  "b"  "e"  "d"  "c" 
## [2,] "e"  "a"  "d"  "c"  "b" 
## [3,] "b"  "c"  "a"  "e"  "d" 
## [4,] "c"  "d"  "b"  "a"  "e" 
## [5,] "d"  "e"  "c"  "b"  "a"

Visualization using desplot

# Adding rows and columns
Latin$bookRowCol <- Latin_book %>% 
  mutate(Row= row %>% as.integer,
         Col= col %>% as.integer)
#Plot field layout
desplot(trtL ~ Row + Col, Flip = TRUE,
        out1 = Row, out1.gpar = list(col="black",lwd=3),
        out2 = Col, out2.gpar = list(col="black", lwd=3),
        text = trtL, cex= 1, shorten = "no",
        data =Latin$bookRowCol,
        main = "Latin Square Design",
        show.key = F)

Visualization using agricolaeplotr

Latin %>%  
  agricolaeplotr::plot_latin_square(reverse_x = TRUE, factor_name = "trtL", labels = "trtL") + guides(fill= "none")

Lattice Design

They are special cases of incomplete block designs.

\(N=b*k=r*t\), where N is total number of experimental units, b is total number of blocks in the experiment, k is number of units per block, r is replicate, t is number of treatment.

trtLT <- letters[1:25] #trt has have factors of a perfect square
lattice <- design.lattice(trtLT, 2, serie=2, 784)

## 
## Lattice design,  simple   5 x 5 
## 
## Efficiency factor
## (E ) 0.75 
## 
## <<< Book >>>

lattice_book <- lattice$book
print(lattice$sketch)

## $rep1
##      [,1] [,2] [,3] [,4] [,5]
## [1,] "c"  "o"  "j"  "y"  "i" 
## [2,] "k"  "d"  "v"  "h"  "e" 
## [3,] "x"  "u"  "a"  "f"  "w" 
## [4,] "s"  "p"  "t"  "m"  "n" 
## [5,] "b"  "g"  "q"  "l"  "r" 
## 
## $rep2
##      [,1] [,2] [,3] [,4] [,5]
## [1,] "a"  "j"  "t"  "q"  "v" 
## [2,] "x"  "c"  "s"  "b"  "k" 
## [3,] "w"  "i"  "n"  "r"  "e" 
## [4,] "u"  "o"  "p"  "g"  "d" 
## [5,] "f"  "y"  "m"  "l"  "h"

Visualization using desplot

# Add Rows and columns
lattice$bookRowCol <- lattice_book %>% 
 mutate(r = paste0("Rep", r),
         col = block %>% as.integer,
         Row = plots  %>% str_sub(2,3) %>%  as.integer
        )

# Plot design using desplot
s <- desplot (trt ~ Row + col |r , Flip = TRUE,
        out1 = Row, out1.gpar = list(col= "black", lwd= 1, lty= 1 ),
        out2 = col,  out2.gpar = list(col= "black", lty= 1), 
        text = trt, shorten = "no", cex = 1,
        midpoint = "NULL",
        data =lattice$bookRowCol,
        main = "Lattice Design",
       show.key = F,
       ylab = "Blocks"
       ) 
s

Visualization using agricolaeplotr

lattice %>% 
  agricolaeplotr::plot_lattice_simple(reverse_x = TRUE, factor_name = "trt", labels = "trt", y="block"
                   ) %>% 
   + guides(fill= "none") +facet_wrap(~r)

Augmented Design

This design is best suited for experiments with limited test treatment, limited resources, difficulty in keeping blocks homogeneous, testing genotypes in as many environments which leads to a single replicate of genotypes.

Features of Augmented Designs.

\(N=n+r*c\), where N is number of plots, n is number of entries, c is number of checks.
New or test treatments are not replicated or have fewer replicates than the checks.
Evaluates more genotypes and tests in more environments.
Observations on new genotypes can be adjusted for field heterogeneity.
It is flexible
Significant resources are spent on producing checks
Experimental error has a limited number of degrees of freedom which limits the power to detect changes across treatment.

Practical Example

A plant breeder has 53 genotypes, 3 of which are standard checks to be replicated in each block

trtA <- letters[1:3] #control 
trtAA <- 1:50 #treatment 
aug <- design.dau(trt1 = trtA,trt2 = trtAA, 7, serie = 2, 998)
book.a <- aug$book
by(book.a,book.a[2],function(x) paste(x[,1],"-",as.character(x[,3])))

## block: 1
##  [1] "101 - 44" "102 - 46" "103 - 20" "104 - 8"  "105 - a"  "106 - 14"
##  [7] "107 - b"  "108 - 24" "109 - c"  "110 - 29" "111 - 30"
## ------------------------------------------------------------ 
## block: 2
##  [1] "201 - 23" "202 - 40" "203 - 4"  "204 - 1"  "205 - 39" "206 - 48"
##  [7] "207 - c"  "208 - a"  "209 - 37" "210 - b" 
## ------------------------------------------------------------ 
## block: 3
##  [1] "301 - 34" "302 - c"  "303 - b"  "304 - 5"  "305 - 12" "306 - 9" 
##  [7] "307 - a"  "308 - 35" "309 - 28" "310 - 27"
## ------------------------------------------------------------ 
## block: 4
##  [1] "401 - 47" "402 - a"  "403 - 2"  "404 - c"  "405 - 49" "406 - 16"
##  [7] "407 - 45" "408 - b"  "409 - 43" "410 - 13"
## ------------------------------------------------------------ 
## block: 5
##  [1] "501 - 21" "502 - c"  "503 - 22" "504 - 17" "505 - b"  "506 - 38"
##  [7] "507 - 26" "508 - a"  "509 - 15" "510 - 11"
## ------------------------------------------------------------ 
## block: 6
##  [1] "601 - c"  "602 - 19" "603 - 31" "604 - 10" "605 - b"  "606 - 41"
##  [7] "607 - 33" "608 - a"  "609 - 18" "610 - 7" 
## ------------------------------------------------------------ 
## block: 7
##  [1] "701 - 42" "702 - b"  "703 - c"  "704 - 25" "705 - 32" "706 - 36"
##  [7] "707 - 50" "708 - 6"  "709 - a"  "710 - 3"

Visualization using desplot

augmentedRowCol <- book.a %>% 
  mutate(Col = block %>% as.integer,
         Row = plots %>% str_sub(2,3) %>% as.integer)

#plot field layout using desplot
desplot(trt~ Row + Col, flip = TRUE,
        out1 = block, out1.gpar = list(col="black", lwd= 2, lty =3),
        text= trt, cex = 1, shorten = "no",
        data= augmentedRowCol,
        main = "Augmented Design",
        show.key = F
       )

Visualization using agricolaeplotr

aug %>% 
  agricolaeplotr::plot_dau(y = "block", factor_name = "trt", labels = "trt") %>% 
+ guides(fill= "none") + labs(title= "Augmented Design")

Alpha designs

Suitable for large numbers of genotypes.
Random allocation of treatments within a block.
Not all treatments appear in a block
All pairs of treatments appear equally in a trial
\(t= s*k\), where t is number of treatments in each replicate, s is number of blocks in each replicate and k is number of units per block.

Pratical Example

trtA <- 1:102
tl <- length(trtA)
k <- 6
s <- tl/k
r <- 3
AlphaDesign <- design.alpha(trtA, k, r, serie = 2, 998)

## 
## Alpha Design (0,1) - Serie  II 
## 
## Parameters Alpha Design
## =======================
## Treatmeans : 102
## Block size : 6
## Blocks     : 17
## Replication: 3 
## 
## Efficiency factor
## (E ) 0.808 
## 
## <<< Book >>>

Alpha_book <- AlphaDesign$book
plots <- Alpha_book[,1]
dim(plots) <- c(k,s,r)
for (i in 1:r) {print(t(plots[,,i]))
}

##       [,1] [,2] [,3] [,4] [,5] [,6]
##  [1,]  101  102  103  104  105  106
##  [2,]  107  108  109  110  111  112
##  [3,]  113  114  115  116  117  118
##  [4,]  119  120  121  122  123  124
##  [5,]  125  126  127  128  129  130
##  [6,]  131  132  133  134  135  136
##  [7,]  137  138  139  140  141  142
##  [8,]  143  144  145  146  147  148
##  [9,]  149  150  151  152  153  154
## [10,]  155  156  157  158  159  160
## [11,]  161  162  163  164  165  166
## [12,]  167  168  169  170  171  172
## [13,]  173  174  175  176  177  178
## [14,]  179  180  181  182  183  184
## [15,]  185  186  187  188  189  190
## [16,]  191  192  193  194  195  196
## [17,]  197  198  199  200  201  202
##       [,1] [,2] [,3] [,4] [,5] [,6]
##  [1,]  201  202  203  204  205  206
##  [2,]  207  208  209  210  211  212
##  [3,]  213  214  215  216  217  218
##  [4,]  219  220  221  222  223  224
##  [5,]  225  226  227  228  229  230
##  [6,]  231  232  233  234  235  236
##  [7,]  237  238  239  240  241  242
##  [8,]  243  244  245  246  247  248
##  [9,]  249  250  251  252  253  254
## [10,]  255  256  257  258  259  260
## [11,]  261  262  263  264  265  266
## [12,]  267  268  269  270  271  272
## [13,]  273  274  275  276  277  278
## [14,]  279  280  281  282  283  284
## [15,]  285  286  287  288  289  290
## [16,]  291  292  293  294  295  296
## [17,]  297  298  299  300  301  302
##       [,1] [,2] [,3] [,4] [,5] [,6]
##  [1,]  301  302  303  304  305  306
##  [2,]  307  308  309  310  311  312
##  [3,]  313  314  315  316  317  318
##  [4,]  319  320  321  322  323  324
##  [5,]  325  326  327  328  329  330
##  [6,]  331  332  333  334  335  336
##  [7,]  337  338  339  340  341  342
##  [8,]  343  344  345  346  347  348
##  [9,]  349  350  351  352  353  354
## [10,]  355  356  357  358  359  360
## [11,]  361  362  363  364  365  366
## [12,]  367  368  369  370  371  372
## [13,]  373  374  375  376  377  378
## [14,]  379  380  381  382  383  384
## [15,]  385  386  387  388  389  390
## [16,]  391  392  393  394  395  396
## [17,]  397  398  399  400  401  402

print(AlphaDesign$sketch)

## $rep1
##       [,1]  [,2] [,3]  [,4]  [,5] [,6]
##  [1,] "23"  "93" "79"  "4"   "43" "18"
##  [2,] "28"  "39" "34"  "99"  "36" "91"
##  [3,] "26"  "31" "44"  "84"  "25" "3" 
##  [4,] "57"  "1"  "90"  "73"  "69" "45"
##  [5,] "80"  "10" "82"  "101" "37" "30"
##  [6,] "83"  "78" "61"  "42"  "47" "38"
##  [7,] "62"  "64" "66"  "7"   "75" "74"
##  [8,] "17"  "51" "35"  "94"  "55" "24"
##  [9,] "54"  "6"  "67"  "8"   "13" "95"
## [10,] "100" "5"  "87"  "60"  "49" "22"
## [11,] "52"  "53" "21"  "9"   "96" "14"
## [12,] "97"  "98" "92"  "29"  "2"  "48"
## [13,] "32"  "16" "27"  "46"  "72" "56"
## [14,] "58"  "77" "89"  "65"  "50" "19"
## [15,] "88"  "85" "40"  "59"  "12" "81"
## [16,] "15"  "41" "70"  "68"  "11" "63"
## [17,] "76"  "86" "102" "71"  "33" "20"
## 
## $rep2
##       [,1]  [,2] [,3]  [,4]  [,5] [,6]
##  [1,] "2"   "27" "68"  "42"  "69" "99"
##  [2,] "75"  "54" "33"  "23"  "58" "10"
##  [3,] "92"  "83" "16"  "41"  "45" "55"
##  [4,] "81"  "53" "87"  "3"   "79" "94"
##  [5,] "76"  "38" "101" "56"  "13" "34"
##  [6,] "84"  "12" "17"  "29"  "90" "70"
##  [7,] "73"  "31" "98"  "32"  "63" "51"
##  [8,] "39"  "47" "80"  "1"   "72" "95"
##  [9,] "48"  "24" "44"  "15"  "9"  "88"
## [10,] "100" "85" "25"  "35"  "11" "14"
## [11,] "86"  "89" "36"  "6"   "37" "74"
## [12,] "52"  "4"  "77"  "102" "5"  "66"
## [13,] "30"  "43" "50"  "20"  "62" "49"
## [14,] "78"  "57" "91"  "8"   "46" "97"
## [15,] "26"  "21" "22"  "7"   "59" "18"
## [16,] "40"  "19" "93"  "64"  "60" "96"
## [17,] "65"  "61" "67"  "82"  "71" "28"
## 
## $rep3
##       [,1]  [,2] [,3]  [,4]  [,5] [,6]
##  [1,] "47"  "28" "73"  "27"  "48" "54"
##  [2,] "97"  "52" "17"  "25"  "41" "59"
##  [3,] "1"   "3"  "24"  "2"   "40" "63"
##  [4,] "53"  "43" "74"  "31"  "88" "60"
##  [5,] "6"   "16" "30"  "90"  "78" "34"
##  [6,] "94"  "15" "61"  "72"  "69" "98"
##  [7,] "56"  "26" "92"  "57"  "70" "35"
##  [8,] "19"  "37" "83"  "20"  "91" "13"
##  [9,] "23"  "85" "84"  "5"   "55" "21"
## [10,] "68"  "96" "51"  "44"  "81" "49"
## [11,] "9"   "87" "62"  "76"  "93" "89"
## [12,] "101" "64" "86"  "8"   "50" "79"
## [13,] "80"  "71" "100" "66"  "18" "58"
## [14,] "10"  "32" "39"  "102" "67" "42"
## [15,] "4"   "65" "12"  "75"  "22" "14"
## [16,] "77"  "95" "7"   "82"  "33" "99"
## [17,] "45"  "36" "38"  "46"  "29" "11"

Visualization using desplot

# Add Row and columns
alphabookRowCol <- Alpha_book %>% 
  mutate(replication = paste0("Replication", replication),
         Row = block %>% as.integer,
         Col = cols %>% as.integer)

#field layout using desplot

desplot(alphabookRowCol, trtA ~ Row + Col|replication, flip = TRUE,
        out1 = replication, out1.gpar = list(col= "black", lyt= 3, lwd= 3),
        out2 = block, out2.gpar = list(col="black",lyt= 2),
        text = trtA,cex=1, shorten = "no",
        main = "Alpha Design",
        show.key = F)

Visualization using agricolae plotr

AlphaDesign %>% 
  agricolaeplotr::plot_alpha( x= "cols", y= "block", factor_name = "trtA", labels = "trtA") %>% 
  + guides(fill= "none")+ facet_wrap(~replication)

Partially Replicated Designs

P-rep designs are useful when an augmented design is using control plots where seed are limited and full replication is not feasible.
A portion of the test genotypes are replicated and the rest appears once.
The design is constructed in such a way that duplicated treatments can be used to estimate the residual variance.

Practical Example

A breeder has 106 test materials and 4 checks. She wishes to replicate her trial 3 times.

T1 <- letters[1:4]
T2 <- 1:106
design <- design.dau(T1,T2,3,serie = 2)
book.c <- design$book

Visualization using Desplot

# adding Row and column
pRowlCol <- book.c %>% 
  mutate(Row = plots %>% str_sub(2,3) %>% as.integer,
         Col = block %>% as.integer)

# field layout using desplot
desplot(trt ~ Row + Col, flip = TRUE,
        out1 = block, out1.gpar = list(col= "black", lwd= 2, lty= 3),
        text = trt, cex = 1, shorten = "no",
        data= pRowlCol,
        main= "P-rep Design",
        show.key = F)

Visualization using agricolaeplotr

design %>% 
  agricolaeplotr::plot_dau(y= "block", factor_name = "trt",labels = "trt") %>% 
+ guides(fill= "none")+ labs(title= "P-rep Design")

Linear Models: Analysis of Variance

Analysis of Variance (ANOVA) is used to determine differences between research results and experimental samples.

One way ANOVA focuses on one treatment factor while two way ANOVA focuses on multiple treatment factors.
Differences among measurements for same group is called experimental error (variance within groups), in order to have experimental error we need replication.
Differences between two unrelated groups is called variance between groups.
Fixed effects occurs when treatments are well defined, easily replicable and expected to yield same impact on average in each replicate while random effect occurs if the treatments cannot be considered to come from a predefined or known set (random sampling from a larger population). When the researcher’s aim is comparation of genotypes, then the factors are considered fixed, but when the aim of the researcher is selection the factors are considered random.
Heritability is the amount of phenotypic variation in a population, that is attributable to individual genetic differences. It is the ratio of variation due to differences between genotypes to the phenotypic variation for a trait in a population. \(h^2= \frac{σ_g^2}{{σ_g^2+ \frac{σ_e^2}{r}}}\)
Packages used are lmerTest and emmeans.

Principle of Analysis of Variance

If experimental error is higher than the variance between groups, we cannot conclude that there is a difference between the two groups.

Practical Example

One way ANOVA

Let’s consider the data Omer.Sorghum.E3.csv of the environment E3. Run a One-way ANVOA

library(lmerTest)
library(emmeans)

## Welcome to emmeans.
## Caution: You lose important information if you filter this package's results.
## See '? untidy'

omerE3 <- read_csv("data/Omer.Sorghum.E3.csv")

## Rows: 72 Columns: 4

## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): env, rep, gen
## dbl (1): yield
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

omerE3

## # A tibble: 72 × 4
##    env   rep   gen   yield
##    <chr> <chr> <chr> <dbl>
##  1 E3    R1    G01    449.
##  2 E3    R1    G02    458.
##  3 E3    R1    G03    545.
##  4 E3    R1    G04    547.
##  5 E3    R1    G05    784.
##  6 E3    R1    G06    501.
##  7 E3    R1    G07   1099.
##  8 E3    R1    G08    882.
##  9 E3    R1    G09    821.
## 10 E3    R1    G10    918.
## # ℹ 62 more rows

summary(omerE3)

##      env                rep                gen                yield        
##  Length:72          Length:72          Length:72          Min.   :  91.26  
##  Class :character   Class :character   Class :character   1st Qu.: 511.78  
##  Mode  :character   Mode  :character   Mode  :character   Median : 669.10  
##                                                           Mean   : 671.15  
##                                                           3rd Qu.: 808.90  
##                                                           Max.   :1327.45

unique(omerE3$env)

## [1] "E3"

unique(omerE3$rep)

## [1] "R1" "R2" "R3" "R4"

unique(omerE3$gen)

##  [1] "G01" "G02" "G03" "G04" "G05" "G06" "G07" "G08" "G09" "G10" "G11" "G12"
## [13] "G13" "G14" "G15" "G16" "G17" "G18"

One way ANOVA example cont’d

model.E3 <- lmer(yield ~ gen + (1|rep), data = omerE3) ## factors considered fixed
testE3 <- anova(model.E3, ddf = "Kenward-Roger", type = 3)
testE3

## Type III Analysis of Variance Table with Kenward-Roger's method
##      Sum Sq Mean Sq NumDF DenDF F value    Pr(>F)    
## gen 2478067  145769    17    51   5.688 6.427e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Estimate the adjusted means

lsmeans.E3 <- emmeans(model.E3, ~ gen)
lsmeans.E3

##  gen emmean   SE   df lower.CL upper.CL
##  G01    582 84.5 45.9      412      752
##  G02    515 84.5 45.9      345      685
##  G03    805 84.5 45.9      635      975
##  G04    749 84.5 45.9      579      919
##  G05    643 84.5 45.9      473      813
##  G06    315 84.5 45.9      145      485
##  G07    937 84.5 45.9      767     1107
##  G08    741 84.5 45.9      571      911
##  G09    813 84.5 45.9      643      983
##  G10    739 84.5 45.9      569      909
##  G11    728 84.5 45.9      558      898
##  G12    539 84.5 45.9      369      709
##  G13    825 84.5 45.9      655      995
##  G14    583 84.5 45.9      413      753
##  G15    647 84.5 45.9      477      817
##  G16    605 84.5 45.9      435      775
##  G17    285 84.5 45.9      115      455
##  G18   1030 84.5 45.9      860     1200
## 
## Degrees-of-freedom method: kenward-roger 
## Confidence level used: 0.95

pairs(lsmeans.E3)

##  contrast  estimate  SE df t.ratio p.value
##  G01 - G02    66.52 113 51   0.588  1.0000
##  G01 - G03  -223.07 113 51  -1.971  0.8690
##  G01 - G04  -167.40 113 51  -1.479  0.9885
##  G01 - G05   -61.05 113 51  -0.539  1.0000
##  G01 - G06   267.20 113 51   2.361  0.6381
##  G01 - G07  -355.20 113 51  -3.138  0.1798
##  G01 - G08  -159.53 113 51  -1.409  0.9930
##  G01 - G09  -231.31 113 51  -2.043  0.8340
##  G01 - G10  -156.66 113 51  -1.384  0.9943
##  G01 - G11  -146.29 113 51  -1.292  0.9973
##  G01 - G12    42.87 113 51   0.379  1.0000
##  G01 - G13  -243.18 113 51  -2.148  0.7759
##  G01 - G14    -1.06 113 51  -0.009  1.0000
##  G01 - G15   -64.72 113 51  -0.572  1.0000
##  G01 - G16   -22.81 113 51  -0.201  1.0000
##  G01 - G17   296.66 113 51   2.621  0.4581
##  G01 - G18  -448.28 113 51  -3.960  0.0227
##  G02 - G03  -289.59 113 51  -2.558  0.5006
##  G02 - G04  -233.91 113 51  -2.066  0.8220
##  G02 - G05  -127.57 113 51  -1.127  0.9995
##  G02 - G06   200.69 113 51   1.773  0.9402
##  G02 - G07  -421.72 113 51  -3.726  0.0434
##  G02 - G08  -226.04 113 51  -1.997  0.8569
##  G02 - G09  -297.82 113 51  -2.631  0.4513
##  G02 - G10  -223.18 113 51  -1.972  0.8686
##  G02 - G11  -212.80 113 51  -1.880  0.9061
##  G02 - G12   -23.64 113 51  -0.209  1.0000
##  G02 - G13  -309.69 113 51  -2.736  0.3835
##  G02 - G14   -67.58 113 51  -0.597  1.0000
##  G02 - G15  -131.24 113 51  -1.159  0.9993
##  G02 - G16   -89.32 113 51  -0.789  1.0000
##  G02 - G17   230.14 113 51   2.033  0.8392
##  G02 - G18  -514.79 113 51  -4.548  0.0039
##  G03 - G04    55.67 113 51   0.492  1.0000
##  G03 - G05   162.02 113 51   1.431  0.9918
##  G03 - G06   490.28 113 51   4.331  0.0076
##  G03 - G07  -132.13 113 51  -1.167  0.9992
##  G03 - G08    63.55 113 51   0.561  1.0000
##  G03 - G09    -8.23 113 51  -0.073  1.0000
##  G03 - G10    66.41 113 51   0.587  1.0000
##  G03 - G11    76.79 113 51   0.678  1.0000
##  G03 - G12   265.95 113 51   2.349  0.6458
##  G03 - G13   -20.10 113 51  -0.178  1.0000
##  G03 - G14   222.01 113 51   1.961  0.8732
##  G03 - G15   158.35 113 51   1.399  0.9936
##  G03 - G16   200.27 113 51   1.769  0.9412
##  G03 - G17   519.73 113 51   4.591  0.0034
##  G03 - G18  -225.20 113 51  -1.989  0.8604
##  G04 - G05   106.35 113 51   0.939  1.0000
##  G04 - G06   434.61 113 51   3.839  0.0319
##  G04 - G07  -187.81 113 51  -1.659  0.9659
##  G04 - G08     7.87 113 51   0.070  1.0000
##  G04 - G09   -63.91 113 51  -0.565  1.0000
##  G04 - G10    10.74 113 51   0.095  1.0000
##  G04 - G11    21.11 113 51   0.187  1.0000
##  G04 - G12   210.27 113 51   1.858  0.9141
##  G04 - G13   -75.78 113 51  -0.669  1.0000
##  G04 - G14   166.34 113 51   1.469  0.9892
##  G04 - G15   102.68 113 51   0.907  1.0000
##  G04 - G16   144.59 113 51   1.277  0.9976
##  G04 - G17   464.06 113 51   4.100  0.0152
##  G04 - G18  -280.88 113 51  -2.481  0.5540
##  G05 - G06   328.26 113 51   2.900  0.2885
##  G05 - G07  -294.15 113 51  -2.599  0.4731
##  G05 - G08   -98.47 113 51  -0.870  1.0000
##  G05 - G09  -170.25 113 51  -1.504  0.9864
##  G05 - G10   -95.61 113 51  -0.845  1.0000
##  G05 - G11   -85.23 113 51  -0.753  1.0000
##  G05 - G12   103.92 113 51   0.918  1.0000
##  G05 - G13  -182.12 113 51  -1.609  0.9741
##  G05 - G14    59.99 113 51   0.530  1.0000
##  G05 - G15    -3.67 113 51  -0.032  1.0000
##  G05 - G16    38.24 113 51   0.338  1.0000
##  G05 - G17   357.71 113 51   3.160  0.1715
##  G05 - G18  -387.23 113 51  -3.421  0.0945
##  G06 - G07  -622.41 113 51  -5.498  0.0002
##  G06 - G08  -426.73 113 51  -3.770  0.0385
##  G06 - G09  -498.51 113 51  -4.404  0.0061
##  G06 - G10  -423.87 113 51  -3.744  0.0413
##  G06 - G11  -413.49 113 51  -3.653  0.0526
##  G06 - G12  -224.33 113 51  -1.982  0.8640
##  G06 - G13  -510.38 113 51  -4.509  0.0044
##  G06 - G14  -268.26 113 51  -2.370  0.6317
##  G06 - G15  -331.93 113 51  -2.932  0.2716
##  G06 - G16  -290.01 113 51  -2.562  0.4980
##  G06 - G17    29.45 113 51   0.260  1.0000
##  G06 - G18  -715.48 113 51  -6.321  <.0001
##  G07 - G08   195.68 113 51   1.729  0.9514
##  G07 - G09   123.90 113 51   1.095  0.9996
##  G07 - G10   198.54 113 51   1.754  0.9452
##  G07 - G11   208.92 113 51   1.846  0.9182
##  G07 - G12   398.08 113 51   3.517  0.0746
##  G07 - G13   112.03 113 51   0.990  0.9999
##  G07 - G14   354.14 113 51   3.129  0.1834
##  G07 - G15   290.49 113 51   2.566  0.4952
##  G07 - G16   332.40 113 51   2.936  0.2695
##  G07 - G17   651.86 113 51   5.759  0.0001
##  G07 - G18   -93.07 113 51  -0.822  1.0000
##  G08 - G09   -71.78 113 51  -0.634  1.0000
##  G08 - G10     2.87 113 51   0.025  1.0000
##  G08 - G11    13.24 113 51   0.117  1.0000
##  G08 - G12   202.40 113 51   1.788  0.9360
##  G08 - G13   -83.65 113 51  -0.739  1.0000
##  G08 - G14   158.47 113 51   1.400  0.9935
##  G08 - G15    94.81 113 51   0.838  1.0000
##  G08 - G16   136.72 113 51   1.208  0.9988
##  G08 - G17   456.19 113 51   4.030  0.0186
##  G08 - G18  -288.75 113 51  -2.551  0.5057
##  G09 - G10    74.64 113 51   0.659  1.0000
##  G09 - G11    85.02 113 51   0.751  1.0000
##  G09 - G12   274.18 113 51   2.422  0.5954
##  G09 - G13   -11.87 113 51  -0.105  1.0000
##  G09 - G14   230.25 113 51   2.034  0.8388
##  G09 - G15   166.59 113 51   1.472  0.9891
##  G09 - G16   208.50 113 51   1.842  0.9194
##  G09 - G17   527.96 113 51   4.664  0.0027
##  G09 - G18  -216.97 113 51  -1.917  0.8920
##  G10 - G11    10.38 113 51   0.092  1.0000
##  G10 - G12   199.53 113 51   1.763  0.9430
##  G10 - G13   -86.51 113 51  -0.764  1.0000
##  G10 - G14   155.60 113 51   1.375  0.9947
##  G10 - G15    91.94 113 51   0.812  1.0000
##  G10 - G16   133.85 113 51   1.182  0.9991
##  G10 - G17   453.32 113 51   4.005  0.0200
##  G10 - G18  -291.62 113 51  -2.576  0.4883
##  G11 - G12   189.16 113 51   1.671  0.9637
##  G11 - G13   -96.89 113 51  -0.856  1.0000
##  G11 - G14   145.23 113 51   1.283  0.9975
##  G11 - G15    81.57 113 51   0.721  1.0000
##  G11 - G16   123.48 113 51   1.091  0.9997
##  G11 - G17   442.94 113 51   3.913  0.0260
##  G11 - G18  -301.99 113 51  -2.668  0.4269
##  G12 - G13  -286.05 113 51  -2.527  0.5222
##  G12 - G14   -43.93 113 51  -0.388  1.0000
##  G12 - G15  -107.59 113 51  -0.950  0.9999
##  G12 - G16   -65.68 113 51  -0.580  1.0000
##  G12 - G17   253.78 113 51   2.242  0.7177
##  G12 - G18  -491.15 113 51  -4.339  0.0074
##  G13 - G14   242.12 113 51   2.139  0.7814
##  G13 - G15   178.46 113 51   1.576  0.9786
##  G13 - G16   220.37 113 51   1.947  0.8795
##  G13 - G17   539.83 113 51   4.769  0.0019
##  G13 - G18  -205.10 113 51  -1.812  0.9290
##  G14 - G15   -63.66 113 51  -0.562  1.0000
##  G14 - G16   -21.75 113 51  -0.192  1.0000
##  G14 - G17   297.72 113 51   2.630  0.4519
##  G14 - G18  -447.22 113 51  -3.951  0.0233
##  G15 - G16    41.91 113 51   0.370  1.0000
##  G15 - G17   361.38 113 51   3.192  0.1598
##  G15 - G18  -383.56 113 51  -3.388  0.1021
##  G16 - G17   319.46 113 51   2.822  0.3316
##  G16 - G18  -425.47 113 51  -3.759  0.0397
##  G17 - G18  -744.93 113 51  -6.581  <.0001
## 
## Degrees-of-freedom method: kenward-roger 
## P value adjustment: tukey method for comparing a family of 18 estimates

What genotypes are different to G05?

emmeans(model.E3, specs = trt.vs.ctrl ~ gen, ref= 05)

## $emmeans
##  gen emmean   SE   df lower.CL upper.CL
##  G01    582 84.5 45.9      412      752
##  G02    515 84.5 45.9      345      685
##  G03    805 84.5 45.9      635      975
##  G04    749 84.5 45.9      579      919
##  G05    643 84.5 45.9      473      813
##  G06    315 84.5 45.9      145      485
##  G07    937 84.5 45.9      767     1107
##  G08    741 84.5 45.9      571      911
##  G09    813 84.5 45.9      643      983
##  G10    739 84.5 45.9      569      909
##  G11    728 84.5 45.9      558      898
##  G12    539 84.5 45.9      369      709
##  G13    825 84.5 45.9      655      995
##  G14    583 84.5 45.9      413      753
##  G15    647 84.5 45.9      477      817
##  G16    605 84.5 45.9      435      775
##  G17    285 84.5 45.9      115      455
##  G18   1030 84.5 45.9      860     1200
## 
## Degrees-of-freedom method: kenward-roger 
## Confidence level used: 0.95 
## 
## $contrasts
##  contrast  estimate  SE df t.ratio p.value
##  G01 - G05   -61.05 113 51  -0.539  0.9963
##  G02 - G05  -127.57 113 51  -1.127  0.8940
##  G03 - G05   162.02 113 51   1.431  0.7490
##  G04 - G05   106.35 113 51   0.939  0.9500
##  G06 - G05  -328.26 113 51  -2.900  0.0656
##  G07 - G05   294.15 113 51   2.599  0.1309
##  G08 - G05    98.47 113 51   0.870  0.9643
##  G09 - G05   170.25 113 51   1.504  0.7063
##  G10 - G05    95.61 113 51   0.845  0.9687
##  G11 - G05    85.23 113 51   0.753  0.9815
##  G12 - G05  -103.92 113 51  -0.918  0.9547
##  G13 - G05   182.12 113 51   1.609  0.6413
##  G14 - G05   -59.99 113 51  -0.530  0.9966
##  G15 - G05     3.67 113 51   0.032  1.0000
##  G16 - G05   -38.24 113 51  -0.338  0.9997
##  G17 - G05  -357.71 113 51  -3.160  0.0340
##  G18 - G05   387.23 113 51   3.421  0.0168
## 
## Degrees-of-freedom method: kenward-roger 
## P value adjustment: dunnettx method for 17 tests

Two way ANOVA

omer.S <- read_csv("data/omer.sorghum.csv")

## Rows: 432 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): env, rep, gen
## dbl (1): yield
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

omer.S

## # A tibble: 432 × 4
##    env   rep   gen   yield
##    <chr> <chr> <chr> <dbl>
##  1 E1    R1    G01   140. 
##  2 E1    R1    G02   183. 
##  3 E1    R1    G03   172. 
##  4 E1    R1    G04   200. 
##  5 E1    R1    G05   157. 
##  6 E1    R1    G06   177. 
##  7 E1    R1    G07    55.4
##  8 E1    R1    G08    89.5
##  9 E1    R1    G09   296. 
## 10 E1    R1    G10   312. 
## # ℹ 422 more rows

summary(omer.S)

##      env                rep                gen                yield        
##  Length:432         Length:432         Length:432         Min.   :  10.17  
##  Class :character   Class :character   Class :character   1st Qu.: 182.07  
##  Mode  :character   Mode  :character   Mode  :character   Median : 339.88  
##                                                           Mean   : 496.11  
##                                                           3rd Qu.: 728.59  
##                                                           Max.   :2037.63

unique(omer.S$env)

## [1] "E1" "E2" "E3" "E4" "E5" "E6"

unique(omer.S$rep)

## [1] "R1" "R2" "R3" "R4"

unique(omer.S$gen)

##  [1] "G01" "G02" "G03" "G04" "G05" "G06" "G07" "G08" "G09" "G10" "G11" "G12"
## [13] "G13" "G14" "G15" "G16" "G17" "G18"

Run a two-way ANOVA (genotype and environment) considering these two factors and the interaction as fixed.

model.S <- lmer(yield ~ gen + env + gen:env + (1|env:rep), omer.S) ## factors are considered fixed
test.S <-  anova(model.S, ddf = "Kenward-Roger", type = 3)
test.S

## Type III Analysis of Variance Table with Kenward-Roger's method
##           Sum Sq Mean Sq NumDF DenDF  F value    Pr(>F)    
## gen      2347588  138093    17   306   5.6000 5.183e-11 ***
## env     29552101 5910420     5    18 239.6815 8.225e-16 ***
## gen:env  9352495  110029    85   306   4.4619 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

lsmeans.S <- emmeans(model.S, ~ gen:env)
lsmeans.S

##  gen env emmean   SE  df lower.CL upper.CL
##  G01 E1   130.6 80.3 313  -27.475      289
##  G02 E1   182.5 80.3 313   24.465      341
##  G03 E1   159.3 80.3 313    1.253      317
##  G04 E1   185.6 80.3 313   27.510      344
##  G05 E1   157.2 80.3 313   -0.825      315
##  G06 E1   164.6 80.3 313    6.548      323
##  G07 E1    36.6 80.3 313 -121.467      195
##  G08 E1   122.7 80.3 313  -35.385      281
##  G09 E1   235.5 80.3 313   77.485      394
##  G10 E1   286.6 80.3 313  128.565      445
##  G11 E1   177.1 80.3 313   19.013      335
##  G12 E1   157.3 80.3 313   -0.740      315
##  G13 E1    96.4 80.3 313  -61.695      254
##  G14 E1   100.6 80.3 313  -57.422      259
##  G15 E1    22.3 80.3 313 -135.737      180
##  G16 E1   188.1 80.3 313   30.053      346
##  G17 E1    19.8 80.3 313 -138.245      178
##  G18 E1   174.5 80.3 313   16.430      333
##  G01 E2   242.8 80.3 313   84.748      401
##  G02 E2   330.9 80.3 313  172.868      489
##  G03 E2   212.0 80.3 313   53.925      370
##  G04 E2   128.7 80.3 313  -29.357      287
##  G05 E2   105.2 80.3 313  -52.862      263
##  G06 E2   167.6 80.3 313    9.513      326
##  G07 E2   188.6 80.3 313   30.555      347
##  G08 E2   482.5 80.3 313  324.488      641
##  G09 E2   439.7 80.3 313  281.643      598
##  G10 E2   196.5 80.3 313   38.418      355
##  G11 E2   353.8 80.3 313  195.703      512
##  G12 E2   505.9 80.3 313  347.888      664
##  G13 E2   643.5 80.3 313  485.455      802
##  G14 E2   323.9 80.3 313  165.860      482
##  G15 E2   440.9 80.3 313  282.893      599
##  G16 E2   268.6 80.3 313  110.585      427
##  G17 E2   293.6 80.3 313  135.495      452
##  G18 E2   260.2 80.3 313  102.173      418
##  G01 E3   581.9 80.3 313  423.800      740
##  G02 E3   515.3 80.3 313  357.285      673
##  G03 E3   804.9 80.3 313  646.875      963
##  G04 E3   749.3 80.3 313  591.200      907
##  G05 E3   642.9 80.3 313  484.853      801
##  G06 E3   314.6 80.3 313  156.595      473
##  G07 E3   937.1 80.3 313  779.005     1095
##  G08 E3   741.4 80.3 313  583.328      899
##  G09 E3   813.2 80.3 313  655.105      971
##  G10 E3   738.5 80.3 313  580.463      897
##  G11 E3   728.1 80.3 313  570.088      886
##  G12 E3   539.0 80.3 313  380.928      697
##  G13 E3   825.0 80.3 313  666.975      983
##  G14 E3   582.9 80.3 313  424.860      741
##  G15 E3   646.6 80.3 313  488.520      805
##  G16 E3   604.7 80.3 313  446.608      763
##  G17 E3   285.2 80.3 313  127.143      443
##  G18 E3  1030.1 80.3 313  872.078     1188
##  G01 E4   294.1 80.3 313  136.050      452
##  G02 E4   639.8 80.3 313  481.710      798
##  G03 E4   272.4 80.3 313  114.368      430
##  G04 E4   393.5 80.3 313  235.475      552
##  G05 E4   456.9 80.3 313  298.853      615
##  G06 E4   146.6 80.3 313  -11.422      305
##  G07 E4   524.2 80.3 313  366.158      682
##  G08 E4   419.1 80.3 313  261.000      577
##  G09 E4   353.0 80.3 313  194.903      511
##  G10 E4   659.0 80.3 313  500.925      817
##  G11 E4   640.3 80.3 313  482.235      798
##  G12 E4   295.4 80.3 313  137.363      453
##  G13 E4   590.9 80.3 313  432.830      749
##  G14 E4   401.3 80.3 313  243.255      559
##  G15 E4   481.3 80.3 313  323.265      639
##  G16 E4   320.7 80.3 313  162.615      479
##  G17 E4   996.3 80.3 313  838.290     1154
##  G18 E4   670.7 80.3 313  512.675      829
##  G01 E5   197.0 80.3 313   38.915      355
##  G02 E5   157.9 80.3 313   -0.125      316
##  G03 E5   221.4 80.3 313   63.345      379
##  G04 E5   121.4 80.3 313  -36.667      279
##  G05 E5   207.4 80.3 313   49.335      365
##  G06 E5   191.2 80.3 313   33.153      349
##  G07 E5   159.1 80.3 313    1.045      317
##  G08 E5   130.9 80.3 313  -27.165      289
##  G09 E5   115.2 80.3 313  -42.815      273
##  G10 E5   398.4 80.3 313  240.328      556
##  G11 E5   210.7 80.3 313   52.615      369
##  G12 E5   169.3 80.3 313   11.248      327
##  G13 E5   341.0 80.3 313  182.935      499
##  G14 E5   203.7 80.3 313   45.625      362
##  G15 E5   210.7 80.3 313   52.645      369
##  G16 E5   122.1 80.3 313  -35.925      280
##  G17 E5   210.7 80.3 313   52.615      369
##  G18 E5    78.0 80.3 313  -80.055      236
##  G01 E6   836.7 80.3 313  678.655      995
##  G02 E6  1220.6 80.3 313 1062.568     1379
##  G03 E6   902.3 80.3 313  744.228     1060
##  G04 E6   814.5 80.3 313  656.398      973
##  G05 E6  1180.0 80.3 313 1021.960     1338
##  G06 E6  1170.3 80.3 313 1012.258     1328
##  G07 E6  1156.6 80.3 313  998.588     1315
##  G08 E6  1238.7 80.3 313 1080.618     1397
##  G09 E6  1312.5 80.3 313 1154.440     1471
##  G10 E6  1267.4 80.3 313 1109.325     1425
##  G11 E6  1022.5 80.3 313  864.448     1181
##  G12 E6  1338.7 80.3 313 1180.685     1497
##  G13 E6  1220.6 80.3 313 1062.568     1379
##  G14 E6  1649.5 80.3 313 1491.483     1808
##  G15 E6  1749.9 80.3 313 1591.870     1908
##  G16 E6   918.8 80.3 313  760.695     1077
##  G17 E6  1323.4 80.3 313 1165.380     1481
##  G18 E6   992.6 80.3 313  834.523     1151
## 
## Degrees-of-freedom method: kenward-roger 
## Confidence level used: 0.95

Practical Example cont’d

Run a two-way ANOVA (genotype and environment) considering these two factors and the interaction as random.

model.S.R <- lmer( yield ~ (1|gen) + (1|env) + (1|gen:env) + (1|env:rep),  omer.S) 

summary(model.S.R)

## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: yield ~ (1 | gen) + (1 | env) + (1 | gen:env) + (1 | env:rep)
##    Data: omer.S
## 
## REML criterion at convergence: 5785.7
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.7533 -0.3550 -0.0211  0.2433  7.1388 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  gen:env  (Intercept)  21343   146.09  
##  env:rep  (Intercept)   1152    33.94  
##  gen      (Intercept)   1169    34.19  
##  env      (Intercept) 149280   386.37  
##  Residual              24660   157.03  
## Number of obs: 432, groups:  gen:env, 108; env:rep, 24; gen, 18; env, 6
## 
## Fixed effects:
##             Estimate Std. Error      df t value Pr(>|t|)  
## (Intercept)  496.113    158.895   5.027   3.122    0.026 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Estimate the combined heritability

\(h^2= \frac{σ_g^2}{{σ_g^2+ \frac{σ_{g*e}^2}{e}+\frac{σ_e^2}{e*r}}}\)

\(h^2= \frac{1169}{1169+(21343/6)+(24660/(4*6))} = 0.20\)

Multivariate Analysis

PCA (Principle Component Analysis)

PCA is used for exploratory data analysis, allowing to better visualize the variation present in a data set with many variables.

It involves the transformation of large data into a smaller one that still contains most of the principal components.
The principal components are the directions where there is the most variance.
An eigenvector is a direction which can be horizontal or a certain degree while an eigenvalue is a number indicating the variance of the data in that direction.
The data used to conduct PCA is not usually raw data, because the treatments should not have replicates.
Library used factoextra , stats , ggbiplot

Practical Example

library(factoextra)
library(ggbiplot)
data <- read_csv("data/australia.soybean.csv")

## Rows: 464 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): env, loc, gen
## dbl (7): year, yield, height, lodging, size, protein, oil
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

data.brookstead <- data %>% 
  filter(loc== "Brookstead")
str(data.brookstead)

## spc_tbl_ [116 × 10] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ env    : chr [1:116] "B70" "B70" "B70" "B70" ...
##  $ loc    : chr [1:116] "Brookstead" "Brookstead" "Brookstead" "Brookstead" ...
##  $ year   : num [1:116] 1970 1970 1970 1970 1970 1970 1970 1970 1970 1970 ...
##  $ gen    : chr [1:116] "G01" "G02" "G03" "G04" ...
##  $ yield  : num [1:116] 1.253 1.167 0.468 1.445 1.338 ...
##  $ height : num [1:116] 1.01 1.13 1.16 1.24 1.12 ...
##  $ lodging: num [1:116] 3.25 2.75 2.25 1.5 2 2.25 2 2.25 1.75 2 ...
##  $ size   : num [1:116] 8.85 8.9 10.8 10.6 11.95 ...
##  $ protein: num [1:116] 39.5 38.6 37.8 38.7 37.8 ...
##  $ oil    : num [1:116] 18.9 19.8 20.4 20.4 20.8 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   env = col_character(),
##   ..   loc = col_character(),
##   ..   year = col_double(),
##   ..   gen = col_character(),
##   ..   yield = col_double(),
##   ..   height = col_double(),
##   ..   lodging = col_double(),
##   ..   size = col_double(),
##   ..   protein = col_double(),
##   ..   oil = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

labels.gen.b <- data.brookstead$gen
labels.year.b <- data.brookstead$year

data.brookstead <- data.brookstead %>% 
  select(-c(env, loc,gen, year))

data.brookstead.pca <- prcomp(data.brookstead, center = TRUE, scale = TRUE)
summary(data.brookstead.pca)

## Importance of components:
##                           PC1    PC2    PC3     PC4     PC5     PC6
## Standard deviation     1.8780 0.9949 0.9176 0.60462 0.42702 0.30580
## Proportion of Variance 0.5878 0.1650 0.1403 0.06093 0.03039 0.01559
## Cumulative Proportion  0.5878 0.7528 0.8931 0.95402 0.98441 1.00000

fviz_eig(data.brookstead.pca, addlabels = TRUE)

ggbiplot(data.brookstead.pca, labels = labels.gen.b, groups = labels.year.b)+
  labs(title = "Yield and Other Traits of soybeans in Brookstead Australia",
       subtitle = "Period 1970-1971",
       caption = "Data: agridat::australia.soybean")

Hierarchial Cluster Analysis

Hierarchical classification algorithm allows a set of individuals to be grouped into subsets.

Objective

To create coherent clusters which are clearly different from each other.
Individuals in a cluster should be as homogeneous as possible.

Methods of Hierarchial Clustering

Agglomerative: sequential grouping of similar clusters.
Divisive: grouping of all the observation together into one cluster and the successively splitting into smaller clusters.

The similarities between clusters are often calculated using distance metrics . The greater the distance between two clusters the better. After distance metric we choose the linkage criteria.

Libraries used factoextra, dendextend

Example

library(dendextend)
library(factoextra)

dat <- read_csv("data/steptoe.morex.pheno.csv")

## Rows: 2432 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): gen, env
## dbl (8): amylase, diapow, hddate, lodging, malt, height, protein, yield
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

summary(dat)

##      gen                env               amylase          diapow      
##  Length:2432        Length:2432        Min.   :14.90   Min.   : 35.00  
##  Class :character   Class :character   1st Qu.:25.62   1st Qu.: 70.00  
##  Mode  :character   Mode  :character   Median :28.50   Median : 84.00  
##                                        Mean   :29.04   Mean   : 87.25  
##                                        3rd Qu.:32.00   3rd Qu.:101.00  
##                                        Max.   :49.30   Max.   :229.00  
##                                        NA's   :1066    NA's   :1066    
##      hddate         lodging            malt           height      
##  Min.   :143.0   Min.   :  0.00   Min.   :69.00   Min.   : 34.00  
##  1st Qu.:174.5   1st Qu.: 15.00   1st Qu.:73.30   1st Qu.: 82.50  
##  Median :183.5   Median : 35.00   Median :74.50   Median : 95.00  
##  Mean   :181.2   Mean   : 37.13   Mean   :74.72   Mean   : 94.95  
##  3rd Qu.:191.0   3rd Qu.: 55.00   3rd Qu.:75.90   3rd Qu.:109.00  
##  Max.   :217.0   Max.   :100.00   Max.   :83.00   Max.   :151.00  
##                  NA's   :1521     NA's   :1067    NA's   :5       
##     protein          yield       
##  Min.   : 8.00   Min.   : 1.390  
##  1st Qu.:11.90   1st Qu.: 4.092  
##  Median :13.00   Median : 5.272  
##  Mean   :12.92   Mean   : 5.293  
##  3rd Qu.:14.00   3rd Qu.: 6.338  
##  Max.   :17.50   Max.   :11.526  
##  NA's   :1067

unique(dat$env)

##  [1] "MN92"  "MTi92" "MTd92" "ID91"  "OR91"  "WA91"  "MTi91" "MTd91" "NY92" 
## [10] "ON92"  "WA92"  "MA92"  "SKo92" "SKg92" "SKk92" "ID92"

dat.ID91 <- dat %>% 
  filter(env== "ID91")
summary(dat.ID91)

##      gen                env               amylase          diapow      
##  Length:152         Length:152         Min.   :19.10   Min.   : 47.00  
##  Class :character   Class :character   1st Qu.:25.48   1st Qu.: 72.00  
##  Mode  :character   Mode  :character   Median :28.05   Median : 83.00  
##                                        Mean   :28.08   Mean   : 84.16  
##                                        3rd Qu.:30.12   3rd Qu.: 94.00  
##                                        Max.   :40.10   Max.   :144.00  
##                                                                        
##      hddate         lodging         malt           height         protein     
##  Min.   :173.0   Min.   : NA   Min.   :71.10   Min.   : 92.7   Min.   :11.30  
##  1st Qu.:178.0   1st Qu.: NA   1st Qu.:73.40   1st Qu.:108.9   1st Qu.:12.90  
##  Median :180.5   Median : NA   Median :74.40   Median :114.3   Median :13.50  
##  Mean   :180.5   Mean   :NaN   Mean   :74.36   Mean   :114.0   Mean   :13.57  
##  3rd Qu.:183.0   3rd Qu.: NA   3rd Qu.:75.20   3rd Qu.:119.4   3rd Qu.:14.22  
##  Max.   :188.0   Max.   : NA   Max.   :78.50   Max.   :137.2   Max.   :16.00  
##                  NA's   :152                                                  
##      yield       
##  Min.   : 4.048  
##  1st Qu.: 6.681  
##  Median : 7.684  
##  Mean   : 7.500  
##  3rd Qu.: 8.382  
##  Max.   :10.315  
##

dat.ID91 <- dat.ID91 %>% 
  select(-lodging)
names(dat.ID91)

## [1] "gen"     "env"     "amylase" "diapow"  "hddate"  "malt"    "height" 
## [8] "protein" "yield"

Example cont’d

dat.ID91 <- na.omit(dat.ID91)
dat.ID91.label <- dat.ID91$gen
dat.ID91 <- dat.ID91 %>% 
  select(-c(gen, env))
dat.ID91

## # A tibble: 152 × 7
##    amylase diapow hddate  malt height protein yield
##      <dbl>  <dbl>  <dbl> <dbl>  <dbl>   <dbl> <dbl>
##  1    26.2     74   179   74.1   111     12.1  8.63
##  2    36.2     97   180   76.7   116     14.8  7.95
##  3    32.8     83   183   74.4   117.    14.3  6.10
##  4    24.9     81   178.  72.3   113     14    7.20
##  5    30.2     69   180   72.9   117.    13.8  6.01
##  6    30.3     99   184.  73.2   119.    14.9  8.09
##  7    33.4     63   178   73.5   104.    13.1  7.98
##  8    25.5     74   186.  71.8   130.    14.6  5.93
##  9    26.8     78   185   75     124.    12.8  8.72
## 10    28.3     85   179   71.3   119.    14.1  6.48
## # ℹ 142 more rows

dat.ID91.sc <- scale(dat.ID91)

summary(dat.ID91.sc)

##     amylase              diapow             hddate              malt         
##  Min.   :-2.323173   Min.   :-2.13347   Min.   :-2.13124   Min.   :-2.39465  
##  1st Qu.:-0.674579   1st Qu.:-0.69806   1st Qu.:-0.71724   1st Qu.:-0.70544  
##  Median :-0.008677   Median :-0.06648   Median :-0.01023   Median : 0.02899  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.00000   Mean   : 0.00000  
##  3rd Qu.: 0.527924   3rd Qu.: 0.56510   3rd Qu.: 0.69677   3rd Qu.: 0.61654  
##  Max.   : 3.107488   Max.   : 3.43592   Max.   : 2.11078   Max.   : 3.04018  
##      height            protein             yield        
##  Min.   :-2.38931   Min.   :-2.44921   Min.   :-2.7274  
##  1st Qu.:-0.57607   1st Qu.:-0.72269   1st Qu.:-0.6471  
##  Median : 0.02835   Median :-0.07525   Median : 0.1461  
##  Mean   : 0.00000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.59919   3rd Qu.: 0.70708   3rd Qu.: 0.6974  
##  Max.   : 2.59152   Max.   : 2.62243   Max.   : 2.2246

fviz_nbclust(dat.ID91.sc, kmeans, method = "wss")

datpca <- prcomp(dat.ID91.sc, scale = TRUE)
fviz_eig(datpca, addlabels = TRUE)

Experimental Designs, Linear Mixed Models and Multivariate Analysis

Advanced Data Wrangling

Pivoting

pivot_longer()

pivot_wider()

Combining Tables

Practical Example

Creating Groups

Combining and Subsetting Strings

Practical Example

Design of Experiments

Features of an Experimental Design

Important Principles of Experimental Designs.

Choosing the Right Experiment

Randomized Complete Block Design

Features of RCBD

Practical Example

Visualization of the plot using desplot

Plot Using agricolaeplotr

Splitplot Designs

Example

Visualization with desplot

Visualization with agricolaeplotr

Incomplete Block Designs

Latin Square Design

Visualization using desplot

Visualization using agricolaeplotr

Lattice Design

Visualization using desplot

Visualization using agricolaeplotr

Augmented Design

Features of Augmented Designs.

Practical Example

Visualization using desplot

Visualization using agricolaeplotr

Alpha designs

Pratical Example

Visualization using desplot

Visualization using agricolae plotr

Partially Replicated Designs

Practical Example

Visualization using Desplot

Visualization using agricolaeplotr

Linear Models: Analysis of Variance

Principle of Analysis of Variance

Practical Example

One way ANOVA

One way ANOVA example cont’d

Estimate the adjusted means

What genotypes are different to G05?

Two way ANOVA

Run a two-way ANOVA (genotype and environment) considering these two factors and the interaction as fixed.

Practical Example cont’d

Estimate the combined heritability

Multivariate Analysis

PCA (Principle Component Analysis)

Practical Example

Hierarchial Cluster Analysis

Objective

Methods of Hierarchial Clustering

Example

Example cont’d

Example cont’d

Kmeans Clustering

Example

THANK YOU