In this project, a High Throughput Screening Experiment is studied. 60 plates are analyzed, where 96 different strains are present in each of the plates. Out of 96 strains, the aim is to find improved strains via analyzing different variables affecting strains. Well type, strain location are some of the important factors affecting strains.

In this experimental setup, 20, 60 and 180 hits will be identified based on their priorities.

In a primary screen without replicates, every compound is measured only once. Consequently, we cannot directly estimate the data variability for each compound. Instead, we indirectly estimate data variability by making a strong assumption that every compound has the same variability as a negative reference in a plate in the screen.

#install.packages("readxl")
#install.packages("remotes")
#install.packages("platetools")

library(platetools)
library(readxl)
library(phenoScreen)
library(ggplot2)
library(tibble)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(readr)
library(GAD)
## Loading required package: matrixStats
## 
## Attaching package: 'matrixStats'
## The following object is masked from 'package:dplyr':
## 
##     count
## Loading required package: R.methodsS3
## R.methodsS3 v1.8.1 (2020-08-26 16:20:06 UTC) successfully loaded. See ?R.methodsS3 for help.
df <- read_excel("Sample_T1_Data.xlsx")
head(df)
## # A tibble: 6 x 7
##   counter plate strain well_type     robot  stacker_position  value
##     <dbl> <dbl> <chr>  <chr>         <chr>             <dbl>  <dbl>
## 1       1     1 X1     Standard Well Bender                1 10236.
## 2       2     1 X2     Standard Well Bender                1 10370.
## 3       3     1 X3     Standard Well Bender                1  9651.
## 4       4     1 X4     Standard Well Bender                1 10172.
## 5       5     1 X5     Standard Well Bender                1  9493.
## 6       6     1 X6     Standard Well Bender                1 10216.

Initial Observation of Values via Histogram

hist(df$value, breaks = 50)

Summary of the dataset.

summary(df)
##     counter         plate          strain           well_type        
##  Min.   :   1   Min.   : 1.00   Length:5760        Length:5760       
##  1st Qu.:1441   1st Qu.:15.75   Class :character   Class :character  
##  Median :2880   Median :30.50   Mode  :character   Mode  :character  
##  Mean   :2880   Mean   :30.50                                        
##  3rd Qu.:4320   3rd Qu.:45.25                                        
##  Max.   :5760   Max.   :60.00                                        
##     robot           stacker_position     value      
##  Length:5760        Min.   : 1.00    Min.   : 5970  
##  Class :character   1st Qu.: 5.75    1st Qu.: 8384  
##  Mode  :character   Median :10.50    Median : 8908  
##                     Mean   :10.50    Mean   : 8885  
##                     3rd Qu.:15.25    3rd Qu.: 9443  
##                     Max.   :20.00    Max.   :11194

Transforming each variable to factors to prepare the dataset for experimental design setup.

In this project, an experimental factorial design setup will be implemented.

df$a <- factor(df$plate, levels=unique(df$plate))
df$b <- factor(df$well_type, levels=unique(df$well_type))
df$c <- factor(df$robot, levels=unique(df$robot))
df$d <- factor(df$stacker_position, levels=unique(df$stacker_position))

Boxplot of entire set of factors affecting strain values.

It can be seen that all factors are potentially impactful in strain value.

boxplot(value ~ plate * well_type * robot * stacker_position, data=df, 
        ylab="value", main="value, plate*well_type*robot*stacker_position")

Initial Observation of Values via Histogram

Analysis of Variance results show that, each value is being highly impacted by their position within the plate.

aov.result <- aov(value ~ robot*stacker_position, data=df)
aov.result
## Call:
##    aov(formula = value ~ robot * stacker_position, data = df)
## 
## Terms:
##                      robot stacker_position robot:stacker_position  Residuals
## Sum of Squares      132042       1850724747                 354018 1211119566
## Deg. of Freedom          2                1                      2       5754
## 
## Residual standard error: 458.7843
## Estimated effects may be unbalanced
print(summary(aov.result))
##                          Df    Sum Sq   Mean Sq  F value Pr(>F)    
## robot                     2 1.320e+05 6.602e+04    0.314  0.731    
## stacker_position          1 1.851e+09 1.851e+09 8792.749 <2e-16 ***
## robot:stacker_position    2 3.540e+05 1.770e+05    0.841  0.431    
## Residuals              5754 1.211e+09 2.105e+05                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

While, we know that, within each plate, some strains are naturally higher than others, which are parent strains and in each plate, parent strains are in the same spot. So, in order to remove this effect, within each plate, each strain will be subtracted from parent strains. Then, to remove any effect due to spatial positioning, spatial normalization will be applied.

Analysis of residuals

There is no unusual situation observed for residuals.

plotter_df <- data.frame(aov.result$residuals,
                         aov.result$fitted.values,
                         df$robot,
                         df$stacker_position)

ggplot(plotter_df, aes(x=aov.result$residuals, y=aov.result$fitted.values)) + geom_point()

ggplot(plotter_df, aes(x=df$robot, y=aov.result$residuals)) + geom_point()

ggplot(plotter_df, aes(x=df$stacker_position, y=aov.result$residuals)) + geom_point()

Changing values based on Robot type & stacker position

gg_1 <- ggplot(df, aes(x=value, y=stacker_position)) + 
  geom_point(aes(col=robot), size=0.4) + 
  geom_smooth(method="loess", se=F) +
  labs(subtitle="Changing values based on Robot type & stacker position", 
       y="stacker_position", 
       x="value", 
       title="Scatterplot")
plot(gg_1)
## `geom_smooth()` using formula 'y ~ x'

A process to take difference of each strain with respect to parent strains in each plate to identify raw difference

datalist = vector("list", length = 60)

for (i in 1:60){
  df_copy = df[(1+96*(i-1)):(96+96*(i-1)),]
  df_copy %>% add_column(max_parent_strain = NA)
  df_copy$max_parent_strain = max(df_copy$value[89:92])
  df_copy$iteration_id <- i
  datalist[[i]] <- df_copy
}
df_appended <- dplyr::bind_rows(datalist)

Process to subtract values of each plate from max strain of respective plate

df_appended %>% add_column(value_difference = NA)
## # A tibble: 5,760 x 14
##    counter plate strain well_type robot stacker_position  value a     b    
##      <dbl> <dbl> <chr>  <chr>     <chr>            <dbl>  <dbl> <fct> <fct>
##  1       1     1 X1     Standard… Bend…                1 10236. 1     Stan…
##  2       2     1 X2     Standard… Bend…                1 10370. 1     Stan…
##  3       3     1 X3     Standard… Bend…                1  9651. 1     Stan…
##  4       4     1 X4     Standard… Bend…                1 10172. 1     Stan…
##  5       5     1 X5     Standard… Bend…                1  9493. 1     Stan…
##  6       6     1 X6     Standard… Bend…                1 10216. 1     Stan…
##  7       7     1 X7     Standard… Bend…                1  9885. 1     Stan…
##  8       8     1 X8     Standard… Bend…                1 10450. 1     Stan…
##  9       9     1 X9     Standard… Bend…                1 10154. 1     Stan…
## 10      10     1 X10    Standard… Bend…                1 10616. 1     Stan…
## # … with 5,750 more rows, and 5 more variables: c <fct>, d <fct>,
## #   max_parent_strain <dbl>, iteration_id <int>, value_difference <lgl>
df_appended$value_difference = df_appended$value-df_appended$max_parent_strain

Removing parent strain values with mean value differences to remove unnecessary bias

for (i in 1:60){
  df_appended$value_difference[(89+96*(i-1)):(92+96*(i-1))] = mean(df_appended$value_difference[(1+96*(i-1)):(88+96*(i-1))])
  # Capturing the mean of child strains and changing parent strain values with those
}
head(df_appended)
## # A tibble: 6 x 14
##   counter plate strain well_type robot stacker_position  value a     b     c    
##     <dbl> <dbl> <chr>  <chr>     <chr>            <dbl>  <dbl> <fct> <fct> <fct>
## 1       1     1 X1     Standard… Bend…                1 10236. 1     Stan… Bend…
## 2       2     1 X2     Standard… Bend…                1 10370. 1     Stan… Bend…
## 3       3     1 X3     Standard… Bend…                1  9651. 1     Stan… Bend…
## 4       4     1 X4     Standard… Bend…                1 10172. 1     Stan… Bend…
## 5       5     1 X5     Standard… Bend…                1  9493. 1     Stan… Bend…
## 6       6     1 X6     Standard… Bend…                1 10216. 1     Stan… Bend…
## # … with 4 more variables: d <fct>, max_parent_strain <dbl>,
## #   iteration_id <int>, value_difference <dbl>

Building a linear regression model to understand which variables have impact in child strains’ values to eliminate variance caused by them in the next step.

Except the intercept, neither robot or stacker position are significant. As intercept refers to expected mean value of the response variable when Xs are 0, though these explanatory variables are never 0 and intercept is not rich to interpret on, in this case.

If we had built the model before subtracting parent strain maxes, we would have stacker position significant, but that does not mean anything since parent strains are always positioned in a specific spot and that causes bias.

model <- lm(df_appended$value_difference~df_appended$robot+
              df_appended$stacker_position)
summary(model)
## 
## Call:
## lm(formula = df_appended$value_difference ~ df_appended$robot + 
##     df_appended$stacker_position)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2550.29  -160.15    85.88   282.09  1668.17 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  -338.720     16.475 -20.559   <2e-16 ***
## df_appended$robotC3P0          20.992     16.058   1.307    0.191    
## df_appended$robotTerminator     5.026     16.058   0.313    0.754    
## df_appended$stacker_position    1.408      1.137   1.239    0.215    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 497.5 on 5756 degrees of freedom
## Multiple R-squared:  0.00059,    Adjusted R-squared:  6.915e-05 
## F-statistic: 1.133 on 3 and 5756 DF,  p-value: 0.3343

ANOVA for the updated model, which is cleaned from spatial positioning.

model_aov <- aov(df_appended$value_difference~df_appended$robot+
              df_appended$stacker_position)
summary(model)
## 
## Call:
## lm(formula = df_appended$value_difference ~ df_appended$robot + 
##     df_appended$stacker_position)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2550.29  -160.15    85.88   282.09  1668.17 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  -338.720     16.475 -20.559   <2e-16 ***
## df_appended$robotC3P0          20.992     16.058   1.307    0.191    
## df_appended$robotTerminator     5.026     16.058   0.313    0.754    
## df_appended$stacker_position    1.408      1.137   1.239    0.215    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 497.5 on 5756 degrees of freedom
## Multiple R-squared:  0.00059,    Adjusted R-squared:  6.915e-05 
## F-statistic: 1.133 on 3 and 5756 DF,  p-value: 0.3343

At this point, we can conduct B-Normalization directly since parent strains will not cause any sorts of bias and they are part of the cohort. While, in value_difference column, we store the target information: difference with respect to respective parent strain value. B-Score is a two way median polish to remove positional effects on the plate.

b_score(data=df_appended$value_difference[1:96],
        well = num_to_well(df_appended$counter[1:96]),
        plate = 96)
##    well     residual
## 1   A01    16.384657
## 2   A02   206.632405
## 3   A03  -249.285638
## 4   A04     7.241434
## 5   A05  -540.027634
## 6   A06    51.281109
## 7   A07   -14.657188
## 8   A08   196.112421
## 9   A09   -13.037946
## 10  A10   350.243587
## 11  A11  -662.762616
## 12  A12  -135.236455
## 13  B01   225.279860
## 14  B02     2.663684
## 15  B03    10.774985
## 16  B04  -238.210129
## 17  B05   -98.260884
## 18  B06  -197.256518
## 19  B07   -12.068298
## 20  B08  -201.903226
## 21  B09   528.306919
## 22  B10   204.134112
## 23  B11   128.071106
## 24  B12   135.236455
## 25  C01   196.796643
## 26  C02  -441.788462
## 27  C03   -10.774985
## 28  C04  -197.504341
## 29  C05   246.844605
## 30  C06  -224.912743
## 31  C07   229.617880
## 32  C08    59.515358
## 33  C09   251.759753
## 34  C10  -122.475117
## 35  C11    43.581559
## 36  C12  -316.989628
## 37  D01  -215.920513
## 38  D02    -2.663684
## 39  D03  -273.173966
## 40  D04   602.827998
## 41  D05  -499.019497
## 42  D06  -300.110976
## 43  D07    12.068298
## 44  D08    39.968774
## 45  D09  -796.877946
## 46  D10    13.160809
## 47  D11    95.068868
## 48  D12   154.720357
## 49  E01   -16.384657
## 50  E02     7.451727
## 51  E03  -109.218615
## 52  E04   235.812960
## 53  E05     5.421558
## 54  E06   -49.040856
## 55  E07   861.540667
## 56  E08    39.887829
## 57  E09   114.397300
## 58  E10   -13.160809
## 59  E11  -120.940642
## 60  E12  -198.491217
## 61  F01  -239.580604
## 62  F02  -197.362096
## 63  F03   648.024325
## 64  F04   636.723322
## 65  F05    52.708761
## 66  F06   856.412034
## 67  F07  -475.413503
## 68  F08  -650.882091
## 69  F09  -388.342542
## 70  F10   193.474687
## 71  F11   -43.581559
## 72  F12   398.473252
## 73  G01  -174.217252
## 74  G02  -110.694082
## 75  G03   283.589769
## 76  G04    -7.241434
## 77  G05    -5.421558
## 78  G06   166.258581
## 79  G07  -138.921576
## 80  G08  -162.634102
## 81  G09    13.037946
## 82  G10  -735.784505
## 83  G11   235.059741
## 84  G12   156.761265
## 85  H01   593.373654
## 86  H02   270.272852
## 87  H03    50.887169
## 88  H04  -414.321456
## 89  H05   180.838403
## 90  H06    49.040856
## 91  H07   315.072755
## 92  H08   -39.887829
## 93  H09 -1699.965000
## 94  H10 -2296.949137
## 95  H11 -2035.297894
## 96  H12 -1760.593347
b_map(data=df_appended$value_difference[1:96],
        well = num_to_well(df_appended$counter[1:96]),
        plate = 96)

Visualizing hit detection from sample plates.

"Plate #1"
## [1] "Plate #1"
bhit_map(data = df_appended$value_difference[1:96], 
         well = num_to_well(df_appended$counter[1:96]),
         plate = 96, threshold = 1)

"Plate #21"
## [1] "Plate #21"
bhit_map(data = df_appended$value_difference[(1+96*(21-1)):(96+96*(21-1))], 
         well = num_to_well(df_appended$counter[(1+96*(21-1)):(96+96*(21-1))-96*(21-1)]),
         plate = 96, threshold = 1)

"Plate #41"
## [1] "Plate #41"
bhit_map(data = df_appended$value_difference[(1+96*(41-1)):(96+96*(41-1))], 
         well = num_to_well(df_appended$counter[(1+96*(41-1)):(96+96*(41-1))-96*(41-1)]),
         plate = 96, threshold = 1)

Histogram of Value Differences with respect to plate’s max parent strain

hist(df_appended$value_difference, breaks = 50)

Plot of changing value_differences based on Robot type.

gg <- ggplot(df_appended, aes(x=value_difference, y=stacker_position)) + 
  geom_point(aes(col=robot), size=0.4) + 
  geom_smooth(method="loess", se=F) +
  labs(subtitle="Changing value_differences based on Robot type", 
       y="stacker_position", 
       x="value_difference", 
       title="Scatterplot") 

plot(gg)
## `geom_smooth()` using formula 'y ~ x'

20 HITS

hit_list = vector("list", length = 60)
for (i in 1:60){
  
  hit_map = bhit_map(data = df_appended$value_difference[(1+96*(i-1)):(96+96*(i-1))], 
                     well = num_to_well(df_appended$counter[(1+96*(i-1)):(96+96*(i-1))-96*(i-1)]),
                     plate = 96, threshold = 2.43)
  
  x=data.frame(hit_map$data)[hit_map$data$hit=='hit','well']
  hit_list[[i]] <- well_to_num(x)
  hit_list[[i]] <- hit_list[[i]]+96*(i-1)
}
hits_20 <- unlist(hit_list)
hits_20
##  [1]  992 1016 2067 2219 2251 2290 2524 2703 2854 3100 3301 3544 3636 3794 3850
## [16] 4453 4549 4924 5433 5508
hist(hits_20, breaks = 5)

20 HITS, Visualization

df_hits_20 = df_appended[hits_20,]

plot(ggplot(df_hits_20, aes(x=value_difference, y=stacker_position)) + 
  geom_point(aes(col=robot), size=6) +
  labs(subtitle="Changing value_differences based on Robot type", 
       y="stacker_position", 
       x="value_difference", 
       title="20 Hits Distribution"))

60 HITS

hit_list = vector("list", length = 60)
for (i in 1:60){
  
  hit_map = bhit_map(data = df_appended$value_difference[(1+96*(i-1)):(96+96*(i-1))], 
                     well = num_to_well(df_appended$counter[(1+96*(i-1)):(96+96*(i-1))-96*(i-1)]),
                     plate = 96, threshold = 1.78)
  
  x=data.frame(hit_map$data)[hit_map$data$hit=='hit','well']
  hit_list[[i]] <- well_to_num(x)
  hit_list[[i]] <- hit_list[[i]]+96*(i-1)
}

hits_60 <- unlist(hit_list)
hits_60
##  [1]   55   66  182  347  390  467  832  914  992 1016 1045 1095 1118 1381 1761
## [16] 1815 1857 2067 2219 2220 2251 2290 2293 2372 2524 2640 2703 2845 2854 2872
## [31] 2901 2966 2985 3100 3301 3416 3520 3521 3544 3606 3636 3794 3850 4447 4453
## [46] 4475 4549 4917 4924 5001 5039 5111 5417 5433 5502 5508 5560 5593 5717
hist(hits_20, breaks = 10)

60 HITS, Visualization

df_hits_60 = df_appended[hits_60,]
plot(ggplot(df_hits_60, aes(x=value_difference, y=stacker_position)) + 
       geom_point(aes(col=robot), size=6) +
       labs(subtitle="Changing value_differences based on Robot type", 
            y="stacker_position", 
            x="value_difference", 
            title="60 Hits Distribution"))

180 HITS

hit_list = vector("list", length = 60)
for (i in 1:60){
  
  hit_map = bhit_map(data = df_appended$value_difference[(1+96*(i-1)):(96+96*(i-1))], 
                     well = num_to_well(df_appended$counter[(1+96*(i-1)):(96+96*(i-1))-96*(i-1)]),
                     plate = 96, threshold = 1.325)
  
  x=data.frame(hit_map$data)[hit_map$data$hit=='hit','well']
  hit_list[[i]] <- well_to_num(x)
  hit_list[[i]] <- hit_list[[i]]+96*(i-1)
}

hits_180 <- unlist(hit_list)
hits_180
##   [1]   40   55   63   64   66   85  105  115  146  182  293  297  299  338  347
##  [16]  390  467  468  483  527  539  566  772  832  914  919  974  985  992  994
##  [31] 1006 1016 1045 1048 1095 1118 1135 1164 1212 1292 1381 1554 1562 1717 1761
##  [46] 1815 1828 1844 1857 1882 1904 1949 1956 2046 2067 2177 2219 2220 2251 2269
##  [61] 2290 2293 2332 2337 2372 2524 2545 2640 2674 2703 2728 2756 2845 2854 2869
##  [76] 2872 2889 2901 2924 2962 2963 2964 2966 2985 3000 3012 3044 3045 3081 3094
##  [91] 3100 3111 3181 3195 3243 3287 3301 3416 3422 3448 3503 3510 3520 3521 3535
## [106] 3538 3539 3544 3606 3636 3654 3794 3804 3850 3892 3898 4022 4135 4226 4287
## [121] 4300 4306 4321 4325 4350 4438 4447 4453 4454 4475 4481 4493 4549 4575 4635
## [136] 4712 4718 4787 4790 4836 4917 4924 5001 5039 5045 5080 5105 5111 5165 5174
## [151] 5225 5229 5264 5300 5320 5337 5360 5362 5392 5414 5417 5433 5434 5443 5458
## [166] 5459 5463 5495 5498 5502 5508 5535 5541 5560 5571 5576 5593 5645 5714 5717
hist(hits_180, breaks = 10)

180 HITS, Visualization

df_hits_180 = df_appended[hits_180,]
plot(ggplot(df_hits_180, aes(x=value_difference, y=stacker_position)) + 
       geom_point(aes(col=robot), size=4) +
       labs(subtitle="Changing value_differences based on Robot type", 
            y="stacker_position", 
            x="value_difference", 
            title="180 Hits Distribution"))