DAT301 - Project 1

2023-09-24

ELISA Lot and Run Analysis

Analysis steps based on set up from Bio415 Statistical Modeling for Biology
ELISA Validation dataset like would be used for validation.
Null Hypothesis There will be no statistical difference in the optical density between the different lots or runs
Alternative Hypothesis There will be a statistical difference
Significance Level alpha = 0.05
From NIST Education & Training: Data Sets https://www.itl.nist.gov/div898/education/datasets.htm#anova

Read in Data

# Read in the data
elisa_data <- read_csv(
  "ELISA_Validation_Test_Data.csv", show_col_types = FALSE)

# Lot & Run read in as numerical values, 
# Convert the two columns into factor data
elisa_data$Lot <- as.factor(elisa_data$Lot)
elisa_data$Run <- as.factor(elisa_data$Run)

Statistical Analysis of Lots:

Evaluate the 5 different lots optical density results to see if there is a statistical difference between them.

Test Assumptions - Equal Variance & Normal Distribution

tapply() used to graph a stripchart

Levene’s Test result:

p-value = 0.2931 is greater than significance level 0.05
Therefore: Equal Variance seen across the lots

# Fit a linear model.
lot_model <- lm(Density ~ Lot, data = elisa_data)
# Residuals from the fitted linear model.  
elisa_data$res <- residuals(lot_model)
# Levene's Test for equal variance
leveneTest(res ~ Lot, data = elisa_data)

Levene's Test for Homogeneity of Variance (center = median)
      Df F value Pr(>F)
group  4  1.2622 0.2931
      70

QQPlot

# Normal Probability plot
qqnorm(elisa_data$res, pch = 1, col = "steelblue", frame = FALSE)
qqline(elisa_data$res, col = "darkred", lwd = 2)

Shapiro-Wilk’s Test result:

p-value = 0.0001839 is less than significance level 0.05
Therefore: Normal Distribution is not seen in the data

  # Shapiro-Wilk test for normal distribution
shapiro.test(elisa_data$res)

    Shapiro-Wilk normality test

data:  elisa_data$res
W = 0.92148, p-value = 0.0001839

Histogram:

Data does show a slight skew. While ANOVA can except data that is slightly off of normal distribution, a transformation to more normal would be better to prevent false positives.

Data Transformation

The data is not normally distributed but does show equal variance. Based on the skew, a sqrt transformation attempted to correct distribution

# Transform Data with sqrt transformation
elisa_data$Density_sqrt <- sqrt(elisa_data$Density)
# Fit a linear model on transformed data
lot_model_sqrt<- lm(Density_sqrt ~ Lot, data = elisa_data)
# Residuals from the fitted linear model of transformed data
elisa_data$res_sqrt <- residuals(lot_model_sqrt)

Levene’s Test result for Transformed data:

p-value = 0.2575 is greater than significance level 0.05
Therefore: Equal Variance still seen across the lots

# Levene's Test for equal variance for transformed data
leveneTest(res_sqrt ~ Lot, data = elisa_data)

Levene's Test for Homogeneity of Variance (center = median)
      Df F value Pr(>F)
group  4  1.3577 0.2575
      70

Shapiro-Wilk’s Test result for Transformed data:

p-value = 0.0003208 is less than significance level 0.05
Therefore: Normal Distribution is not seen in the transformed data.

# Shapiro-Wilk test for normal distribution of Transformed data
shapiro.test(elisa_data$res_sqrt)

    Shapiro-Wilk normality test

data:  elisa_data$res_sqrt
W = 0.92665, p-value = 0.0003208

Non-Parametric Test is indicated due to failed Shapiro-Wilk test for all attempted transformations.

_ Sqrt
- Log10
- Log2
- Turkey
- Box Cox Lambda 2
- Box Cox Lambda 6

Non-Parametric test - Kruskal-Wallis Test

p-value = 0.001793 is less than significance level 0.05.
Therefor it can be concluded that there is significant difference between the lots

# Kruskal-Wallis test 
kruskal.test(Density ~ Lot, data = elisa_data)

    Kruskal-Wallis rank sum test

data:  Density by Lot
Kruskal-Wallis chi-squared = 17.168, df = 4, p-value = 0.001793

TukeyHSD

  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = x)

$Lot
           diff         lwr          upr     p adj
2-1  0.10620000 -0.06300149  0.275401486 0.4063682
3-1  0.06073333 -0.10846815  0.229934820 0.8521828
4-1  0.03560000 -0.13360149  0.204801486 0.9762779
5-1 -0.11686667 -0.28606815  0.052334820 0.3093379
3-2 -0.04546667 -0.21466815  0.123734820 0.9430949
4-2 -0.07060000 -0.23980149  0.098601486 0.7691944
5-2 -0.22306667 -0.39226815 -0.053865180 0.0038869
4-3 -0.02513333 -0.19433482  0.144068153 0.9935914
5-3 -0.17760000 -0.34680149 -0.008398514 0.0349390
5-4 -0.15246667 -0.32166815  0.016734820 0.0970768

Proposed question:

Is there a difference in the runs that is possibly at fault of the lot issue? Run the same analysis set up - but using “Run” in place of “Lot”

Test Assumptions - Equal Variance & Normal Distribution

tapply() used to graph a stripchart

Levene’s Test result for Run:

p-value = 0.332 is greater than significance level 0.05
Therefore: Equal Variance still seen across the runs

# Fit a linear model.
run_model <- lm(Density ~ Run, data = elisa_data)
# Residuals from the fitted linear model.  
elisa_data$res2 <- residuals(run_model)
# Levene's Test for equal variance
leveneTest(res2 ~ Run, data = elisa_data)

Levene's Test for Homogeneity of Variance (center = median)
      Df F value Pr(>F)
group  4   1.169  0.332
      70

QQ Plot

# Normal Probability plot for transformed data
qqnorm(elisa_data$res2, pch = 1, col = "steelblue", frame = FALSE)
qqline(elisa_data$res2, col = "darkred", lwd = 2)

Shapiro-Wilk’s Test result:

p-value = 0.005266 is less than significance level 0.05
Therefore: Normal Distribution is not seen in the data.

  # Shapiro-Wilk test for normal distribution
shapiro.test(elisa_data$res2)

    Shapiro-Wilk normality test

data:  elisa_data$res2
W = 0.95043, p-value = 0.005266

Non-Parametric test - Kruskal-Wallis Test

p-value = 0.1015 is greater than significance level 0.05.
Therefor it can be concluded that there is not a significant difference between the runs.

# Kruskal-Wallis test 
kruskal.test(Density ~ Run, data = elisa_data)

    Kruskal-Wallis rank sum test

data:  Density by Run
Kruskal-Wallis chi-squared = 7.7415, df = 4, p-value = 0.1015

TukeyHSD

  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = x)

$Run
           diff         lwr          upr     p adj
2-1 -0.18000000 -0.35443685 -0.005563149 0.0397894
3-1 -0.08320000 -0.25763685  0.091236851 0.6701621
4-1 -0.02393333 -0.19837018  0.150503517 0.9952772
5-1 -0.03453333 -0.20897018  0.139903517 0.9810486
3-2  0.09680000 -0.07763685  0.271236851 0.5316030
4-2  0.15606667 -0.01837018  0.330503517 0.1010917
5-2  0.14546667 -0.02897018  0.319903517 0.1461686
4-3  0.05926667 -0.11517018  0.233703517 0.8755777
5-3  0.04866667 -0.12577018  0.223103517 0.9352236
5-4 -0.01060000 -0.18503685  0.163836851 0.9998074

Boxplot of Run data to visualise Tukey Results

The Box plot of the runs shows the outlier in run one and the elongated range of data in run two in comparison to the other runs.

Final Graphs for ELISA Dataset Analysis

Statistcal Representation of Lots

Statistics <- grobTree(textGrob(
  "Kruskal-Wallis: chi-squared = 17.168, df = 4, p-value = 0.001793",
  x=0.01,  y=0.985, hjust=0,
  gp=gpar(col="black", fontsize=10, fontface="italic")))

#Violin plots of Optical Density by Lot
Lots_Plot <- ggplot(data = elisa_data, 
        aes(x = Lot, y = Density, fill = Lot)) + 
  geom_violin(trim=FALSE) + theme_classic() +
  geom_boxplot(width=0.1, color = "black", fill = "darkred")+
  scale_fill_manual(values=c("steelblue", "darkblue", 
                    "#56B4E9", "lightblue", "mediumblue"))+
  theme(legend.position="none")+
  labs(title = "Violin Plot of Lot Optical Densities") +
  annotation_custom(Statistics)

Statistcal Representation of Runs

Statistics2 <- grobTree( textGrob(
 "Kruskal-Wallis: chi-squared = 7.7415, df = 4, p-value = 0.1015", 
           x=0.01, y=0.985, hjust=0,
  gp=gpar(col="black", fontsize=10, fontface="italic")))

#Violin plots of Optical Density by Run
Runs_Plot <- ggplot(data = elisa_data, 
                    aes(x = Run, y = Density, fill = Run)) + 
  geom_violin(trim=FALSE) + theme_classic() +
  geom_boxplot(width=0.1, color = "black", fill = "darkred")+
  scale_fill_manual(values=c("chartreuse4", "darkgreen", 
         "darkolivegreen", "darkseagreen4", "forestgreen"))+
  theme(legend.position="none")+
  labs(title = "Violin Plot of Run Optical Densities") +
  annotation_custom(Statistics2)