This article documents the statistical analysis performed on the yield data. The analysis has two parts: the first summarizes the data, and the second compares several models used to predict the yield. We use three classes of models: linear regression, KNN regression, and gradient boosted models. We also compare the Raw Yield against a Lyophilized Yield (a benchmark adjusted for temperature) and against the Aqueous benchmark.

## **Data Summary**
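In this section, I analyze how well different dependent variables can be predicted. For example, consider the setting where Temperature is -80 °C and Trehalose is 30. There were 4 replicates, and the Raw yield was different in each of them. This difference is attributed to experimental error. A model that sees only the experimental settings must give the same prediction for every replicate of a setting, so it cannot explain this variation. The theoretical limit is the best fit achievable under that constraint, and in that sense it shows the level of experimental error.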
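One way to make the theoretical limit concrete is to predict every replicate by the mean of its own setting and score that prediction. The sketch below assumes this is how the limit is defined; the function name `replicate_limit` and the toy numbers are illustrative and are not taken from the yield data.

```python
from collections import defaultdict
from math import sqrt


def replicate_limit(rows):
    """Score the 'predict the setting mean' rule on replicated data.

    rows: iterable of (setting, yield) pairs, where `setting` is any
    hashable key such as (temperature, trehalose).
    Returns (r_squared, rmse).
    """
    rows = list(rows)

    # Collect the replicate yields observed under each setting.
    groups = defaultdict(list)
    for setting, y in rows:
        groups[setting].append(y)

    # The best constant prediction per setting is the replicate mean.
    means = {s: sum(v) / len(v) for s, v in groups.items()}
    grand_mean = sum(y for _, y in rows) / len(rows)

    # Residual variation is pure replicate-to-replicate scatter.
    ss_res = sum((y - means[s]) ** 2 for s, y in rows)
    ss_tot = sum((y - grand_mean) ** 2 for _, y in rows)

    r_squared = 1.0 - ss_res / ss_tot
    rmse = sqrt(ss_res / len(rows))
    return r_squared, rmse


# Toy data (made up): two settings with two replicates each.
toy_rows = [
    ((-80, 30), 10.0), ((-80, 30), 12.0),
    ((-20, 10), 30.0), ((-20, 10), 32.0),
]
r2, rmse = replicate_limit(toy_rows)
print(f"R^2 limit = {r2:.4f}, RMSE limit = {rmse:.2f}")
```

Under this assumption, any model that uses only the settings as inputs cannot score better on these rows than the values returned here, which is why they are a natural yardstick for the model comparison.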
This plot shows that the theoretical limit of R^2 is approximately 96%. I also like that the intercept is 0 and the slope is 1, which means that there is no bias.
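The bias check can be reproduced with an ordinary least-squares line through the (predicted, observed) pairs. This is a minimal sketch that assumes the plot regresses observed values on predicted values; `calibration_line` and the toy numbers are hypothetical.

```python
def calibration_line(predicted, observed):
    """Least-squares fit of observed = intercept + slope * predicted."""
    n = len(predicted)
    mean_x = sum(predicted) / n
    mean_y = sum(observed) / n

    # Sum of squares of x and cross-products of x and y around the means.
    sxx = sum((x - mean_x) ** 2 for x in predicted)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(predicted, observed))

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Toy data (made up): predictions that track the observations closely.
predicted = [1.0, 2.0, 3.0, 4.0]
observed = [1.1, 1.9, 3.1, 3.9]
intercept, slope = calibration_line(predicted, observed)
print(f"intercept = {intercept:.2f}, slope = {slope:.2f}")
```

An intercept near 0 together with a slope near 1 means the fitted line coincides with the identity line, so predictions are neither systematically shifted nor systematically stretched.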
Based on the two plots above, there is not much difference between predicting relative and raw yields. Note that Figure 2 would be the same if the benchmark were set to Aqueous.
Lastly, note that the best achievable RMSE is 0.09 for relative yield and 33.6 for raw yield. These numbers will serve as a benchmark when deciding which model to use.
To investigate further, I divided the data into two pieces: the piece where raw yield is above the benchmark (Lyophilized) and the piece where it is not. I repeat the previous step using only the first piece of data.
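The split itself is a simple row filter. The sketch below assumes each experiment is stored as a record with a raw yield and one column per benchmark; the field names `raw_yield`, `lyo_benchmark`, and `aqueous_benchmark` are hypothetical, as are the values.

```python
def split_by_benchmark(records, benchmark_key):
    """Split records into (above, not_above) relative to one benchmark."""
    above, not_above = [], []
    for rec in records:
        # "Above" is strict: ties go to the second piece.
        if rec["raw_yield"] > rec[benchmark_key]:
            above.append(rec)
        else:
            not_above.append(rec)
    return above, not_above


# Toy records (made up).
records = [
    {"raw_yield": 510.0, "lyo_benchmark": 480.0, "aqueous_benchmark": 500.0},
    {"raw_yield": 450.0, "lyo_benchmark": 470.0, "aqueous_benchmark": 440.0},
    {"raw_yield": 430.0, "lyo_benchmark": 460.0, "aqueous_benchmark": 455.0},
]

above_lyo, below_lyo = split_by_benchmark(records, "lyo_benchmark")
above_aq, below_aq = split_by_benchmark(records, "aqueous_benchmark")
print(len(above_lyo), len(below_lyo), len(above_aq), len(below_aq))
```

Passing the benchmark column as a parameter lets the same function produce both the Lyophilized and the Aqueous split, so the replicate analysis can be rerun on each "above" piece without duplicating code.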
This plot is concerning. For reasons that I do not understand, the theoretical limit decreases to 87%. That is, there was more experimental error in the experiments where Raw yield beat the Lyophilized benchmark. More importantly, the estimates appear to be biased: the intercept is 0.04 and the slope is 0.91, rather than 0 and 1. This plot implies that the Lyophilized benchmark may not be appropriate.
This plot is more in line with what I expected. The theoretical limit does not change from before. The intercept of 3 is very low relative to the average of 483 (about 0.6%), which means that the fit is effectively unbiased. The slope of 1 supports the same conclusion. This plot implies that the Aqueous benchmark may be appropriate.