1 Introduction

Coping with a large volume of categorical and continuous data in the high-throughput phenotyping pipelines such as Inetrnational Mouse Phenotyping Consortium (IMPC) requires a robust automated statistical pipeline with minimal manual intervention.

OpenStats provides a variety of the statistical workflows for the identification of abnormal phenotypes with an emphasize on high-throughput data generating process. The embedded set of checks and on-fly fixes assures accurate and robust analyses for high-throughput data and a comprehensive output allows communicating the results to the downstream processes.

1.1 Install the package

OpenStats is fully written in R and can be installed using the standard install.package() command or directly from Github,

2 Package architecture

OpenStats follows a four-layer architecture namely, model and data input, preprocessing and data preparation, statistical analysis and reports. This is shown in the figure below. More details can be found in the appropriate sections of the User’s Guid.

The first layer allows a dataset for the input and an input model. In the second layer, the input data, as well as the model, are checked for feasibility and correctness and OpenStats reports any abnormality in the data or the input model. The successful output of this stage is an OpenStatsList object that is passed to the downstream layers.

The statistical analysis layer encompasses three analysis framework precisely, Linear Mixed Model (MM) and Reference range plus (RR) for the continuous and the Fisher’s exact test (FE) for the categorical data. They are extra checks and on-fly fixes in this stage to assure a successful, correct and comprehensive analysis is applied to the data. More details about this layer are provided in the corresponded section.

The final layer provides feed for the downstream processes and consists of summaries, plots, lists and JSON objects.

2.1 Input data and model

OpenStats allows a data frame for the input. The input model must be in the standard R formula format of y~x+z+t+.... No function is allowed in the formula. For instance, log(y)~x+z+t+.. or y~x^2+z+t+... are invalid inputs. The model terms must be in the data otherwise will be removed. Depending on the chosen framework, MM allows the continuous covariates on the right hand side of the model whereas RR and FE only allow the categorical variable for covariates.

2.2 Data preparation with OpenStatsList function

OpenStatsList function performs data processing and creates an OpenStats object. This function allows a data frame as the input so that the rows and columns represent the samples and features respectively. In addition to the dependent variable column that is mandatory, “Genotype” or the corresponded proxy is required. Note that any other name for “Genotype” is accepted however the OpenStatsList function renames it to “Genotype”. To comply with the function structure in PhenStat, the predecessor of OpenStats, the function allows optional Sex, Batch, Weight (bodyweight) and LifeStage in the input parameters.

The main tasks performed by the OpenStats package’s function OpenStatsList are:

  • Preparing a working dataset for the downstream operations
  • terminology normalisation
  • showing a general view of the dataset
  • filtering out undesirable records from the input model (however, the original dataset is always preserved)
  • checking whether the dataset complies with the requirements in the designed statistical frameworks in OpenStats

All checks are accompanied by informative messages, warnings and errors. One example of the OpenStatsList is shown below:

## 2020-04-21 15:48:55. Input data of the dimensions, rows = 410, columns = 75
## 2020-04-21 15:48:55. Checking the input data in progress ...
## 2020-04-21 15:48:55. Checking the specified missing values [x2] (` `, ``) ...
## 2020-04-21 15:48:55.     1/2. Checking (` `) ...
## 2020-04-21 15:48:55.     2/2. Checking (``) ...
## 2020-04-21 15:48:55. Checking whether variable `biological_sample_group` exists in the data ... 
## 2020-04-21 15:48:55.     Result = TRUE
## 2020-04-21 15:48:55.      Levels (Total levels = 2, missings = 0%): 
## 2020-04-21 15:48:55.       1. control
## 2020-04-21 15:48:55.       2. experimental
## 2020-04-21 15:48:55. Checking whether variable `sex` exists in the data ... 
## 2020-04-21 15:48:55.     Result = TRUE
## 2020-04-21 15:48:55.      Levels (Total levels = 2, missings = 0%): 
## 2020-04-21 15:48:55.       1. female
## 2020-04-21 15:48:55.       2. male
## 2020-04-21 15:48:55. Checking whether variable `date_of_experiment` exists in the data ... 
## 2020-04-21 15:48:55.     Result = TRUE
## 2020-04-21 15:48:55.      Levels (Total levels = 43, missings = 0%): 
## 2020-04-21 15:48:55.       1. 2012-07-23T00:00:00Z
## 2020-04-21 15:48:55.       2. 2012-07-30T00:00:00Z
## 2020-04-21 15:48:55.       3. 2012-08-06T00:00:00Z
## 2020-04-21 15:48:55.       4. 2012-08-13T00:00:00Z
## 2020-04-21 15:48:55.       5. 2012-08-20T00:00:00Z
## 2020-04-21 15:48:55.       6. 2012-11-26T00:00:00Z
## 2020-04-21 15:48:55.       7. 2012-12-24T00:00:00Z
## 2020-04-21 15:48:55.       8. 2013-01-02T00:00:00Z
## 2020-04-21 15:48:55.       9. 2013-01-15T00:00:00Z
## 2020-04-21 15:48:55.       10. 201 ...
## 2020-04-21 15:48:55. Checking whether variable `LifeStage` exists in the data ... 
## 2020-04-21 15:48:55.     Result = FALSE
## 2020-04-21 15:48:55. Checking whether variable `weight` exists in the data ... 
## 2020-04-21 15:48:55.     Result = TRUE
## 2020-04-21 15:48:55.      Summary:
## 2020-04-21 15:48:55.       mean      = 20.0036585365854
## 2020-04-21 15:48:55.       sd        = 2.63972182732584
## 2020-04-21 15:48:55.       Missings  = 0%
## 2020-04-21 15:48:55. Variable `biological_sample_group` renamed to `Genotype`
## 2020-04-21 15:48:55. Variable `sex` renamed to `Sex`
## 2020-04-21 15:48:55. Variable `date_of_experiment` renamed to `Batch`
## 2020-04-21 15:48:55. Variable `weight` renamed to `Weight`
## 2020-04-21 15:48:55. Total samples in Genotype:Sex interactions: 
## 2020-04-21 15:48:55.      Level(frequency): 
## 2020-04-21 15:48:55.       1. control.Female(196)
## 2020-04-21 15:48:55.       2. experimental.Female(5)
## 2020-04-21 15:48:55.       3. control.Male(201)
## 2020-04-21 15:48:55.       4. experimental.Male(8)
## 2020-04-21 15:48:55. Total `Weight` data points for Genotype:Sex interactions: 
## 2020-04-21 15:48:55.      Level(frequency): 
## 2020-04-21 15:48:55.       1. control.Female(196),
## 2020-04-21 15:48:55.       2. experimental.Female(5),
## 2020-04-21 15:48:55.       3. control.Male(201),
## 2020-04-21 15:48:55.       4. experimental.Male(8)
## 2020-04-21 15:48:55. Successfully performed checks in 0.1 second(s).

2.2.1 Terminology Normalisation

We define “Terminology Normalisation (TN)” as the terminology used to describe variables that are essential or commonly used in the statistical analysis of the phenotypic data. The OpenStats package uses the following nomenclature for the names of columns: “Sex”, “Genotype”, “Batch”, “LifeStage” and “Weight” (bodyweight). Besides, unless they are specified in the input of the function, the expected levels for

  • Sex are “Male” and “Female”
  • LifeStage are “Early” and “Late”
  • and missings are NA or `` (Null string).

Note that the essential variable Genotype must be specified by the user under dataset.colname.genotype and the levels testGenotype and refGenotype. The other arguments are optional. See the example below where the only Genotype is specified and the other parameters left blank:

## 2020-04-21 15:48:56. Input data of the dimensions, rows = 410, columns = 75
## 2020-04-21 15:48:56. Checking the input data in progress ...
## 2020-04-21 15:48:56. Checking the specified missing values [x2] (` `, ``) ...
## 2020-04-21 15:48:56.     1/2. Checking (` `) ...
## 2020-04-21 15:48:56.     2/2. Checking (``) ...
## 2020-04-21 15:48:56. Checking whether variable `biological_sample_group` exists in the data ... 
## 2020-04-21 15:48:56.     Result = TRUE
## 2020-04-21 15:48:56.      Levels (Total levels = 2, missings = 0%): 
## 2020-04-21 15:48:56.       1. control
## 2020-04-21 15:48:56.       2. experimental
## 2020-04-21 15:48:56. Variable `biological_sample_group` renamed to `Genotype`
## 2020-04-21 15:48:56. Total samples in Genotype: 
## 2020-04-21 15:48:56.      Level(frequency): 
## 2020-04-21 15:48:56.       1. control(397)
## 2020-04-21 15:48:56.       2. experimental(13)
## 2020-04-21 15:48:56. Successfully performed checks in 0.06 second(s).

OpenStatsList function makes a copy of the original dataset and then uses internal definitions to map columns and values from user’s naming system into the package’s nomenclature. The original dataset stays unchanged throughout the entire processes.

2.2.2 PhenList Object

The output of OpenStatsList function is an OpenStats object that contains all the input parameters (including the input dataset) to the function plus the checked dataset. The latter is the input to the standard analysis however, one can create custom analyses that apply to the input dataset (not the checked dataset).

## The structure of an OpenStats:
 1 .  datasetPL   
 2 .  refGenotype   
 3 .  testGenotype   
 4 .  hemiGenotype   
 5 .  clean.dataset   
 6 .  dataset.colname.genotype   
 7 .  dataset.colname.sex   
 8 .  dataset.colname.batch   
 9 .  dataset.colname.lifestage   
 10 .  dataset.colname.weight   
 11 .  dataset.values.missingValue   
 12 .  dataset.values.male   
 13 .  dataset.values.female   
 14 .  dataset.values.early   
 15 .  dataset.values.late   
 16 .  datasetUNF   

where the first element (datasetPL) encapsulate the checked dataset and the last one (datasetUNF) is the untouched input data.

2.2.3 Plot and summary of an OpenStats object

The standard plot and summary function can be applied to an OpenStats object. Below shows two examples of continues and categorical data.

## 2020-04-21 15:48:56. Working on the summary table ...

Data Frame Summary

Dimensions: 410 x 4
Duplicates: 0

No Variable Stats / Values Freqs (% of Valid) Graph Valid Missing
1 data_point [numeric] Mean (sd) : 11 (1.7) min < med < max: 7 < 10.9 < 16.7 IQR (CV) : 2.6 (0.2) 408 distinct values
  : .
: : : :
. : : : : .
: : : : : . : : : : : : :
410 (100%) 0 (0%)
2 Genotype [factor]
  1. control
  2. experimental
397 (96.8%) 13 ( 3.2%) IIIIIIIIIIIIIIIIIII 410 (100%) 0 (0%)
3 Sex [factor]
  1. Female
  2. Male
201 (49.0%) 209 (51.0%) IIIIIIIII IIIIIIIIII 410 (100%) 0 (0%)
4 Batch [factor]
  1. 2012-07-23T00:00:00Z
  2. 2012-07-30T00:00:00Z
  3. 2012-08-06T00:00:00Z
  4. 2012-08-13T00:00:00Z
  5. 2012-08-20T00:00:00Z
  6. 2012-11-26T00:00:00Z
  7. 2012-12-24T00:00:00Z
  8. 2013-01-02T00:00:00Z
  9. 2013-01-15T00:00:00Z
  10. 2013-01-21T00:00:00Z [ 33 others ]
10 ( 2.4%) 10 ( 2.4%) 10 ( 2.4%) 10 ( 2.4%) 9 ( 2.2%) 10 ( 2.4%) 10 ( 2.4%) 10 ( 2.4%) 9 ( 2.2%) 4 ( 1.0%) 318 (77.6%) IIIIIIIIIIIIIII 410 (100%) 0 (0%)
## 2020-04-21 15:48:56. Working on the plot ...

2.3 Statistical Analysis

The OpenStats package encompasses three frameworks for statistical analysis of the phenotypic data: Linear Mixed Models (MM) and Reference Range method (RR) for continuous data, Fisher’s Exact Test (FE) for categorical data, see here for a short description of each method. For all frameworks, the implementation allows custom modelling of the input data.

2.3.1 Manager for Analysis Methods – OpenStatsAnalysis function

OpenStack’s function OpenStatsAnalysis works as a manager for the different statistical analyses methods. It checks the dependent variable, runs the selected statistical analysis framework and returns modelling/testing results in either OpenStatsMM, OpenStatsRR and OpenStatsFE object.

2.3.1.1 Function Arguments

Except for the input data and the framework (MM, RR or FE), the input arguments in OpenStatsAnalysis prefixed by the framework name for instance MM_fixed only applies to the Linear Mixed Model framework and so are RR_formula and FE_formula. Some arguments apply in more than one framework. These arguments are prefixed by the concatenation of frameworks such as FERR that applies to Fisher’s exact test and the Reference range plus frameworks.

2.3.2 Linear mixed model framework

The linear mixed model framework consists of routines to apply a linear mixed model to the data. The input fix effects term (MM_fixed) must be specified in the standard R formula for instance, y~Genotype+Sex+Genotype:Sex + Weight that defines the following statistical model

\[ y = intercept + \beta_1 Genotype + \beta_2 Sex + \beta_3 Genotype\times Sex + \beta_4 Weight + e \] where \(\beta_i, i=1,2,3,4\) are unknown coefficients and the \(e\) is the normaly distributed error.

Random effect term (MM_random): This is an optional one-sided formula of the form of ~XX|YY. For example, ~1|Batch represents a random intercept model.

Weight effect (MM_weight): Not been confused with the body weight, it is a part of estimating \(\beta_i\)’s, see here for simple discription of weights in the linear models. It is an optional varFunc (more details can be found in the R manual for lme function) object or one-sided formula describing the within-group heteroscedasticity structure. If given as a formula, it is used as the argument to varFixed, corresponding to fixed variance weights. See the documentation on varClasses for a description of the available varFunc classes. If not specified/set to NULL then the default setupvarIdent(form = ~ 1 | LifeStage) or varIdent(form = ~ 1 | Genotype) given LifeStage included in the input data applies. The former assumes the same variation within the levels of LifeStage/Genotype.

Setting the fixed and/or random effect as well as weight effect is enough to run a linear mixed model however OpenStats is capable of applying a model optimisation to the model. Below we explain the optimisation arguments and how it works.

2.3.2.1 Model assessment

OpenStats checks the input model and the data for the erroneous scenarios listed below

  1. The model terms (MM_fixed/MM_random) does not exist in the input data.
  2. OpenStats removes any single level factor from the fixed-effect model (MM_fixed). Example, Sex with only Female values
  3. OpenStats removes any single value (no variation) term from the fixed-effect model (MM_fixed). Example, the same value for the Age covariate
  4. OpenStats checks the interaction term to make sure all interactions have some data attached. Example, no value in Female.Control
  5. OpenStats removes duplicated columns in the dataset before applying the model. Example, two variables have the same values
  6. OpenStats reports the missings in the model terms and alerts when more than 50% missings detected. Note that OpenStats does not do more except notifying the user. The missings must be handled before the analysis.

One can turn the checks 1-4 on/off by setting TRUE/FALSE in the MM_checks argument of the OpenStatsAnalysis function.

2.3.2.2 Model optimisation and MM_optimise

OpenStats enjoys an optional optimisation (model selection) step to reduce the complexity as well as to improve the power of the analysis. This is controlled by the MM_optimise argument. The first three elements of MM_optimise for example (TRUE, TRUE, TRUE) apply to the optimisation step of the fixed, weight and random effect respectively. MM_direction controls the direction of the fixed_effect optimisation, forward selection (FS), backward elimination (BE) and stepwise (STPW). The forward selection starts from an intercept and at each step adds more variables. The backward elimination starts from a fully saturated model and at each step removes/keep a variable. In contrast, STPW combines two other methods and add/remove/preserve a variable at each step.

OpenStats utilises AICc, Akaike information criterion (AIC), that has a correction for small sample sizes to perform the mutual comparisons in the model optimisation (selection). AICc is defined by

\[AICc = AIC+\frac{2k^2+2k}{n-k-1}\] where \(k\) and \(n\) are the number of parameters and samples respectively and AIC is defined by \[-2logLikelihood + 2k.\]

Note that OpenStats allows a minimal model for the optimisation (default genotype effect ~ Genotype + 1) to prevent the possible chance of eliminating the effects of interest. This is controlled by a right-sided formula in MM_lower.

Further to the fixed effects, OpenStats allows optimisation on the random and the weight effects by comparing the AICc between a model with and without these effects. To summerise, the order of optimisation is as follows:

  1. Fixed effects: Three possible model selection scenarios FS/BE/STPW. The mutual sub-models are compared on the bases of AICc. Confidence Intervals (CI) at the level of MMFERR_conf.level are estimated for the variable in the fixed effects (MM_fixed).
  2. Weight effect. A model with the weight effect is compared to the one without this effect based on AICcc.
  3. Random effect. A model with the random effect is compared to the one without this effect based on AICcc. CI at the level of MMFERR_conf.level is estimated for the random effect.

Because the AICc only applies to the Maximum Likelihood (ML) estimation of the parameters, all models are first estimated by ML but the final models are re-estimated using Restricted Maximum Likelihood method (REML) that is proven to have a better fit in the linear mixed models setup.

The example below fits an optimised linear mixed model to the data:

2.3.2.3 Sub-model estimation

OpenStats can estimate submodels from an input model. This is called Split model effects in the outputs. This is mainly useful for reporting sex/age-specific effects. This is performed by creating submodels of a full model. For instance, for the input fixed effect (MM_fixed) model response~Genotype+Sex+Weight a possible submodel is response~Sex+Sex:Genotype + Weight that can be used to estimate sex-specific effects for genotype. This model is then estimated under the configuration of the optimal model. One can turn off Split model effects by setting the fourth element of MM_optimise to FALSE.

An alternative to the analytically estimating the sub-models is to break the input data into splits and run the model on the subset of the data. This can be performed by passing the output of OpenStats analysis, OpenStatsMM, to the function, OpenStatsComplementarySplit. This function allows the OpenStatsMM as the input and a set of variable names that split happens on. The example below shows the split on ‘Sex’:

## 2020-04-21 15:48:58. Split effects in progress ...
## 2020-04-21 15:48:58. Variable(s) to split:
## 2020-04-21 15:48:58.     Sex
## 2020-04-21 15:48:58.     Splitting in progress ...
## 2020-04-21 15:48:58.     Spliting on Sex ...
## 2020-04-21 15:48:58. Processing the levels: Female
## 2020-04-21 15:48:58. [Successful]
## 2020-04-21 15:48:58. Processing the levels: Male
## 2020-04-21 15:48:59. [Successful]

2.3.2.4 Effect size and percentage change

OpenStats estimates the normalised effect size for each variable in the (input) fixed effects. This is performed by first normalising the data and following the steps below

  • For the categorical variables such as Sex, Genotype etc:
    1. Normalise the input data to have zero mean and unit standard deviation, columnwise
    2. Fit the sub model \(response=\beta_0+\beta_1 VariableOfInterest+e\) under the optimal model configuration and estimate the standard deviation of the residuals, \(\sigma_e\)
    3. Calculate the max absolute distance (MAD) of the mean of response under each level of the categorical variable e.g.mean response for the control and the threatment group
    4. Estimate the effect size by \(MAD/\sigma_e\)
  • For the continuous variables such as body weight:
    1. Normalise the input data to have zero mean and unit standard deviation, columnwise
    2. Fit the sub model \(response=\beta_0+\beta_1 VariableOfInterest+e\) under the optimal model configuration
    3. Report \(\beta_1\) as the effect size

The percentage change for all variables (categorical and continues) is estimated as below

  1. Normalise the input data to have zero mean and unit standard deviation, columnwise
  2. Fit the sub model \(response=\beta_0+\beta_1 VariableOfInterest+e\) under the optimal model configuration
  3. Obtain \(\beta_1\) for the percentage change

Note that one can exclude the calculation of the effect sizes and percentage changes by setting the fifth element of MM_optimise to FALSE.

2.3.2.5 Diagnosis the model

Further to some visualisation tools, OpenStats assesses the normality of the residuals from the optimised model for all levels of the explanatory variables. The example below shows the output of the normality test:

## Variable: Genotype
##       levels:
##          control = 0.250068296469792, 
##          experimental = 0.769755878907674
## Variable: Sex
##       levels:
##          Female = 0.132572244195251, 
##          Male = 0.0642736685997496
## Variable: Genotype_Sex
##       levels:
##          control Female = 0.156292393095656, 
##          experimental Female = 0.402230280773861, 
##          control Male = 0.0556910551209782, 
##          experimental Male = 0.798131346349259
## Variable: Overall = 0.230443969421063

It is also possible to diagnose the goodness of fits by using plot and summary functions. Below shows an example of these two functions.

## 2020-04-21 15:49:00. Working out the summary table ...
Statistic Value
Applied framework Linear Mixed Model framework, LME, not including Weight
Final model data_point ~ Genotype + Sex
………………………. ……………………….
Tested Gene experimental
Reference Gene control
………………………. ……………………….
Sexual dimorphism detected? FALSE, Genotype-Sex interaction is part of the input (it is not part of the final) model.
………………………. ……………………….
Genotype contribution overall 0.465138433592606
Genotype contribution Females 0.878763961951334
Genotype contribution Males 0.273730539893548
………………………. ……………………….
LifeStage contribution -
Genotype contribution Early -
Genotype contribution Late -
………………………. ……………………….
Sex contribution 0
Body weight contribution -
## 2020-04-21 15:49:00. Working on the plot ...

2.3.2.6 Notes, messages and warnings

OpenStats is designed to report the progress of the function in a concise but informative way. All fatal errors are stored in a placeholder named messages in the output object. The minor errors and warnings are reported, provided debug is set to TRUE. OpenStatsAnalysis is capable of coping with some already known scenarios that lead to a fatal error. For example, if the random effect is the cause of the error, then OpenStats removes the term. The same applies to the fix effects. One could turn off all reports by setting debug=FALSE.

The example below shows a complete round of analysis by OpenStatsAnalysis

## 2020-04-21 15:49:01. OpenStats loaded.
## 2020-04-21 15:49:01. Checking the input model for including functions ...
## 2020-04-21 15:49:01. Linear Mixed Model (MM framework) in progress ...
## 2020-04-21 15:49:01. Removing possible leading/trailing whitespace from the variables in the formula ...
## 2020-04-21 15:49:01. Checking duplications in the data model:
## 2020-04-21 15:49:01.      Genotype, data_point, Sex
## 2020-04-21 15:49:01.     No duplicate found.
## 2020-04-21 15:49:01. Checking for the feasibility of terms and interactions ...
## 2020-04-21 15:49:01.      Formula: data_point ~ Genotype + Sex + Genotype:Sex
## 2020-04-21 15:49:01.     1 of 2. Checking for the feasibility of terms and interactions ...
## 2020-04-21 15:49:01.          Checking Genotype ...
## 2020-04-21 15:49:01.          Checking Sex ...
## 2020-04-21 15:49:01.     2 of 2. Checking for the feasibility of terms and interactions ...
## 2020-04-21 15:49:01.          Checking Genotype, Sex ...
## 2020-04-21 15:49:01.     Checked model: data_point ~ Genotype + Sex + Genotype:Sex
## 2020-04-21 15:49:01. Check missings in progress ...
## 2020-04-21 15:49:01.      Missings in variable `data_point`: 0%
## 2020-04-21 15:49:01.      Missings in variable `Genotype`: 0%
## 2020-04-21 15:49:01.      Missings in variable `Sex`: 0%
## 2020-04-21 15:49:01. Checking the random effect term ...
## 2020-04-21 15:49:01.     Formula: ~1 | Batch
## 2020-04-21 15:49:01. Lme: Fitting the full model ...
## 2020-04-21 15:49:01.     Applied model: lme
## 2020-04-21 15:49:01.     The full model successfully applied.
## 2020-04-21 15:49:01.     Computing the confidence intervals at the level of 0.95 ...
## 2020-04-21 15:49:01.      CI for `all` term(s) successfully estimated
## 2020-04-21 15:49:01. The specified "lower" model: 
## 2020-04-21 15:49:01.     ~1 + Genotype
## 2020-04-21 15:49:01. The model optimisation is in progress ...
## 2020-04-21 15:49:01.     The direction of the optimisation (backward, forward, both): both
## 2020-04-21 15:49:01.     Optimising the model ...
## 2020-04-21 15:49:01.     Optimised model: data_point ~ Genotype + Sex
## 2020-04-21 15:49:01.     Computing the confidence intervals at the level of 0.95 ...
## 2020-04-21 15:49:01.      CI for `all` term(s) successfully estimated
## 2020-04-21 15:49:01. Testing varHom ...
## 2020-04-21 15:49:01.     Computing the confidence intervals at the level of 0.95 ...
## 2020-04-21 15:49:01.      CI for `all` term(s) successfully estimated
## 2020-04-21 15:49:01.     VarHom checked out ...
## 2020-04-21 15:49:01. Testing Batch ...
## 2020-04-21 15:49:01.     Computing the confidence intervals at the level of 0.95 ...
## 2020-04-21 15:49:01.      CI for `all` term(s) successfully estimated
## 2020-04-21 15:49:01. 1. Split on Sex ...
## 2020-04-21 15:49:01. Checking the split model:
## 2020-04-21 15:49:01.     data_point ~ Sex + Genotype:Sex
## 2020-04-21 15:49:01.     Tested model: data_point ~ Sex + Genotype:Sex [Successful]
## 2020-04-21 15:49:01.     Computing the confidence intervals at the level of 0.95 ...
## 2020-04-21 15:49:01.      CI for `all` term(s) successfully estimated
## 2020-04-21 15:49:01. SplitEffects. Output names: Genotype_Sex
## 2020-04-21 15:49:01. Estimating effect sizes ...
## 2020-04-21 15:49:02.     Total effect sizes estimated: 7
## 2020-04-21 15:49:02. Quality tests in progress ...
## 2020-04-21 15:49:02. MM framework executed in 1.14 second(s).

2.3.3 Fisher exact test framework

Fisher Exact Test is a standard framework for analysing categorical data that assesses the null hypothesis of independence of rows and columns in a contingency table with fixed marginals. This is the subject of the Fisher Exact (FE) framework in OpenStats.

2.3.3.1 Implementation

The Fisher Exact framework test is implemented with generic R functions fisher.test() from the stats package after the construction of count matrices from the dataset. Together with count matrices we also calculate the effect sizes and testing for the all possible sub-tables of the full table.

The FE methodology is accessible from the OpenStatsAnalysis function manager by setting method to “FE”. For the FE methodology, some additional arguments determine how the FE is executed. The model of the data needs to be specified in the form of standard R formula in FE_formula, e.g. response~covariates. No mathematical operation is allowed in the formula and the interactions are ignored. As a result, response~2*(Sex) is an invalid and response~Sex*Genotype is considered as response~Sex+Genotype. All terms on the model, FE_formula, must be categorical with at least two levels. Then all single level factors are removed in the model checking step. The FE framework runs a Fisher exact under the analytical implementation and the Monte Carlo, controlled by the number of Monte Carlo iterations FERR_rep or set to zero FERR_rep=0 disables Monte Carlo estimation, for testing all covariates in the right-hand side of the FE_formula as well as combined sub levels. This is shown in the example below where the dataset is borrowed from the IMPC and FE is applied to detect any possible significant effect caused by genotype.

## 2020-04-21 15:49:02. OpenStats loaded.
## 2020-04-21 15:49:02. Checking the input model for including functions ...
## 2020-04-21 15:49:02. Fisher Exact Test (FE framework) in progress ...
## 2020-04-21 15:49:02. Optimisation level:
## 2020-04-21 15:49:02.     Estimation of all factor combination effects = TRUE
## 2020-04-21 15:49:02.     Estimation of inter level factors for the response = FALSE
## 2020-04-21 15:49:02. The input formula: Thoracic.Processes ~ Genotype + Sex
## 2020-04-21 15:49:02. The reformatted formula for the algorithm: ~Thoracic.Processes + Genotype + Sex
## 2020-04-21 15:49:02.     Top framework: FE
## 2020-04-21 15:49:02. Fisher exact test with 1500 iteration(s) in progress ...
## 2020-04-21 15:49:02. Check missings in progress ...
## 2020-04-21 15:49:02.      Missings in variable `Thoracic.Processes`: 0.32%
## 2020-04-21 15:49:02.      Missings in variable `Genotype`: 0%
## 2020-04-21 15:49:02.      Missings in variable `Sex`: 0%
## 2020-04-21 15:49:02. Removing possible leading/trailing whitespace from the variables in the formula ...
## 2020-04-21 15:49:02. Step 1. Testing "Thoracic.Processes"
## 2020-04-21 15:49:02. The data (variable(s) = Thoracic.Processes) contain 2 missing(s) ...
## 2020-04-21 15:49:02.     Missing data removed
## 2020-04-21 15:49:02.     Testing for the main effect: Genotype
## 2020-04-21 15:49:02.     Testing for the main effect: Sex
## 2020-04-21 15:49:02. Combined effects in progress ...
## 2020-04-21 15:49:02.     Splitting in progress ...
## 2020-04-21 15:49:02.     Spliting on Genotype ...
## 2020-04-21 15:49:02.     Spliting on Sex ...
## 2020-04-21 15:49:02.     Shrinking in progress ...
## 2020-04-21 15:49:02.     Dichotomising the final tables ...
## 2020-04-21 15:49:02.     Finalising the tables ....
## 2020-04-21 15:49:02. Testing for the combined effects ...
## 2020-04-21 15:49:02. Step 2. Testing "Genotype"
## 2020-04-21 15:49:02.     Testing for the main effect: Sex
## 2020-04-21 15:49:02. Total tested categories = 2: Thoracic.Processes, Genotype
## 2020-04-21 15:49:02.     Total tests =  7
## 2020-04-21 15:49:02. FE framework  executed in 0.05 second(s).

Below shows the output structure of the results from the FE framework for testing the response, “Thoracic.Processes” with “+/+” the control group and “Aff3/Aff3” the mutants.

## The structure of an FE output object for "Thoracic.Processes":
 1 .  Genotype   
 2 .  Sex   
 3 .  Sex_+/+   
 4 .  Sex_Aff3/Aff3   
 5 .  Genotype_Female   
 6 .  Genotype_Male   

One could interpret the results above as **Main effect(s)_Level(s)**. For example Genotype_Females represents a test of response on Genotype for females only. The corresponded contingency table of this example is shown below:

## Test on Females only:
+/+ Aff3/Aff3
Abnormal 57 7
Normal 250 0

2.3.3.2 Optimisation

There are a number of optimisation parameters that can be set in FERR_FullComparisions parameter. This must be a vector of two logical flags, default c(TRUE, FALSE). Setting the first value to TRUE, then all combinations of the effects (all levels of factors in the input model - for example LifeStage_Male, Genotype_Male, LifeStage_Female, Genotype_Female and so on) will be tested. Otherwise, only the main effects (no sublevels - for example, Sex_LifeStage [not for instance LifeStage_Male]) will be tested. Setting the second element to TRUE (default FALSE) will force the Fisher’s exact test to do all comparisons between different levels of the RESPONSE variable, provide the response has more than two levels. For example, if the response has three levels such as 1.positive, 2.negative and 3.neutral then the comparisons will be between 1&2, 1&3, 2&3 and 1&2&3 (obviously the latter is the complete table).

The effect size for a test is estimated as the maximum percentage change seen in the contingency table corresponded to the test. This is briefly explained on this page.

2.3.3.3 Plot and summary

The generic plot and summary functions are available for the FE output object, OpenStatsFE. The example below shows the typical plot and summary for an OpenStatsFE output from the example above.

## 2020-04-21 15:49:02. Working out the summary table ...
Statistic Value
Applied framework Fisher Exact Test framework
Final model Thoracic.Processes ~ Genotype + Sex
………………………. ……………………….
Tested Gene Aff3/Aff3
Reference Gene +/+
………………………. ……………………….
Sexual dimorphism detected? TRUE, Sex specific results are always reported.
………………………. ……………………….
Genotype contribution overall 3.0743402785008e-09
Genotype contribution Females 1.1127905915376e-05
Genotype contribution Males 0.000159596518526762
………………………. ……………………….
LifeStage contribution -
Genotype contribution Early -
Genotype contribution Late -
………………………. ……………………….
Sex contribution 0.0137221975272656
Body weight contribution -
## 2020-04-21 15:49:02. Working on the plot ...
## 2020-04-21 15:49:02. Can not find [Thoracic.Processes X LifeStage] table

2.3.4 Reference range plus framework

There are some cases where the linear mixed model fails or leads to misleading results, a very small variation in the response or an unusual number of outliers in the data to name some. Reference Range plus (RR+) method is useful to cope with these unusual scenarios. This methodology is considered a conservative method where data are classified as normal, low or high depending on whether they lie within the natural variation seen within the controls. Once the data are classified by using the ranges calculated from control data, then a Fisher Exact Test is used to compare the movement towards the low or high classification, e.g. (Low versus Normal and High) or (Low and Normal versus High). More details about RR+ can be found on this page.

2.3.4.1 Implementation

The RR+ methodology is driven from the OpenStatsAnalysis function manager. To analyse the data using the RR+ methodology the argument method needs to be set to “RR”. For the RR+ methodology, there are a number of additional arguments that determine how the RR+ is executed. The model of the data needs to be specified in the form of standard R formula in RR_formula, e.g. response~covariates. The left-hand side of the model must be continuous and right-hand side of the model must contain the categorical variables only, no mathematical operation is allowed in the formula and the interactions are ignored. As a result, log(response)~Genotype+log(weight) is an invalid and response~Sex*Genotype is considered as response~Sex+Genotype. The first term on the right-hand side of the formula RR_formula is called Reference variable and must be categorical with least of two levels. The level specified in RRrefLevel determines the control population so that the other levels are compared to. The RR_prop argument (defaults to 95, minimal value is set to 50) determines how much of the data is classified as normal and thus is used in the calculations of the classification threshold. The thresholds are determined using a percentile method, which avoids any distribution assumptions. For example, when RR_prop defaults to 95, then 95% of the data (that labelled as RRrefLevel) will be classed as normal, i.e. the 2.5 \([(1-.95)/2]\%\) and 97.5 \([1-(1-0.95)/2]\%\)) percentile thresholds of the dependent variable are used to define the boundaries of normal. Finally FERR_rep determines the number of iterations in the Monte Carlo Fisher exact test or alternatively setting to zero FERR_rep=0 disables Monte Carlo estimation.

2.3.4.2 Comparision structure

The comparisons are broken into the two groups, Low versus NormalHigh and LowNormal versus High. This is prefixed by Low. or High. in the output of OpenStatsAnalysis. The output structure is coded as Low/High.Response.Reference Variable. Other variable levels. The example below shows a full round of analysis by RR+ method where the reference variable and level are set to Genotype and control respectively,

## 2020-04-21 15:49:02. OpenStats loaded.
## 2020-04-21 15:49:02. Checking the input model for including functions ...
## 2020-04-21 15:49:02. Reference Range Plus (RR framework) in progress ...
## 2020-04-21 15:49:02. Optimisation level:
## 2020-04-21 15:49:02.     Estimation of all factor combination effects = TRUE
## 2020-04-21 15:49:02.     Estimation of inter level factors for the response = FALSE
## 2020-04-21 15:49:02. The input formula: data_point ~ Genotype + Sex
## 2020-04-21 15:49:02. The reformatted formula for the algorithm: ~data_point + Genotype + Sex
## 2020-04-21 15:49:02. The probability of the middle area in the distribution: 0.95
## 2020-04-21 15:49:02.      Tails probability: 0.025
## 2020-04-21 15:49:02.      Formula to calculate the tail probabilities: 1-(1-x)/2, (1-x)/2 where x = 0.95
## 2020-04-21 15:49:02. Discritizing the continuous data into discrete levels. The quantile = 0.975
## 2020-04-21 15:49:02. Stp 1. Low versus Normal/High
## 2020-04-21 15:49:02. Removing possible leading/trailing whitespace from the variables in the formula ...
## 2020-04-21 15:49:02. Preparing the reference ranges ...
## 2020-04-21 15:49:02. Preparing the data for the variable: Genotype
## 2020-04-21 15:49:03. Reference level is set to `control`
## 2020-04-21 15:49:03.     Initial quantiles for cutting the data 
## 2020-04-21 15:49:03.              Probs: 0, 0.025, 1
## 2020-04-21 15:49:03.              N.reference: 397
## 2020-04-21 15:49:03.              Quantiles: 6.04, 8.08, 17.73
## 2020-04-21 15:49:03.      Detected percentiles in the data (8 decimals): Low = 0.02518892, NormalHigh = 0.97481108
## 2020-04-21 15:49:03.     Spliting on Sex ...
## 2020-04-21 15:49:03. Preparing the data for the combined effect: Female
## 2020-04-21 15:49:03. Reference level is set to `control`
## 2020-04-21 15:49:03.     Initial quantiles for cutting the data 
## 2020-04-21 15:49:03.              Probs: 0, 0.025, 1
## 2020-04-21 15:49:03.              N.reference: 196
## 2020-04-21 15:49:03.              Quantiles: 7.579, 9.267, 17.73
## 2020-04-21 15:49:03.      Detected percentiles in the data (8 decimals): Low = 0.0255102, NormalHigh = 0.9744898
## 2020-04-21 15:49:03. Preparing the data for the combined effect: Male
## 2020-04-21 15:49:03. Reference level is set to `control`
## 2020-04-21 15:49:03.     Initial quantiles for cutting the data 
## 2020-04-21 15:49:03.              Probs: 0, 0.025, 1
## 2020-04-21 15:49:03.              N.reference: 201
## 2020-04-21 15:49:03.              Quantiles: 6.04, 7.673, 15.645
## 2020-04-21 15:49:03.      Detected percentiles in the data (8 decimals): Low = 0.02985075, NormalHigh = 0.97014925
## 2020-04-21 15:49:03. Stp 2. Low/Normal versus High
## 2020-04-21 15:49:03. Removing possible leading/trailing whitespace from the variables in the formula ...
## 2020-04-21 15:49:03. Preparing the reference ranges ...
## 2020-04-21 15:49:03. Preparing the data for the variable: Genotype
## 2020-04-21 15:49:03. Reference level is set to `control`
## 2020-04-21 15:49:03.     Initial quantiles for cutting the data 
## 2020-04-21 15:49:03.              Probs: 0, 0.975, 1
## 2020-04-21 15:49:03.              N.reference: 397
## 2020-04-21 15:49:03.              Quantiles: 6.04, 14.5, 17.73
## 2020-04-21 15:49:03.      Detected percentiles in the data (8 decimals): LowNormal = 0.97481108, High = 0.02518892
## 2020-04-21 15:49:03.     Spliting on Sex ...
## 2020-04-21 15:49:03. Preparing the data for the combined effect: Female
## 2020-04-21 15:49:03. Reference level is set to `control`
## 2020-04-21 15:49:03.     Initial quantiles for cutting the data 
## 2020-04-21 15:49:03.              Probs: 0, 0.975, 1
## 2020-04-21 15:49:03.              N.reference: 196
## 2020-04-21 15:49:03.              Quantiles: 7.579, 14.744, 17.73
## 2020-04-21 15:49:03.      Detected percentiles in the data (8 decimals): LowNormal = 0.9744898, High = 0.0255102
## 2020-04-21 15:49:03. Preparing the data for the combined effect: Male
## 2020-04-21 15:49:03. Reference level is set to `control`
## 2020-04-21 15:49:03.     Initial quantiles for cutting the data 
## 2020-04-21 15:49:03.              Probs: 0, 0.975, 1
## 2020-04-21 15:49:03.              N.reference: 201
## 2020-04-21 15:49:03.              Quantiles: 6.04, 13.527, 15.645
## 2020-04-21 15:49:03.      Detected percentiles in the data (8 decimals): LowNormal = 0.97014925, High = 0.02985075
## 2020-04-21 15:49:03. Fisher exact test with 1500 iteration(s) in progress ...
## 2020-04-21 15:49:03. Analysing Low vs NormalHigh ...
## 2020-04-21 15:49:03. Analysing LowNormal vs High ...
## 2020-04-21 15:49:03. RR framework executed in 0.31 second(s).

Below shows the top-level outputs for the example above:

## The structure of an RR+ output object:
 1 .  Low.data_point.Genotype   
 2 .  Low.data_point.Genotype.Female   
 3 .  Low.data_point.Genotype.Male   
 4 .  High.data_point.Genotype   
 5 .  High.data_point.Genotype.Female   
 6 .  High.data_point.Genotype.Male   

There are also sub-level comparisons available for each class of tests. For instance, The example below shows the sublevel comparisons for the (Low versus NormalHigh) class:

## The structure of an RR+ output object for Low vs NormalHigh:
 1 .  Genotype   
 2 .  Sex   
 3 .  Sex_control   
 4 .  Sex_experimental   
 5 .  Genotype_Female   
 6 .  Genotype_Male   

2.3.4.3 Optimisation

There are some optimisation parameters that can be set in FERR_FullComparisions argument. This must be a vector of two logical flags, default c(TRUE, FALSE). Setting the first value to TRUE, then all combinations of the effects (all levels of factors in the input model - for example LifeStage_Male, Genotype_Male, LifeStage_Female, Genotype_Female and so on) will be tested. Otherwise, only the main effects (no sublevels - for example, Sex_LifeStage [not for instance LifeStage_Male]) will be tested. The second element of the vector only applies to the categorical response (FE framework). Setting this element to TRUE (default FALSE) will force the Fisher’s exact test to do all comparisons between different levels of the RESPONSE variable, provided the response has more than two levels. The latter does apply to RR+ as there are always two levels in the discretised response.

2.3.4.4 Effect size and confidence interval

The effect size for each test is estimated as the maximum percentage change seen in the contingency table corresponded to the test. This is briefly explained on this page.

2.3.4.5 Plot and summary

The generic plot and summary are available for the RR+ output object, OpenStatsRR. The example below shows the typical plot and summary for the RR+ object.

## 2020-04-21 15:49:03. Working out the summary table ...
Statistic Value
Applied framework Reference Range Plus Test framework; quantile = 0.95 (Tails probability = 0.025)
Final model data_point ~ Genotype + Sex
………………………. ……………………….
Tested Gene experimental
Reference Gene control
………………………. ……………………….
Sexual dimorphism detected? TRUE, Sex specific results for Low/High tables are always reported.
………………………. * Separate p-values for (Low vs NormalHigh), (LowNormal vs High) and details
Genotype contribution overall 0.0512363163025528, 1, Data is discritised by Genotype levels
Genotype contribution Females 1, 1, Data is first split by Sex levels then discritised by Genotype levels
Genotype contribution Males 0.242166192870571, 1, Data is first split by Sex levels then discritised by Genotype levels
………………………. ……………………….
LifeStage contribution -
Genotype contribution Early -
Genotype contribution Late -
………………………. ……………………….
Sex contribution 0.000424761257560054, 0.00954303832245549, Data is discritised by Genotype levels
Body weight contribution -
## 2020-04-21 15:49:03. Working on the plot ...
## 2020-04-21 15:49:03. Can not find [High.data_point.Genotype X LifeStage] table

2.3.4.6 Note

As pointed out in here, RR+ should not be used when the control population contains less than 40 observations. Further, one should use test the better option Linear mixed model propr to applying RR+.

2.3.5 Output and report with OpenStatsReport

2.3.5.1 Output

The output of OpenStatsAnalysis is formalised into three placeholders,

  • Input. The input to the OpenStatsAnalysis function
  • Output. The output object encapsulates the model fits, optimisation steps, results, confidence intervals, effect sizes, test results, etc
  • Extra. There are some values that do not fit in the two categories above. They are stores in this parameter. For instance, missings, checked formula etc.

Below shows the input/output and extra section of an RR+ output object:

## The structure of an `input` object :
 1 .  OpenStatsList   
 2 .  data   
 3 .  depVariable   
 4 .  rep   
 5 .  method   
 6 .  formula   
 7 .  prop   
 8 .  ci_level   
 9 .  refLevel   
 10 .  full_comparisions   
## The structure of an `output` object:
 1 .  Low.data_point.Genotype   
 2 .  Low.data_point.Genotype.Female   
 3 .  Low.data_point.Genotype.Male   
 4 .  High.data_point.Genotype   
 5 .  High.data_point.Genotype.Female   
 6 .  High.data_point.Genotype.Male   
## The structure of an `extra` object:
 1 .  Cleanedformula   
 2 .  TransformedRRprop   

2.3.5.2 Report

There are some predefined effects such as sex effect that are typically used in the phenotypic analysis. A collection of this parameters are already reported in this document. OpenStatsReport reports this list in JSON or generic R list format.

3 Further questions

This manual followed by the function manual in the OpenStats R package address the theories and implementation of the three statistical frameworks in OpenStats. The further questions could be requested by the email of the OpenStats developer, .