Review for: DE-RC: Enhancing Deterministic Inference for Dropout Models through Random Consistency Mitigation

Summary

This paper studies the instability of predictions in neural networks with dropout and formulates the Randomized Classification (RC) problem, where different dropout masks may lead to inconsistent predictions for the same input. The authors propose DE-RC, a method that employs multiple subnetworks with different dropout masks and introduces a consistency regularization term to reduce prediction variance across masks.

The paper also provides a theoretical analysis relating ensemble variance, mask correlation, and RC probability. Empirically, the method is evaluated on 25 datasets and compared with two baselines: a single network and a voting ensemble.

Overall, the goal of the paper is to demonstrate that reducing prediction variance across dropout masks leads to more deterministic inference while maintaining classification performance.


Strengths

  • Interesting problem formulation.
    The paper highlights an important yet often overlooked issue in neural networks with dropout, namely the instability of predictions caused by random dropout masks. The authors formalize this phenomenon as the Randomized Classification (RC) problem, which provides a clear conceptual framework for studying prediction inconsistency.

  • Novel training perspective for addressing dropout-induced variance.
    The key idea of suppressing dropout-induced prediction variance during training, rather than relying solely on inference-time averaging, is interesting and potentially useful for improving deterministic inference.

  • Solid theoretical motivation.
    The paper provides theoretical analysis to support the proposed approach. In particular, the variance decomposition separating inference variance and training variance helps identify the core source of the RC problem. The analysis of ensemble variance and subnetwork correlation provides further intuition for why disagreement among dropout subnetworks should be reduced.

  • Theory-informed method design.
    The proposed framework is clearly motivated by the theoretical insights presented earlier in the paper. The use of multiple subnetworks and consistency regularization is aligned with the goal of reducing prediction disagreement across dropout masks.

  • Conceptually simple and easy-to-integrate approach.
    The proposed method does not require major architectural changes and can be incorporated into standard neural network training pipelines, which may make it attractive for practical applications.

  • Relatively extensive empirical evaluation.
    The method is evaluated on a large number of datasets (25 in total) covering multiple data domains. The experiments consider several evaluation criteria, including predictive performance, stability, robustness, and computational cost.

  • Potential practical relevance.
    Reducing prediction variance caused by dropout may improve the reliability of neural network inference, which can be important for applications where deterministic behavior is desirable.

Weaknesses

1. Mismatch between theoretical objective and experimental evaluation

The theoretical analysis focuses on reducing the inference variance induced by dropout masks, expressed as

\[ \operatorname{Var}_r(\hat{y}(r)). \]

However, the experiments report variance across dataset partitions, which conflates several sources of randomness, including:

  • data splits
  • model initialization
  • optimization randomness
  • dropout randomness
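
The conflation can be made explicit with the law of total variance. The notation here is this review's assumption rather than the paper's: \(s\) collects all training-time randomness (data split, initialization, optimization) and \(r\) is the inference-time dropout mask, both for a fixed test input.

\[ \operatorname{Var}_{s,r}(\hat{y}) = \mathbb{E}_s\big[\operatorname{Var}_r(\hat{y} \mid s)\big] + \operatorname{Var}_s\big(\mathbb{E}_r[\hat{y} \mid s]\big). \]

Only the inner term \(\operatorname{Var}_r(\hat{y} \mid s)\) corresponds to the quantity analyzed in the theory; the second term is driven entirely by training-time randomness and can dominate a variance computed across partitions.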

Therefore, the reported variance does not directly correspond to the inference variance analyzed in the theoretical section, making the empirical validation of the theory less convincing.
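
One way to measure the quantity the theory analyzes is to freeze a single trained model and vary only the dropout mask. The sketch below is a minimal illustration of that protocol, not the authors' implementation: the one-hidden-layer MLP, the inverted-dropout convention, the random stand-in weights, and the names `mask_only_variance` and `flip_rate` are all assumptions made for this review.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # stabilise the exponentials
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def forward(x, W1, W2, mask):
    """One pass of a one-hidden-layer MLP (assumed architecture) with a dropout mask."""
    h = np.maximum(x @ W1, 0.0) * mask
    return softmax(h @ W2)

def mask_only_variance(x, W1, W2, p_drop=0.5, n_masks=200, seed=0):
    """Estimate dropout-induced variance with the weights held fixed.

    Returns, per input, the variance of the class probabilities over masks
    (summed over classes) and the fraction of masks whose predicted label
    differs from the most frequent label.
    """
    rng = np.random.default_rng(seed)
    keep = 1.0 - p_drop
    probs = []
    for _ in range(n_masks):
        # Inverted dropout: one mask per pass, shared by all inputs.
        mask = rng.binomial(1, keep, size=W1.shape[1]) / keep
        probs.append(forward(x, W1, W2, mask))
    probs = np.stack(probs)                      # (n_masks, n_samples, n_classes)
    var = probs.var(axis=0).sum(axis=-1)         # (n_samples,)
    labels = probs.argmax(axis=-1)               # (n_masks, n_samples)
    n_classes = probs.shape[-1]
    counts = np.stack([(labels == c).sum(axis=0) for c in range(n_classes)], axis=-1)
    flip_rate = 1.0 - counts.max(axis=-1) / n_masks
    return var, flip_rate

# Random weights stand in for a trained model (hypothetical sizes).
rng = np.random.default_rng(1)
W1 = rng.normal(size=(8, 16))
W2 = rng.normal(size=(16, 3))
x = rng.normal(size=(5, 8))
var, flip_rate = mask_only_variance(x, W1, W2)
```

Reporting such a mask-only statistic for each trained model, alongside the existing across-partition variance, would separate the effect the theory predicts from the other sources of randomness listed above.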


2. Incomplete architecture description

The paper describes the proposed architecture as consisting of multiple subnetworks, each implemented as an MLP with \(L\) hidden layers. However, key architectural details are missing:

  • the number of subnetworks \(N\)
  • the number of hidden layers \(L\)
  • hidden layer dimensionality

These parameters can significantly affect both model capacity and ensemble behavior, and their absence makes the experiments difficult to reproduce. Moreover, the experiments are conducted on a diverse set of datasets with very different characteristics (e.g., image and tabular data), yet it is unclear whether the same architecture is used across all datasets or whether the architecture is tuned for each dataset.


3. Unclear feature extraction pipeline

The paper states that:

MoCo v3 is used for ImageNet datasets and VGG-16 for all other benchmarks.

However, many datasets listed in the experiments appear to be tabular datasets (e.g., Wine Quality, Adult, Abalone). It is unclear how a convolutional architecture such as VGG-16 is applied to such datasets, or whether the tabular features are used directly.

Clarifying the feature extraction pipeline would improve the transparency of the experimental setup.


4. Limited baseline comparison

The experimental comparison currently includes only two baselines: a single network and a voting ensemble. Since the proposed method specifically aims to mitigate prediction variance induced by dropout masks, it would be more informative to include baselines that directly address stochastic inference, such as Monte Carlo Dropout and standard deterministic dropout inference (i.e., disabling dropout at test time with weight scaling). These baselines would provide a stronger reference point for evaluating the effectiveness of the proposed approach.
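
For concreteness, the two suggested baselines can be sketched as follows. This is an illustrative sketch under assumptions, not a description of any code from the paper: it uses a one-hidden-layer MLP with random stand-in weights and the inverted-dropout convention, under which disabling the mask at test time plays the role of classic weight scaling. The function names are hypothetical.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # stabilise the exponentials
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mlp_probs(x, W1, W2, mask):
    """Class probabilities of a one-hidden-layer MLP (assumed architecture)."""
    h = np.maximum(x @ W1, 0.0) * mask
    return softmax(h @ W2)

def deterministic_inference(x, W1, W2):
    """Standard dropout inference: dropout disabled, one forward pass.

    With inverted dropout the rescaling by 1/keep happens during training,
    so no extra weight scaling is applied here.
    """
    return mlp_probs(x, W1, W2, mask=1.0)

def mc_dropout_inference(x, W1, W2, p_drop=0.5, n_passes=50, seed=0):
    """Monte Carlo Dropout: average several stochastic passes.

    Returns the mean probabilities and their per-class standard deviation
    across passes.
    """
    rng = np.random.default_rng(seed)
    keep = 1.0 - p_drop
    passes = np.stack([
        mlp_probs(x, W1, W2, rng.binomial(1, keep, size=W1.shape[1]) / keep)
        for _ in range(n_passes)
    ])
    return passes.mean(axis=0), passes.std(axis=0)

# Random weights stand in for a trained model (hypothetical sizes).
rng = np.random.default_rng(2)
W1 = rng.normal(size=(6, 12))
W2 = rng.normal(size=(12, 3))
x = rng.normal(size=(4, 6))
det = deterministic_inference(x, W1, W2)
mc_mean, mc_std = mc_dropout_inference(x, W1, W2)
```

Both baselines reuse the same trained weights, so they add no training cost and isolate the benefit of the proposed consistency regularization from the benefit of averaging alone.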

In addition, the description of the voting ensemble baseline lacks important implementation details, such as the number of base models, the architectural configuration of each model, and the aggregation strategy (e.g., majority voting or probability averaging). Providing these details would help clarify the strength of the baseline and improve the transparency of the experimental comparison.


5. Unclear inference protocol

The paper describes two inference modes:

  1. single forward pass
  2. multiple stochastic passes with averaging

However, it is unclear which inference strategy is used to produce the reported experimental results.


Questions for the Authors

  1. How is the inference variance, as defined in the theoretical analysis, estimated in the experiments?
  2. What are the specific architectural settings (number of subnetworks \(N\), hidden layers \(L\), hidden dimensions)?
  3. How does the proposed architecture handle different data modalities, such as tabular and image datasets? In particular, how are the input representations constructed for these different data types, and how are feature extractors such as VGG-16 integrated into the pipeline?
  4. Which inference mode (single pass or multiple stochastic passes) is used for the reported results?
  5. How does the proposed method compare with Monte Carlo Dropout, deep ensembles, or deterministic dropout inference as baselines?


Overall Assessment

The paper addresses an interesting problem and proposes a conceptually simple approach combining ensemble learning and consistency regularization. However, several issues remain, including incomplete experimental details, limited baselines, and a mismatch between the theoretical objective and the empirical evaluation.

Addressing these issues would significantly strengthen the paper.

Recommendation: Weak accept.


Confidence

  1. Moderate confidence in the evaluation.