This paper studies the instability of predictions in neural networks with dropout and formulates the Randomized Classification (RC) problem, where different dropout masks may lead to inconsistent predictions for the same input. The authors propose DE-RC, a method that employs multiple subnetworks with different dropout masks and introduces a consistency regularization term to reduce prediction variance across masks.
The paper also provides a theoretical analysis relating ensemble variance, mask correlation, and RC probability. Empirically, the method is evaluated on 25 datasets and compared with two baselines: a single network and a voting ensemble.
Overall, the goal of the paper is to demonstrate that reducing prediction variance across dropout masks leads to more deterministic inference while maintaining classification performance.
Interesting problem formulation.
The paper highlights an important yet often overlooked issue in neural
networks with dropout, namely the instability of predictions caused by
random dropout masks. The authors formalize this phenomenon as the
Randomized Classification (RC) problem, which provides a clear
conceptual framework for studying prediction inconsistency.
Novel training perspective for addressing dropout-induced
variance.
The key idea of suppressing dropout-induced prediction variance during
training, rather than relying solely on inference-time averaging, is
interesting and potentially useful for improving deterministic
inference.
Solid theoretical motivation.
The paper provides theoretical analysis to support the proposed
approach. In particular, the variance decomposition separating inference
variance and training variance helps identify the core source of the RC
problem. The analysis of ensemble variance and subnetwork correlation
provides further intuition for why disagreement among dropout
subnetworks should be reduced.
Theory-informed method design.
The proposed framework is clearly motivated by the theoretical insights
presented earlier in the paper. The use of multiple subnetworks and
consistency regularization is aligned with the goal of reducing
prediction disagreement across dropout masks.
Conceptually simple and easy-to-integrate
approach.
The proposed method does not require major architectural changes and can
be incorporated into standard neural network training pipelines, which
may make it attractive for practical applications.
Relatively extensive empirical evaluation.
The method is evaluated on a large number of datasets (25 in total)
covering multiple data domains. The experiments consider several
evaluation criteria, including predictive performance, stability,
robustness, and computational cost.
Potential practical relevance.
Reducing prediction variance caused by dropout may improve the
reliability of neural network inference, which can be important for
applications where deterministic behavior is desirable. —
The theoretical analysis focuses on reducing the inference variance induced by dropout masks, expressed as
\[ Var_r(\hat{y}(r)). \]
However, the experiments report variance across dataset partitions, which conflates several sources of randomness, including:
Therefore, the reported variance does not directly correspond to the inference variance analyzed in the theoretical section, making the empirical validation of the theory less convincing.
The paper describes the proposed architecture as consisting of multiple subnetworks, each implemented as an MLP with \(L\) hidden layers. However, key architectural details are missing:
These parameters can significantly affect both model capacity and ensemble behavior, and their absence makes the experiments difficult to reproduce. Moreover, the experiments are conducted on a diverse set of datasets with very different characteristics (e.g., image and tabular data), yet it is unclear whether the same architecture is used across all datasets or whether the architecture is tuned for each dataset.
The paper states that:
MoCo v3 is used for ImageNet datasets and VGG-16 for all other benchmarks.
However, many datasets listed in the experiments appear to be tabular datasets (e.g., Wine Quality, Adult, Abalone). It is unclear how a convolutional architecture such as VGG-16 is applied to such datasets, or whether the tabular features are used directly.
Clarifying the feature extraction pipeline would improve the transparency of the experimental setup.
The experimental comparison currently includes only two baselines: a single network and a voting ensemble. Since the proposed method specifically aims to mitigate prediction variance induced by dropout masks, it would be more informative to include baselines that directly address stochastic inference, such as Monte Carlo Dropout and standard deterministic dropout inference (i.e., disabling dropout at test time with weight scaling). These baselines would provide a stronger reference point for evaluating the effectiveness of the proposed approach.
In addition, the description of the voting ensemble baseline lacks important implementation details, such as the number of base models, the architectural configuration of each model, and the aggregation strategy (e.g., majority voting or probability averaging). Providing these details would help clarify the strength of the baseline and improve the transparency of the experimental comparison.
The paper describes two inference modes:
However, it is unclear which inference strategy is used to produce the reported experimental results.
The paper addresses an interesting problem and proposes a conceptually simple approach combining ensemble learning and consistency regularization. However, several issues remain, including incomplete experimental details, limited baselines, and a mismatch between the theoretical objective and the empirical evaluation.
Addressing these issues would significantly strengthen the paper.
Recommendation: Weak accept.