Apparate: Early-Exit Models for ML Latency and Throughput Optimization - Comparisons
:::info
Authors:
(1) Yinwei Dai, Princeton University (Equal contributions);
(2) Rui Pan, Princeton University (Equal contributions);
(3) Anand Iyer, Georgia Institute of Technology;
(4) Ravi Netravali, Georgia Institute of Technology.
:::
Table of Links
Abstract and 1 Introduction
2 Background and Motivation and 2.1 Model Serving Platforms
2.2 Early-Exit Models
2.3 Challenges
3 Design
3.1 Preparing Models with Early Exits
3.2 Accuracy-Aware Threshold Tuning
3.3 Latency-Focused Ramp Adjustments
4 Implementation
5 Evaluation and 5.1 Methodology
5.2 Overall Results
5.3 Comparison with Existing EE Strategies
5.4 Microbenchmarks
6 Additional Related Work
7 Conclusion, References, Appendix
5.3 Comparison with Existing EE Strategies
We compare Apparate with two off-the-shelf EE models: BranchyNet [53] and DeeBERT [57]. BranchyNet extends ResNet models with ramps of the same style as Apparate, while DeeBERT extends BERT-base with deeper ramps (using the entire BERT pooler, as described in §3.1). For each, we follow their prescribed architectures, with ramps after every layer that are always active. We perform one-time tuning of thresholds as recommended by both works, and consider two variants: the default recommendation where all ramps must use the same threshold, and a more flexible version that removes this restriction (+). For both, threshold tuning is done optimally (via grid search), and is based on uniformly sampled data across the workload. For fair comparison, Apparate’s ramp budget is configured to support ramps at all layers (though it never does so).
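\
The two tuning variants can be illustrated with a minimal sketch. Everything here is an assumption made for illustration rather than the setup used in the evaluation: the function names `evaluate` and `grid_search`, the toy samples, the use of a per-ramp confidence score where higher means more confident (the original works may use a different exit criterion, such as entropy), accuracy measured as agreement with the full model, and mean layers executed as a stand-in for latency.

```python
from itertools import product

def evaluate(samples, thresholds, num_layers):
    """Return (agreement with the full model, mean layers executed)."""
    agree = 0
    layers = 0
    for sample in samples:
        # Default: no ramp fires, so the input runs the full model.
        correct, depth = True, num_layers
        for ramp_idx, (confidence, ramp_correct) in enumerate(sample):
            if confidence >= thresholds[ramp_idx]:
                # Exit at the first ramp whose confidence clears its threshold.
                correct, depth = ramp_correct, ramp_idx + 1
                break
        agree += correct
        layers += depth
    n = len(samples)
    return agree / n, layers / n

def grid_search(samples, num_ramps, num_layers, grid, max_drop, shared):
    """Pick thresholds minimizing mean layers executed within the accuracy budget."""
    if shared:
        # Default variant: every ramp uses the same threshold.
        candidates = [(t,) * num_ramps for t in grid]
    else:
        # "+" variant: each ramp gets its own threshold.
        candidates = product(grid, repeat=num_ramps)
    best = None
    for thresholds in candidates:
        accuracy, mean_layers = evaluate(samples, thresholds, num_layers)
        if 1.0 - accuracy <= max_drop and (best is None or mean_layers < best[1]):
            best = (tuple(thresholds), mean_layers)
    return best

# Hypothetical data: each sample lists (confidence, matches_full_model) per ramp.
samples = [
    [(0.9, True), (0.95, True)],
    [(0.9, False), (0.95, True)],
    [(0.3, False), (0.8, True)],
    [(0.2, False), (0.4, False)],
]
grid = [0.5, 0.85, 0.92, 1.01]  # 1.01 effectively disables a ramp

shared_best = grid_search(samples, 2, 3, grid, max_drop=0.0, shared=True)
per_ramp_best = grid_search(samples, 2, 3, grid, max_drop=0.0, shared=False)
print(shared_best)    # ((0.92, 0.92), 2.5)
print(per_ramp_best)  # ((0.92, 0.5), 2.25)
```

On this toy data, the per-ramp search finds a cheaper configuration than the shared-threshold search because it can keep the first ramp strict while relaxing the second. Note that the per-ramp grid grows exponentially with the number of ramps, so an exhaustive search like this one is only practical for small examples.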
\
Table 2 presents our results. The main takeaway is that existing EE approaches, even when favorably tuned, yield unacceptable drops in average accuracy of up to 23.9% and 17.8% for CV and NLP, respectively. In contrast, Apparate consistently meets the imposed accuracy constraint (1% in this experiment) for both workloads. Further, even though these systems violate the accuracy constraint, tail latencies are 0.9-9.4% lower with Apparate than with them. The reason is again a lack of adaptation: all ramps are always active regardless of their current efficacy, which varies dramatically over time (§2.3), yielding undue overheads for the large number of non-exiting inputs. In contrast, throughout these experiments, despite having a full ramp budget, Apparate maintained only 9.1-27.2% of all possible ramps.
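\
The overhead that always-active ramps impose on non-exiting inputs can be seen with a rough latency model. This is a sketch under stated assumptions, not a measurement: the function `input_latency`, the 12-layer model, and the per-layer and per-ramp costs are all hypothetical values chosen for illustration.

```python
def input_latency(layer_ms, ramp_ms, active_ramps, exit_ramp=None):
    """Latency of one input; exit_ramp is the layer index it exits after, or None."""
    total = 0.0
    for layer_idx, cost in enumerate(layer_ms):
        total += cost
        if layer_idx in active_ramps:
            # Every active ramp on the path is evaluated, whether or not it exits.
            total += ramp_ms
            if layer_idx == exit_ramp:
                return total
    return total

layer_ms = [1.0] * 12        # hypothetical: 12 layers at 1 ms each
ramp_ms = 0.25               # hypothetical cost of evaluating one ramp
all_ramps = set(range(11))   # a ramp after every layer except the last
few_ramps = {3, 7}           # a small, selectively chosen set of ramps

# A non-exiting input pays for every active ramp on top of the full model.
print(input_latency(layer_ms, ramp_ms, all_ramps))               # 14.75
print(input_latency(layer_ms, ramp_ms, few_ramps))               # 12.5
# An input that exits at the ramp after layer index 3 skips the remaining layers.
print(input_latency(layer_ms, ramp_ms, few_ramps, exit_ramp=3))  # 4.25
```

Under these made-up costs, a non-exiting input is about 23% slower than the plain 12 ms model when all ramps are active, but only about 4% slower with two ramps, which is why the number of active ramps matters most for tail latency.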
\
For a fair median latency comparison, we consider an optimally-tuned (opt) version of existing EE models that performs one-time tuning on the actual test dataset, picking the best (latency-wise) thresholds that ensure an accuracy drop below 1%. As shown, due to its regular and less-constrained adaptation, Apparate outperforms even this oracle version of existing EEs, achieving up to 14.1% higher median latency savings.
\
:::info
This paper is available on arXiv under the CC BY-NC-ND 4.0 DEED license.
:::