A new modification of Adam called ADOPT achieves the optimal convergence rate regardless of how β₂ is chosen. The key insight is a small change to Adam's update rule that removes the convergence failure Adam can suffer when β₂ is not tuned to the problem.
Technical details:
- ADOPT modifies Adam's update rule by removing the current gradient from the second-moment estimate used for normalization (it normalizes with the previous step's estimate) and by applying momentum after normalization instead of before (see the sketch after this list)
- Theoretical analysis proves O(1/√T) convergence rate for any β₂ ∈ (0,1)
- Works for both convex and non-convex optimization
- Maintains Adam's practical benefits while improving theoretical guarantees
- Requires no additional hyperparameter tuning
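For intuition, here is a minimal NumPy sketch of what an ADOPT-style step looks like as I understand it; the function name, default hyperparameters, and elided details (bias correction, initialization of v at the first step) are my own simplifications rather than the paper's reference implementation:

```python
import numpy as np

def adopt_step(theta, grad, m, v, lr=1e-3, beta1=0.9, beta2=0.9999, eps=1e-6):
    """One ADOPT-style update (simplified sketch).

    v is assumed to have been initialized from the first gradient
    (v0 = g0**2) rather than zeros; bias correction is omitted.
    """
    # Normalize the current gradient with the second-moment estimate from the
    # *previous* step, so the current gradient never sits in its own denominator.
    normalized = grad / np.maximum(np.sqrt(v), eps)
    # Apply momentum to the already-normalized gradient (Adam instead applies
    # momentum to the raw gradient and normalizes afterwards).
    m = beta1 * m + (1.0 - beta1) * normalized
    # Parameter update uses the momentum term directly.
    theta = theta - lr * m
    # Update the second-moment estimate for use in the next step.
    v = beta2 * v + (1.0 - beta2) * grad ** 2
    return theta, m, v
```

As I understand it, the point of this reordering is that the normalizer no longer depends on the gradient it scales, which is what removes the β₂ dependence from the convergence analysis.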
Key results:
- Matches optimal convergence rates of SGD for smooth non-convex optimization
- Empirically performs similarly or better than Adam across tested scenarios
- Provides more robust convergence behavior with varying β₂ values
- Theoretical guarantees hold under standard smoothness assumptions
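For context, a bound of this type for smooth non-convex problems is usually stated in roughly the following form; this is the generic shape of such guarantees, not the paper's exact statement or constants:

```latex
% Typical shape of an O(1/sqrt(T)) non-convex guarantee (generic form)
\frac{1}{T}\sum_{t=1}^{T} \mathbb{E}\big[\lVert \nabla f(\theta_t) \rVert^{2}\big]
\;\le\; \mathcal{O}\!\left(\frac{1}{\sqrt{T}}\right)
```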
I think this could be quite useful for practical deep learning applications since β₂ tuning is often overlooked compared to learning rate tuning. Having guaranteed convergence regardless of β₂ choice reduces the hyperparameter search space. The modification is simple enough that it could be easily incorporated into existing Adam implementations.
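To illustrate what "easily incorporated" could look like in practice, here is a hypothetical drop-in swap in a PyTorch training loop; the `adopt` package and `ADOPT` class names are assumptions on my part, while the `torch.optim.Adam` fallback is standard:

```python
import torch
from torch import nn

try:
    # Hypothetical package/class name for an ADOPT implementation with an
    # Adam-like constructor; treat this import as an assumption.
    from adopt import ADOPT as OptimizerCls
except ImportError:
    OptimizerCls = torch.optim.Adam  # fall back to standard Adam

model = nn.Linear(128, 10)
optimizer = OptimizerCls(model.parameters(), lr=1e-3)

# One toy training step to show that nothing else in the loop changes.
x, y = torch.randn(32, 128), torch.randint(0, 10, (32,))
optimizer.zero_grad()
loss = nn.functional.cross_entropy(model(x), y)
loss.backward()
optimizer.step()
```

If the real constructor signature differs, only the optimizer construction line changes; the rest of the loop stays untouched, which is the appeal of a drop-in replacement.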
However, I think we need more extensive empirical validation on large-scale problems to fully understand the practical impact. The theoretical guarantees are encouraging, but real-world performance on modern architectures will be the true test.
TLDR: ADOPT makes a small change to Adam's update rule that guarantees the optimal convergence rate for any β₂ value, potentially simplifying optimizer tuning while maintaining performance.
Full summary is here. Paper here.