BayesAdapter: Being Bayesian, Inexpensively and Robustly, via Bayesian Fine-tuning

Zhijie Deng    Xiao Yang    Hao Zhang    Yinpeng Dong    Jun Zhu

Tsinghua

Paper | PyTorch code










Abstract

Despite their theoretical appeal, Bayesian neural networks (BNNs) lag far behind ordinary NNs in real-world adoption, mainly due to their limited scalability in training and the low fidelity of their uncertainty estimates. In this work, we develop a new framework, named BayesAdapter, to address these issues and bring Bayesian deep learning to the masses. The core notion of BayesAdapter is to adapt pre-trained deterministic NNs to be BNNs via Bayesian fine-tuning. We implement Bayesian fine-tuning with a plug-and-play instantiation of stochastic variational inference, and propose exemplar reparameterization to reduce gradient variance and stabilize the fine-tuning. Together, they enable training BNNs as if one were training deterministic NNs, with minimal added overhead. During Bayesian fine-tuning, we further propose an uncertainty regularization to supervise and calibrate the uncertainty quantification of learned BNNs at low cost. To empirically evaluate BayesAdapter, we conduct extensive experiments on a diverse set of challenging benchmarks, and observe significantly higher training efficiency, better predictive performance, and better-calibrated, more faithful uncertainty estimates than existing BNNs.



Core Idea

Unfold the learning of a BNN into two steps: deterministic pre-training of the deep neural network (DNN) counterpart of the BNN, followed by a few rounds of Bayesian fine-tuning.
Advantages

Deterministic Pre-training

This stage trains a regular DNN via maximum a posteriori (MAP) estimation:

$$\max_{\mathbf{w}}\; \sum_{i=1}^{N} \log p(y_i \mid x_i, \mathbf{w}) + \log p(\mathbf{w}).$$

Advanced techniques (e.g., radical data augmentation, batch normalization, data-parallel distributed training) can be freely incorporated to improve training.
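For intuition, the pre-training stage is just standard supervised training. Below is a minimal PyTorch sketch (not the repository's code); `pretrain_map`, `model`, `loader`, `epochs`, `lr`, and `prior_std` are placeholders of this illustration, and the built-in weight decay plays the role of the Gaussian prior.

```python
import torch
import torch.nn.functional as F

def pretrain_map(model, loader, epochs=100, lr=0.1, prior_std=1.0):
    """Deterministic MAP pre-training: cross-entropy + weight decay.

    With an isotropic Gaussian prior N(0, prior_std^2 I), the log-prior term
    of the MAP objective reduces to ordinary L2 weight decay.
    """
    n = len(loader.dataset)
    optimizer = torch.optim.SGD(
        model.parameters(), lr=lr, momentum=0.9,
        weight_decay=1.0 / (prior_std ** 2 * n))  # log-prior as weight decay
    for _ in range(epochs):
        for x, y in loader:
            loss = F.cross_entropy(model(x), y)   # negative log-likelihood
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```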

Bayesian Fine-tuning

To render the fine-tuning in the style of training normal NNs, we resort to stochastic variational inference (VI) to update the approximate posterior. Typically, we maximize the evidence lower bound (ELBO):

$$\max_{q(\mathbf{w})}\; \mathbb{E}_{q(\mathbf{w})}\big[\log p(\mathcal{D} \mid \mathbf{w})\big] - D_{\mathrm{KL}}\big(q(\mathbf{w}) \,\|\, p(\mathbf{w})\big),$$

where $q$ is the approximate posterior, initialized as a Gaussian centered at the converged parameters of the deterministic pre-training, and $p$ denotes the non-informative parameter prior (we consider an isotropic Gaussian prior without loss of generality).
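As a rough sketch of what one fine-tuning step looks like under a fully factorized Gaussian posterior (not the repository's API; `mu`, `log_sigma`, `forward_fn`, and the KL scaling by the dataset size `n` are assumptions of this illustration):

```python
import torch
import torch.nn.functional as F

def elbo_step(x, y, mu, log_sigma, forward_fn, optimizer, n, prior_std=1.0):
    """One stochastic VI step with a factorized Gaussian posterior N(mu, sigma^2)."""
    sigma = log_sigma.exp()
    eps = torch.randn_like(mu)
    w = mu + sigma * eps                      # reparameterized weight sample
    nll = F.cross_entropy(forward_fn(x, w), y)
    # KL(N(mu, sigma^2) || N(0, prior_std^2)), summed over weights, scaled by 1/n
    kl = (torch.log(prior_std / sigma)
          + (sigma ** 2 + mu ** 2) / (2 * prior_std ** 2) - 0.5).sum()
    loss = nll + kl / n                       # negative per-example ELBO
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss
```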
Two features distinguish our approach from existing variational BNNs and make the fine-tuning user-friendly and robust:

Optimizers with built-in weight decay
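The details are omitted on this page; as an illustrative sketch of one natural way built-in weight decay fits in: with a Gaussian prior, the mean-dependent part of the KL term is exactly an L2 penalty on the variational mean, so it can be handed to the optimizer's weight decay, leaving only the sigma-dependent terms to add to the loss explicitly. The variable names, prior scale, and dataset size below are assumptions.

```python
import torch

# Stand-in for weights taken from the pre-trained DNN (assumption for this sketch).
pretrained_weight = torch.randn(128, 64)

# Variational parameters: means initialized from the pre-trained weights,
# log-standard-deviations initialized small.
mu = pretrained_weight.clone().requires_grad_(True)
log_sigma = torch.full_like(mu, -5.0, requires_grad=True)

prior_std = 1.0   # isotropic Gaussian prior N(0, prior_std^2 I)
n = 50_000        # dataset size; the per-example ELBO scales the KL by 1/n

# KL(N(mu, sigma^2) || N(0, prior_std^2)) contains ||mu||^2 / (2 * prior_std^2),
# i.e. plain L2 regularization on mu -- delegate it to built-in weight decay.
optimizer = torch.optim.SGD(
    [{'params': [mu], 'weight_decay': 1.0 / (prior_std ** 2 * n)},
     {'params': [log_sigma], 'weight_decay': 0.0}],
    lr=1e-3, momentum=0.9)
```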





Exemplar reparameterization
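Since the body of this section is omitted here, the snippet below is only a rough sketch of the general idea for a linear layer: instead of sharing one weight sample across the whole minibatch, draw an independent weight sample per exemplar. The direct `einsum` realization trades memory for clarity and is an assumption of this illustration, not the repository's implementation.

```python
import torch

def exemplar_linear(x, mu, log_sigma):
    """Linear layer with one independent Gaussian weight sample per exemplar.

    x: [B, in_features]; mu, log_sigma: [out_features, in_features].
    """
    sigma = log_sigma.exp()
    # Independent weight perturbation for every exemplar in the minibatch,
    # which decorrelates per-example gradients and reduces gradient variance.
    eps = torch.randn(x.size(0), *mu.shape, device=x.device)   # [B, out, in]
    w = mu.unsqueeze(0) + sigma.unsqueeze(0) * eps              # [B, out, in]
    return torch.einsum('boi,bi->bo', w, x)                     # y_b = W_b x_b
```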





Uncertainty regularization
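The concrete form of the regularizer is not reproduced on this page. Purely as an illustrative stand-in (not necessarily the exact regularizer in the paper), one low-cost way to supervise uncertainty is a margin loss that pushes the predictive entropy on out-of-distribution or heavily perturbed inputs above that on clean inputs:

```python
import torch
import torch.nn.functional as F

def uncertainty_margin_loss(logits_in, logits_out, margin=1.0):
    """Hinge-style regularizer (illustrative): predictive entropy on OOD or
    perturbed inputs should exceed that on in-distribution inputs by `margin`.
    """
    def entropy(logits):
        p = F.softmax(logits, dim=-1)
        return -(p * torch.log(p + 1e-8)).sum(dim=-1)
    return F.relu(margin + entropy(logits_in) - entropy(logits_out)).mean()
```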





Results (predictive performance)





Results (quality of uncertainty estimates)





Some out-of-distribution samples used in the validation phase




Citation

Zhijie Deng, Xiao Yang, Hao Zhang, Yinpeng Dong, and Jun Zhu. "BayesAdapter: Being Bayesian, Inexpensively and Robustly, via Bayesian Fine-tuning". Bibtex