**About.** I’m a Research Scientist at Salesforce Research. Previously, I was a Ph.D. candidate in the Center for Vision, Cognition, Learning and Autonomy (VCLA) at UCLA, advised by Prof. Song-Chun Zhu and Prof. Ying Nian Wu. I’ve also spent time at IBM Research, Google Research, and Salesforce Einstein Research. My research is generously supported by the UCLA DYF Fellowship, an XSEDE extreme science and engineering grant, and an NVIDIA GPU grant.

**Research interests.** Representation Learning, Generative Models, Unsupervised Learning, Energy Based Models, Variational Approximation, Computer Vision, Natural Language Processing.

**Research themes.** The governing themes of our research are to (i) *advance and establish energy-based models* and (ii) *increase the sample efficiency in the learning of LLMs*, along four directions:

(1) Latent space modelling and sampling.

(2) Variations of MCMC-based learning.

(3) Joint training of EBMs without resorting to MCMC.

(4) Sample efficient learning of large language models.

**Selected publications.**

Learning Energy-based Model with Flow-based Backbone by Neural Transport MCMC

*Nijkamp, Erik*,
Gao, Ruiqi,
Sountsov, Pavel,
Vasudevan, Srinivas,
Pang, Bo,
Zhu, Song-Chun,
and Wu, Ying Nian

*arXiv preprint arXiv:2006.06897*
2020

Learning an energy-based model (EBM) requires MCMC sampling of the learned model as the inner loop of the learning algorithm. However, MCMC sampling of an EBM in data space generally does not mix, because the energy function, which is usually parametrized by a deep network, is highly multi-modal in the data space. This is a serious handicap for both the theory and practice of EBMs. In this paper, we propose to learn an EBM with a flow-based model serving as a backbone, so that the EBM is a correction or an exponential tilting of the flow-based model. We show that the model has a particularly simple form in the space of the latent variables of the flow-based model, and MCMC sampling of the EBM in the latent space, which is a simple special case of neural transport MCMC, mixes well and traverses modes in the data space. This enables proper sampling and learning of EBMs.
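The "particularly simple form" in latent space can be sketched as follows. The symbols are assumptions introduced here for illustration and do not appear in the abstract: $q_\alpha$ is the flow density, $g_\alpha$ its invertible map, $q_0$ the Gaussian base distribution, and $f_\theta$ the correction term.

```latex
p_\theta(x) = \frac{1}{Z(\theta)} \exp\big(f_\theta(x)\big)\, q_\alpha(x),
\qquad x = g_\alpha(z), \quad z \sim q_0(z) = \mathcal{N}(0, I)
\;\;\Longrightarrow\;\;
p(z) = \frac{1}{Z(\theta)} \exp\big(f_\theta(g_\alpha(z))\big)\, q_0(z).
```

Under the change of variables $x = g_\alpha(z)$, the Jacobian of the flow cancels, so in latent space the model is a tilting of a unimodal Gaussian. This is one way to read the claim that latent-space MCMC mixes better than data-space MCMC.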

Learning Multi-layer Latent Variable Model via Variational Optimization of Short Run MCMC for Approximate Inference

*Nijkamp, Erik*,
Pang, Bo,
Han, Tian,
Zhou, Linqi,
Zhu, Song-Chun,
and Wu, Ying Nian

*European Conference on Computer Vision (ECCV)*
2020

This paper studies the fundamental problem of learning deep generative models that consist of multiple layers of latent variables organized in top-down architectures. Such models have high expressivity and allow for learning hierarchical representations. Learning such a generative model requires inferring the latent variables for each training example based on the posterior distribution of these latent variables. The inference typically requires Markov chain Monte Carlo (MCMC), which can be time-consuming. In this paper, we propose to use noise-initialized non-persistent short-run MCMC, such as finite-step Langevin dynamics initialized from the prior distribution of the latent variables, as an approximate inference engine, where the step size of the Langevin dynamics is variationally optimized by minimizing the Kullback-Leibler divergence between the distribution produced by the short-run MCMC and the posterior distribution. Our experiments show that the proposed method outperforms the variational auto-encoder (VAE) in terms of reconstruction error and synthesis quality. The advantage of the proposed method is that it is simple and automatic without the need to design an inference model.
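A minimal sketch of the inference scheme described above, written as the standard Langevin update. The notation is an assumption made here: $s$ is the step size, $K$ the number of steps, and $q_s(z \mid x)$ the distribution of the final state $z_K$.

```latex
z_0 \sim p(z), \qquad
z_{k+1} = z_k + \frac{s^2}{2}\, \nabla_z \log p_\theta(z_k \mid x) + s\, \epsilon_k,
\quad \epsilon_k \sim \mathcal{N}(0, I), \quad k = 0, \dots, K-1,
\qquad
s^\ast = \arg\min_{s}\; \mathrm{KL}\big(q_s(z \mid x) \,\|\, p_\theta(z \mid x)\big).
```

Since $\nabla_z \log p_\theta(z \mid x) = \nabla_z \log p_\theta(x, z)$, the update only needs the gradient of the joint log-density, which is why no separate inference network has to be designed.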

On the Anatomy of MCMC-based Maximum Likelihood Learning of Energy-Based Models

*Nijkamp, Erik*,
Hill, Mitch,
Han, Tian,
Zhu, Song-Chun,
and Wu, Ying Nian

*Association for the Advancement of Artificial Intelligence (AAAI)*
2020

This study investigates the effects of Markov chain Monte Carlo (MCMC) sampling in unsupervised Maximum Likelihood (ML) learning. Our attention is restricted to the family of unnormalized probability densities for which the negative log density (or energy function) is a ConvNet. We find that many of the techniques used to stabilize training in previous studies are not necessary. ML learning with a ConvNet potential requires only a few hyper-parameters and no regularization. Using this minimal framework, we identify a variety of ML learning outcomes that depend solely on the implementation of MCMC sampling.
On one hand, we show that it is easy to train an energy-based model which can sample realistic images with short-run Langevin. ML can be effective and stable even when MCMC samples have much higher energy than true steady-state samples throughout training. Based on this insight, we introduce an ML method with purely noise-initialized MCMC, high-quality short-run synthesis, and the same budget as ML with informative MCMC initialization such as CD or PCD. Unlike previous models, our energy model can obtain realistic high-diversity samples from a noise signal after training.
On the other hand, ConvNet potentials learned with non-convergent MCMC do not have a valid steady-state and cannot be considered approximate unnormalized densities of the training data because long-run MCMC samples differ greatly from observed images. We show that it is much harder to train a ConvNet potential to learn a steady-state over realistic images. To our knowledge, long-run MCMC samples of all previous models lose the realism of short-run samples. With correct tuning of Langevin noise, we train the first ConvNet potentials for which long-run and steady-state MCMC samples are realistic images.

Learning Latent Space Energy-Based Prior Model

Pang, Bo*,
Han, Tian*,
*Nijkamp, Erik**,
Zhu, Song-Chun,
and Wu, Ying Nian

*Advances in Neural Information Processing Systems (NeurIPS)*
2020

The generator model assumes that the observed example is generated by a low-dimensional latent vector via a top-down network, and the latent vector follows a simple and known prior distribution, such as uniform or Gaussian white noise distribution. While we can learn an expressive top-down network to map the prior distribution to the data distribution, we can also learn an expressive prior model instead of assuming a given prior distribution. This follows the philosophy of empirical Bayes where the prior model is learned from the observed data. We propose to learn an energy-based prior model for the latent vector, where the energy function is parametrized by a very simple multi-layer perceptron. Due to the low-dimensionality of the latent space, learning a latent space energy-based prior model proves to be both feasible and desirable. In this paper, we develop the maximum likelihood learning algorithm and its variation based on short-run Markov chain Monte Carlo sampling from the prior and the posterior distributions of the latent vector, and we show that the learned model exhibits strong performance in terms of image and text generation and anomaly detection.
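In symbols, the learned prior and its maximum likelihood gradient can be sketched as below. The notation is assumed for illustration: $p_0$ is the Gaussian reference distribution, $f_\alpha$ the multi-layer perceptron, and $\beta$ the parameters of the top-down network.

```latex
p_\alpha(z) = \frac{1}{Z(\alpha)} \exp\big(f_\alpha(z)\big)\, p_0(z),
\qquad
\nabla_\alpha \log p_{\alpha,\beta}(x)
= \mathbb{E}_{p(z \mid x)}\big[\nabla_\alpha f_\alpha(z)\big]
- \mathbb{E}_{p_\alpha(z)}\big[\nabla_\alpha f_\alpha(z)\big].
```

Both expectations are over the low-dimensional latent vector, so they can be approximated with short-run MCMC from the posterior and the prior respectively.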

Flow Contrastive Estimation of Energy-Based Models

Gao, Ruiqi,
*Nijkamp, Erik*,
Kingma, Diederik P,
Xu, Zhen,
Dai, Andrew M,
and Wu, Ying Nian

*Conference on Computer Vision and Pattern Recognition (CVPR)*
2020

This paper studies a training method to jointly estimate an energy-based model and a flow-based model, in which the two models are iteratively updated based on a shared adversarial value function. This joint training method has the following traits. (1) The update of the energy-based model is based on noise contrastive estimation, with the flow model serving as a strong noise distribution. (2) The update of the flow model approximately minimizes the Jensen-Shannon divergence between the flow model and the data distribution. (3) Unlike generative adversarial networks (GANs), which estimate an implicit probability distribution defined by a generator model, our method estimates two explicit probabilistic distributions on the data. Using the proposed method, we demonstrate a significant improvement in the synthesis quality of the flow model, and show the effectiveness of unsupervised feature learning by the learned energy-based model. Furthermore, the proposed training method can be easily adapted to semi-supervised learning. We achieve results competitive with state-of-the-art semi-supervised learning methods.
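The shared value function can be illustrated with a short NumPy sketch of the noise contrastive estimation objective, with the flow playing the role of the noise distribution. This is a toy under stated assumptions, not the paper's implementation: `fce_value` and `log_normal` are hypothetical helpers, and both densities are taken to be normalized 1-D Gaussians so their log-densities are available in closed form.

```python
import numpy as np

def fce_value(log_p_data, log_q_data, log_p_flow, log_q_flow):
    """Monte Carlo estimate of the shared value function.

    log_p_*: log-density of the energy-based model p (assumed normalized here)
    log_q_*: log-density of the flow model q
    *_data:  evaluated on data samples; *_flow: evaluated on flow samples
    """
    # log [p / (p + q)] on data samples
    log_d_data = log_p_data - np.logaddexp(log_p_data, log_q_data)
    # log [q / (p + q)] on flow samples
    log_d_flow = log_q_flow - np.logaddexp(log_p_flow, log_q_flow)
    return log_d_data.mean() + log_d_flow.mean()

def log_normal(x, mean, std):
    # Log-density of a 1-D Gaussian, standing in for both models in this toy
    return -0.5 * ((x - mean) / std) ** 2 - np.log(std) - 0.5 * np.log(2 * np.pi)

rng = np.random.default_rng(0)
x_data = rng.normal(0.0, 1.0, size=2000)

# Case 1: the flow N(1.5, 1) does not match the data N(0, 1)
x_flow = rng.normal(1.5, 1.0, size=2000)
v_mismatch = fce_value(
    log_normal(x_data, 0.0, 1.0), log_normal(x_data, 1.5, 1.0),
    log_normal(x_flow, 0.0, 1.0), log_normal(x_flow, 1.5, 1.0),
)

# Case 2: the flow equals the energy-based model, so every ratio is 1/2
x_same = rng.normal(0.0, 1.0, size=2000)
v_match = fce_value(
    log_normal(x_data, 0.0, 1.0), log_normal(x_data, 0.0, 1.0),
    log_normal(x_same, 0.0, 1.0), log_normal(x_same, 0.0, 1.0),
)
print(v_mismatch, v_match)
```

In this sketch the energy-based model would be updated to increase the value and the flow to decrease it. When the two densities coincide the value equals $-2\log 2$, its minimum at the optimal classifier, which is consistent with the Jensen-Shannon interpretation in trait (2).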

Joint Training of Variational Auto-Encoder and Latent Energy-Based Model

Han, Tian,
*Nijkamp, Erik*,
Zhou, Linqi,
Pang, Bo,
Zhu, Song-Chun,
and Wu, Ying Nian

*Conference on Computer Vision and Pattern Recognition (CVPR)*
2020

This paper proposes a joint training method to learn both the variational auto-encoder (VAE) and the latent energy-based model (EBM). The joint training of VAE and latent EBM is based on an objective function that consists of three Kullback-Leibler divergences between three joint distributions on the latent vector and the image, and the objective function is of an elegant symmetric and anti-symmetric form of divergence triangle that seamlessly integrates variational and adversarial learning. In this joint training scheme, the latent EBM serves as a critic of the generator model, while the generator model and the inference model in VAE serve as the approximate synthesis sampler and inference sampler of the latent EBM. Our experiments show that the joint training greatly improves the synthesis quality of the VAE. It also enables learning of an energy function that is capable of detecting out-of-sample examples for anomaly detection.

Divergence Triangle for Joint Training of Generator model, Energy-based model, and Inferential model

Han, Tian*,
*Nijkamp, Erik**,
Fang, Xiaolin,
Hill, Mitch,
Zhu, Song-Chun,
and Wu, Ying Nian

*Conference on Computer Vision and Pattern Recognition (CVPR)*
2019

This paper proposes the divergence triangle as a framework for joint training of a generator model, an energy-based model, and an inference model. The divergence triangle is a compact and symmetric (anti-symmetric) objective function that seamlessly integrates variational learning, adversarial learning, the wake-sleep algorithm, and contrastive divergence in a unified probabilistic formulation. This unification makes the processes of sampling, inference, and energy evaluation readily available without the need for costly Markov chain Monte Carlo methods. Our experiments demonstrate that the divergence triangle is capable of learning (1) an energy-based model with a well-formed energy landscape, (2) direct sampling in the form of a generator network, and (3) feed-forward inference that faithfully reconstructs observed as well as synthesized data. The divergence triangle is a robust training method that can learn from incomplete data.
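A sketch of the symmetric and anti-symmetric form, under notation assumed here rather than taken from the abstract: $Q$ is the joint distribution of data and inferred latent vector induced by the inference model, $P$ the joint distribution induced by the generator model, and $\Pi$ the joint distribution associated with the energy-based model.

```latex
\Delta = \mathrm{KL}(Q \,\|\, P) + \mathrm{KL}(P \,\|\, \Pi) - \mathrm{KL}(Q \,\|\, \Pi),
\qquad
\min_{\text{generator, inference}} \; \max_{\text{energy}} \; \Delta .
```

The first term alone is the variational objective. The remaining two terms, with opposite signs, supply the adversarial contrast between the generator and the energy-based model.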

Learning Non-convergent Non-persistent Short-run MCMC toward Energy-Based Model

*Nijkamp, Erik*,
Hill, Mitch,
Zhu, Song-Chun,
and Wu, Ying Nian

*Advances in Neural Information Processing Systems (NeurIPS)*
2019

This paper studies a curious phenomenon in learning energy-based model (EBM) using MCMC. In each learning iteration, we generate synthesized examples by running a non-convergent, non-mixing, and non-persistent short-run MCMC toward the current model, always starting from the same initial distribution such as uniform noise distribution, and always running a fixed number of MCMC steps. After generating synthesized examples, we then update the model parameters according to the maximum likelihood learning gradient, as if the synthesized examples are fair samples from the current model. We treat this non-convergent short-run MCMC as a learned generator model or a flow model. We provide arguments for treating the learned non-convergent short-run MCMC as a valid model. We show that the learned short-run MCMC is capable of generating realistic images. More interestingly, unlike traditional EBM or MCMC, the learned short-run MCMC is capable of reconstructing observed images and interpolating between images, like generator or flow models. The code can be found in the Appendix.
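The learning loop described above can be sketched on a 1-D toy problem. This is an illustrative sketch and not the code from the paper's appendix: the quadratic energy $U_\theta(x) = (x - \theta)^2 / 2$, the function names, and all hyper-parameters are assumptions chosen so the example runs in a few lines of NumPy.

```python
import numpy as np

def grad_energy(x, theta):
    # Toy energy U_theta(x) = (x - theta)^2 / 2, so dU/dx = x - theta
    return x - theta

def short_run_langevin(theta, n, steps, step_size, rng):
    # Non-persistent: every call restarts from the same uniform noise distribution
    x = rng.uniform(-1.0, 1.0, size=n)
    for _ in range(steps):  # fixed, small number of steps (non-convergent)
        noise = rng.standard_normal(n)
        x = x - 0.5 * step_size**2 * grad_energy(x, theta) + step_size * noise
    return x

def train(data, iters=500, lr=0.1, steps=20, step_size=0.3, seed=0):
    rng = np.random.default_rng(seed)
    theta = 0.0
    for _ in range(iters):
        synth = short_run_langevin(theta, len(data), steps, step_size, rng)
        # Maximum likelihood gradient for this energy, treating the short-run
        # samples as if they were fair samples from the current model:
        # E_data[x - theta] - E_synth[x - theta] = mean(data) - mean(synth)
        theta += lr * (data.mean() - synth.mean())
    return theta, rng

data = np.random.default_rng(1).normal(2.0, 1.0, size=1000)
theta, rng = train(data)
synth = short_run_langevin(theta, 1000, 20, 0.3, rng)
print(f"theta={theta:.2f}  data mean={data.mean():.2f}  short-run mean={synth.mean():.2f}")
```

In this toy, the learned `theta` settles well above the data mean, so the energy itself is a biased model of the data, while the output of the 20-step sampler matches the data mean. That mirrors the abstract's point that the learned short-run MCMC, rather than the energy function, is what behaves like a valid generator.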