
Neural Variational Inference: Classical Theory


As a member of the Bayesian methods research group, I'm heavily interested in the Bayesian approach to machine learning. One of the strengths of this approach is the ability to work with hidden (unobserved) variables that are interpretable. This power, however, comes at the cost of generally intractable exact inference, which limits the scope of solvable problems.

Another topic that has gained a lot of momentum in Machine Learning recently is, of course, Deep Learning. With Deep Learning we can now build big and complex models that outperform most hand-engineered approaches, given lots of data and computational power. The fact that Deep Learning needs a considerable amount of data also requires these methods to be scalable, which is a really nice property for any algorithm to have, especially in the Big Data era.

Given how appealing both topics are, it's no surprise there has been some recent work to marry the two. In this series of blog posts I'd like to summarize recent advances, particularly in variational inference. This is not meant to be an introductory discussion, as prior familiarity with classical topics (latent variable models, variational inference, the mean-field approximation) is assumed, though I'll revisit these ideas anyway as a refresher and to set up the notation.

Latent Variable Models

Suppose you have a probabilistic model that's easy to describe using some auxiliary variables $z$ that you don't observe directly (and might even want to infer given the data). A classical example of this setup is the Gaussian Mixture Model: we have $K$ components in a mixture, and $z_n$ is a one-hot vector of dimensionality $K$ indicating which component the observation $x_n$ belongs to. Conditioned on $z_n$, the distribution of $x_n$ is an ordinary Gaussian: $p(x_n \mid z_{nk} = 1) = \mathcal{N}(x_n \mid \mu_k, \Sigma_k)$ (whenever I refer to a distribution here, read it as its density, at least a generalized one). The joint distribution of the model is therefore

$$p(x, z \mid \Theta) = \prod_{n=1}^{N} \prod_{k=1}^{K} \mathcal{N}(x_n \mid \mu_k, \Sigma_k)^{z_{nk}} \, \pi_k^{z_{nk}}$$

where $\pi$ is a probability distribution over the $K$ outcomes, and $\Theta$ is the set of all model parameters ($\pi$, the $\mu_k$ and the $\Sigma_k$).
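To make the notation concrete, here is a minimal NumPy sketch of this generative process (the parameter values and the choice of $K = 3$ components in 2D are purely illustrative): sample a one-hot $z_n$ from $\pi$, then sample $x_n$ from the corresponding Gaussian.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters Theta = (pi, mu, Sigma) for K = 3 components in 2D.
pi = np.array([0.5, 0.3, 0.2])                        # mixing weights, sum to 1
mu = np.array([[0.0, 0.0], [4.0, 4.0], [-4.0, 4.0]])  # component means
Sigma = np.stack([np.eye(2)] * 3)                     # component covariances

def sample_gmm(n):
    """Draw n points: first z_n ~ Categorical(pi) (stored as a one-hot vector),
    then x_n | z_nk = 1 ~ N(mu_k, Sigma_k)."""
    ks = rng.choice(len(pi), size=n, p=pi)            # latent component indices
    xs = np.array([rng.multivariate_normal(mu[k], Sigma[k]) for k in ks])
    zs = np.eye(len(pi))[ks]                          # one-hot encodings of z_n
    return xs, zs

x, z = sample_gmm(500)   # x: (500, 2) observations, z: (500, 3) latent indicators
```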

We'd like to do two things with this model: first, we obviously need to learn the parameters $\Theta$, and second, we'd like to infer the latent variables $z_n$ to know which cluster each observation $x_n$ belongs to, that is, we need to calculate the distribution of $z_n$ conditioned on $x_n$: $p(z_n \mid x_n)$.

We want to learn the parameters $\Theta$, as usual, by maximizing the log-likelihood. Unfortunately, we don't know the true assignments $z_n$, and marginalizing them out as in $p(x_n) = \sum_{k=1}^{K} \pi_k \, p(x_n \mid z_{nk} = 1)$ is not a good idea, as the resulting optimization problem is non-convex and hard to attack directly. Instead we decompose the log-likelihood as follows:

$$\log p(x) = \underbrace{\mathbb{E}_{q(z \mid x)} \log p(x)}_{\text{const in } z} = \mathbb{E}_{q(z \mid x)} \log \frac{p(x, z) \, q(z \mid x)}{p(z \mid x) \, q(z \mid x)} = \mathbb{E}_{q(z \mid x)} \log \frac{p(x, z)}{q(z \mid x)} + D_{KL}\big(q(z \mid x) \,\|\, p(z \mid x)\big)$$

The second term is a Kullback-Leibler divergence, which is always non-negative and equals zero iff the distributions are equal almost everywhere, $q(z \mid x) = p(z \mid x)$. Therefore setting $q(z \mid x) = p(z \mid x)$ eliminates the second term, leaving us with $\log p(x) = \mathbb{E}_{p(z \mid x)} \log \frac{p(x, z)}{p(z \mid x)}$. So all we need to be able to do is calculate the posterior $p(z \mid x)$ and maximize the expectation. This is how the EM algorithm is derived: at the E-step we calculate the posterior $p(z \mid x, \Theta_{\text{old}})$, and at the M-step we maximize the expectation $\mathbb{E}_{p(z \mid x, \Theta_{\text{old}})} \log \frac{p(x, z \mid \Theta)}{p(z \mid x, \Theta_{\text{old}})}$ with respect to $\Theta$, keeping $\Theta_{\text{old}}$ fixed.
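For the GMM this recipe turns into the familiar EM updates: posterior responsibilities in the E-step, weighted Gaussian fits in the M-step. Below is a rough NumPy/SciPy sketch under my own illustrative choices (random-point initialization, a fixed number of iterations, and a small diagonal ridge on the covariances for numerical stability); it is meant to show the structure of the algorithm, not to be a production implementation.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(x, K, n_iter=50, seed=0):
    """Plain EM for a Gaussian mixture; x has shape (N, D)."""
    rng = np.random.default_rng(seed)
    N, D = x.shape
    # Initialize Theta: uniform weights, K random data points as means,
    # the data covariance (plus a tiny ridge) for every component.
    pi = np.full(K, 1.0 / K)
    mu = x[rng.choice(N, K, replace=False)]
    Sigma = np.stack([np.cov(x.T) + 1e-6 * np.eye(D)] * K)
    for _ in range(n_iter):
        # E-step: responsibilities r_nk = p(z_nk = 1 | x_n, Theta_old).
        r = np.stack([pi[k] * multivariate_normal.pdf(x, mu[k], Sigma[k])
                      for k in range(K)], axis=1)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: maximize E_{p(z | x, Theta_old)} log p(x, z | Theta) over Theta.
        Nk = r.sum(axis=0)
        pi = Nk / N
        mu = (r.T @ x) / Nk[:, None]
        for k in range(K):
            d = x - mu[k]
            Sigma[k] = (r[:, k, None] * d).T @ d / Nk[k] + 1e-6 * np.eye(D)
    return pi, mu, Sigma

# pi_hat, mu_hat, Sigma_hat = em_gmm(x, K=3)   # e.g. on the samples drawn above
```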

Now all we are left to do is find the posterior $p(z \mid x)$, which is given by the following deceptively simple formula known as Bayes' rule:

$$p(z \mid x) = \frac{p(x \mid z) \, p(z)}{\int p(x \mid z) \, p(z) \, dz}$$

Of course, there's no free lunch, and computing the denominator is intractable in the general case. One can compute the posterior exactly when the prior $p(z)$ and the likelihood $p(x \mid z)$ are conjugate (that is, after multiplying the prior by the likelihood you get the same functional form in $z$ as in the prior, so the posterior comes from the same family as the prior, just with different parameters), but many models of practical interest don't have this property. This is where variational inference comes in.
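As a concrete textbook instance of conjugacy (a standard side example, not part of the GMM above): a Beta prior on the success probability of a Bernoulli likelihood yields a Beta posterior, with the observed counts simply added to the prior's pseudo-counts:

$$p(z) = \mathrm{Beta}(z \mid \alpha, \beta) \propto z^{\alpha - 1} (1 - z)^{\beta - 1}, \qquad p(x \mid z) = \prod_{n=1}^{N} z^{x_n} (1 - z)^{1 - x_n}$$

$$p(z \mid x) \propto z^{\alpha + \sum_n x_n - 1} (1 - z)^{\beta + N - \sum_n x_n - 1} \quad \Longrightarrow \quad p(z \mid x) = \mathrm{Beta}\Big(z \,\Big|\, \alpha + \sum_n x_n, \; \beta + N - \sum_n x_n\Big)$$

Here the normalizing integral is just a Beta function, so no intractable computation ever appears.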

Variational Inference and Mean-field

In the variational inference (VI) framework we approximate the true posterior $p(z \mid x)$ with some other, simpler distribution $q(z \mid x, \Lambda)$, where $\Lambda$ is the set of (variational) parameters of the (variational) approximation (I may omit $\Lambda$ and $\Theta$ from the "given" clause when it's convenient; remember, they can always be placed there). One possibility is to divide the latent variables $z$ into groups and force the groups to be independent. In the extreme case each variable gets its own group, assuming independence among all variables: $q(z \mid x) = \prod_{d=1}^{D} q(z_d \mid x)$; this is the mean-field approximation. If we then set out to find the best approximation to the true posterior in this fully factorized class, the optimal $q$ will no longer be the true posterior itself, as the true posterior is presumably too complicated to be dealt with in analytic form (which is exactly what we want from the approximation $q$ when we say "simpler distribution"). We therefore find the optimal factor $q(z_i \mid x)$ by minimizing the KL-divergence with the true posterior (const denotes terms that are constant w.r.t. $q(z_i \mid x)$):

$$\begin{aligned} D_{KL}\big(q(z \mid x) \,\|\, p(z \mid x)\big) &= \mathbb{E}_{q(z_i \mid x)} \Big[ \mathbb{E}_{q(z_{-i} \mid x)} \log \frac{q(z_1 \mid x) \cdots q(z_D \mid x)}{p(z \mid x)} \Big] \\ &= \mathbb{E}_{q(z_i \mid x)} \Big[ \log q(z_i \mid x) - \underbrace{\mathbb{E}_{q(z_{-i} \mid x)} \log p(z \mid x)}_{\log f(z_i \mid x)} \Big] + \text{const} \\ &= \mathbb{E}_{q(z_i \mid x)} \Big[ \log \frac{q(z_i \mid x)}{\tfrac{1}{Z} f(z_i \mid x)} \Big] - \log Z + \text{const} \\ &= D_{KL}\Big( q(z_i \mid x) \,\Big\|\, \tfrac{1}{Z} f(z_i \mid x) \Big) + \text{const} \end{aligned}$$

This is minimized when $q(z_i \mid x) = \frac{1}{Z} f(z_i \mid x)$, that is, when $\log q(z_i \mid x) = \mathbb{E}_{q(z_{-i} \mid x)} \log p(z \mid x) + \text{const}$. For many models it's possible to look at $\mathbb{E}_{q(z_{-i} \mid x)} \log p(z \mid x)$ and immediately recognize the logarithm of an unnormalized density of some known distribution, which gives the optimal factor in closed form.
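As a tiny concrete illustration (a standard toy case of my choosing, not tied to the GMM above), take the "true posterior" to be a correlated 2D Gaussian and fit a fully factorized $q(z_1) q(z_2)$ by coordinate ascent. For a Gaussian target, $\mathbb{E}_{q(z_{-i} \mid x)} \log p(z \mid x)$ is a quadratic in $z_i$, so each optimal factor is itself Gaussian with precision $\Lambda_{ii}$ and a mean that depends on the other factor's current mean:

```python
import numpy as np

# Toy "true posterior" p(z | x) = N(z | m, S), a correlated 2D Gaussian.
m = np.array([1.0, -1.0])
S = np.array([[1.0, 0.8],
              [0.8, 1.0]])
Lam = np.linalg.inv(S)                # precision matrix

# Mean-field approximation q(z) = q(z1) q(z2). Recognizing the Gaussian form of
# E_{q(z_-i)} log p(z | x) gives Gaussian factors with fixed precisions Lam[i, i]
# and means that we update by coordinate ascent.
mu_q = np.zeros(2)                    # factor means (initial guess)
var_q = 1.0 / np.diag(Lam)            # factor variances, fixed by the update

for _ in range(20):                   # alternate the two closed-form updates
    mu_q[0] = m[0] - (Lam[0, 1] / Lam[0, 0]) * (mu_q[1] - m[1])
    mu_q[1] = m[1] - (Lam[1, 0] / Lam[1, 1]) * (mu_q[0] - m[0])

print(mu_q)    # converges to the true posterior mean [1, -1]
print(var_q)   # [0.36, 0.36]: mean-field underestimates the true marginal variances (1.0)
```

The recovered means are exact, but the factorized approximation is too confident, a well-known property of minimizing this direction of the KL-divergence.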

Another cornerstone of this framework is the notion of the Evidence Lower Bound (ELBO): recall the decomposition of the log-likelihood we derived above. In our current setting we cannot compute its right-hand side, as we cannot evaluate the true posterior $p(z \mid x)$. However, note that the left-hand side (that is, the log-likelihood) does not depend on the variational distribution $q(z \mid x, \Lambda)$. Therefore, maximizing the first term of the right-hand side w.r.t. the variational parameters $\Lambda$ amounts to minimizing the second term, the KL-divergence with the true posterior. This means we can drop the second term and maximize the first one w.r.t. both the model parameters $\Theta$ and the variational parameters $\Lambda$:

$$\text{ELBO:} \quad \mathcal{L}(\Theta, \Lambda) := \mathbb{E}_{q(z \mid x, \Lambda)} \log \frac{p(x, z \mid \Theta)}{q(z \mid x, \Lambda)}$$
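The expectation can always be estimated by Monte Carlo: sample from $q$ and average $\log p(x, z \mid \Theta) - \log q(z \mid x, \Lambda)$. Here is a quick numerical sanity check on a toy conjugate model of my choosing (picked only because $\log p(x)$ is available in closed form; the estimator itself is generic): the estimate never exceeds the log-evidence and matches it exactly when $q$ is the true posterior.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Toy model: p(z) = N(0, 1), p(x | z) = N(x | z, 1), a single observation x.
# The true posterior is N(x/2, 1/2) and log p(x) = log N(x | 0, 2), so the
# tightness of the bound can be checked directly.
x = 1.5

def elbo(lam_mu, lam_sigma, n_samples=100_000):
    """Monte Carlo estimate of L(Theta, Lambda) = E_q[log p(x, z) - log q(z | x)]."""
    z = rng.normal(lam_mu, lam_sigma, size=n_samples)          # z ~ q(z | x, Lambda)
    log_joint = norm.logpdf(z, 0.0, 1.0) + norm.logpdf(x, z, 1.0)
    log_q = norm.logpdf(z, lam_mu, lam_sigma)
    return np.mean(log_joint - log_q)

print(elbo(0.0, 1.0))                       # a loose bound: strictly below log p(x)
print(elbo(x / 2, np.sqrt(0.5)))            # q = true posterior: the bound is tight
print(norm.logpdf(x, 0.0, np.sqrt(2.0)))    # log-evidence log p(x) for comparison
```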

Okay, so this covers the basics, but before we move on to the neural-network-based methods we need to discuss some general approaches to VI and how to make it scalable. That is what the next blog post is all about.
