F-divergence

In probability theory, an ƒ-divergence is a function D_f(P || Q) that measures the difference between two probability distributions P and Q. Intuitively, the divergence is the average under Q of the function f applied to the likelihood ratio dP/dQ.

These divergences were introduced and studied independently by Csiszár (1963), Morimoto (1963) and Ali & Silvey (1966) and are sometimes known as Csiszár ƒ-divergences, Csiszár-Morimoto divergences or Ali-Silvey distances.

Definition

Let P and Q be two probability distributions over a space Ω such that P is absolutely continuous with respect to Q. Then, for a convex function f such that f(1) = 0, the f-divergence of Q from P is defined as

 D_f(P\parallel Q) \equiv \int_{\Omega} f\left(\frac{dP}{dQ}\right)\,dQ.

If P and Q are both absolutely continuous with respect to a reference distribution μ on Ω then their probability densities p and q satisfy dP = p dμ and dQ = q dμ. In this case the f-divergence can be written as

 D_f(P\parallel Q) = \int_{\Omega} f\left(\frac{p(x)}{q(x)}\right)q(x)\,d\mu(x).
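On a finite space Ω with μ taken as the counting measure, the integral above reduces to a sum over the points of Ω. The following is a minimal sketch under that assumption; the helper names `f_divergence` and `kl_generator` are illustrative and do not come from any particular library.

```python
import math


def f_divergence(p, q, f):
    """Discrete D_f(P || Q) = sum over x of q(x) * f(p(x) / q(x)).

    Assumes P is absolutely continuous with respect to Q,
    i.e. q(x) == 0 implies p(x) == 0.
    """
    total = 0.0
    for px, qx in zip(p, q):
        if qx == 0.0:
            if px != 0.0:
                raise ValueError("P is not absolutely continuous with respect to Q")
            continue  # points of Q-measure zero contribute nothing
        total += qx * f(px / qx)
    return total


def kl_generator(t):
    # f(t) = t ln t, using the convention 0 ln 0 = 0
    return t * math.log(t) if t > 0.0 else 0.0


p = [0.5, 0.5]
q = [0.25, 0.75]
d = f_divergence(p, q, kl_generator)
print(d)  # equals 0.5 * ln(4/3), roughly 0.1438
```

With f(t) = t ln t the sum collapses to Σ p(x) ln(p(x)/q(x)), which is why the result matches the Kullback-Leibler divergence of the two example distributions.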

Instances of f-divergences

Many common divergences, such as KL-divergence, Hellinger distance, and total variation distance, are special cases of f-divergence, coinciding with a particular choice of f. The following table lists many of the common divergences between probability distributions and the f function to which they correspond (cf. Liese & Vajda (2006)).

Divergence                  Corresponding f(t)
KL-divergence               t \ln t, \quad -\ln t
Hellinger distance          (\sqrt{t} - 1)^2, \quad 2(1 - \sqrt{t})
Total variation distance    |t - 1|
\chi^2-divergence           (t - 1)^2, \quad t^2 - 1
α-divergence                \begin{cases}
                              \frac{4}{1-\alpha^2}\big(1 - t^{(1+\alpha)/2}\big), & \text{if}\ \alpha\neq\pm1, \\
                              t \ln t, & \text{if}\ \alpha=1, \\
                              - \ln t, & \text{if}\ \alpha=-1
                            \end{cases}
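As a numerical sanity check, the generators listed above can be plugged into the discrete form of the definition and compared with the familiar closed-form sums. This is a sketch only, assuming finite distributions with strictly positive probabilities; the dictionary names are illustrative. Normalization conventions vary between authors: with the generators as written here, the total variation entry evaluates to Σ|p − q| and the Hellinger entry to Σ(√p − √q)², without a factor of 1/2.

```python
import math


def f_divergence(p, q, f):
    # Discrete form: sum over x of q(x) * f(p(x) / q(x)); assumes q(x) > 0 everywhere.
    return sum(qx * f(px / qx) for px, qx in zip(p, q))


# Generators f(t) taken from the table; the "_alt" entries are the second
# form listed for the same divergence.
GENERATORS = {
    "kl": lambda t: t * math.log(t),
    "hellinger": lambda t: (math.sqrt(t) - 1.0) ** 2,
    "hellinger_alt": lambda t: 2.0 * (1.0 - math.sqrt(t)),
    "total_variation": lambda t: abs(t - 1.0),
    "chi2": lambda t: (t - 1.0) ** 2,
    "chi2_alt": lambda t: t ** 2 - 1.0,
}

p = [0.2, 0.3, 0.5]
q = [0.4, 0.4, 0.2]

via_f = {name: f_divergence(p, q, f) for name, f in GENERATORS.items()}

# Closed-form expressions written directly in terms of p and q.
direct = {
    "kl": sum(px * math.log(px / qx) for px, qx in zip(p, q)),
    "hellinger": sum((math.sqrt(px) - math.sqrt(qx)) ** 2 for px, qx in zip(p, q)),
    "total_variation": sum(abs(px - qx) for px, qx in zip(p, q)),
    "chi2": sum((px - qx) ** 2 / qx for px, qx in zip(p, q)),
}

for name, value in direct.items():
    print(name, via_f[name], value)
```

The two generators listed for the Hellinger distance (and likewise for the χ²-divergence) differ only by a multiple of (t − 1), whose integral against Q vanishes because P and Q both have total mass 1, so each pair yields the same divergence.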

Alpha divergences defined on positive arrays are representational Bregman divergences (cf. Nielsen & Nock (2009)).

Properties

  • Non-negativity: the ƒ-divergence is always non-negative, and it equals zero when the measures P and Q coincide; if f is strictly convex at 1, it is zero only in that case. This follows from Jensen’s inequality:
    
    D_f(P\!\parallel\!Q) = \int \!f\bigg(\frac{dP}{dQ}\bigg)dQ \geq f\bigg( \int\frac{dP}{dQ}dQ\bigg) = f(1) = 0.
  • Monotonicity: if κ is an arbitrary transition probability that transforms the measures P and Q into P_κ and Q_κ respectively, then
    
    D_f(P\!\parallel\!Q) \geq D_f(P_\kappa\!\parallel\!Q_\kappa).
    The equality here holds if and only if the transition is induced from a sufficient statistic with respect to {P, Q}.
  • Convexity: for any 0 ≤ λ ≤ 1
    
    D_f\Big(\lambda P_1 + (1-\lambda)P_2 \parallel \lambda Q_1 + (1-\lambda)Q_2\Big) \leq \lambda D_f(P_1\!\parallel\!Q_1) + (1-\lambda)D_f(P_2\!\parallel\!Q_2).
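The three properties above can be spot-checked numerically on finite distributions. The sketch below is an illustration under stated assumptions, not a proof: it uses the generator f(t) = t ln t, randomly drawn strictly positive distributions, and a transition probability κ represented as a row-stochastic matrix. All helper names are hypothetical.

```python
import math
import random


def f_divergence(p, q, f):
    # Discrete form: sum over x of q(x) * f(p(x) / q(x)); assumes q(x) > 0 everywhere.
    return sum(qx * f(px / qx) for px, qx in zip(p, q))


def kl(t):
    return t * math.log(t)


def random_dist(n, rng):
    # Strictly positive weights, normalized to sum to 1.
    w = [rng.random() + 0.05 for _ in range(n)]
    s = sum(w)
    return [x / s for x in w]


def push_forward(p, kernel):
    # (P_kappa)(y) = sum over x of p(x) * kernel[x][y]; each row of kernel sums to 1.
    return [sum(p[x] * kernel[x][y] for x in range(len(p)))
            for y in range(len(kernel[0]))]


def mix(a, b, lam):
    return [lam * x + (1.0 - lam) * y for x, y in zip(a, b)]


rng = random.Random(0)
p1, q1, p2, q2 = (random_dist(4, rng) for _ in range(4))
kernel = [random_dist(3, rng) for _ in range(4)]  # maps 4 states to 3 states
lam = 0.3
tol = 1e-12  # slack for floating-point rounding

d1 = f_divergence(p1, q1, kl)
d2 = f_divergence(p2, q2, kl)
d_kappa = f_divergence(push_forward(p1, kernel), push_forward(q1, kernel), kl)
d_mix = f_divergence(mix(p1, p2, lam), mix(q1, q2, lam), kl)

nonneg = d1 >= -tol
monotone = d1 + tol >= d_kappa
convex = d_mix <= lam * d1 + (1.0 - lam) * d2 + tol
print(nonneg, monotone, convex)
```

Swapping `kl` for another convex generator with f(1) = 0 should leave all three checks passing, since none of the properties depends on the specific choice of f.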

References