Probabilistic Graphical Models

A brief introduction to probabilistic graphical models, or more precisely, the skeleton of this topic.

In the high-dimensional case, the full representation of a joint distribution is computationally inefficient for many tasks, such as computing marginal and conditional distributions. Instead, we can assume the variables have some structure, so that the joint distribution can be written as a product of a set of local functions:

$$p(x_1, \dots, x_n) = \prod_{i} f_i(x_{C_i}),$$

where each factor $f_i$ depends only on a subset $x_{C_i}$ of the variables.

We use a graph to represent one probability distribution (or a class of probability distributions).

1. Directed Graph

1.1. Parameterization

Here we are talking about the directed acyclic graph (DAG), a.k.a. the Bayesian network in probabilistic graphical models. For a DAG, we define

$$p(x_1, \dots, x_n) = \prod_{i=1}^{n} f_i(x_i, x_{\pi_i}),$$

here $\pi_i$ represents the parents of $x_i$, and each $f_i$ is nonnegative and normalized over $x_i$. We can show that $f_i(x_i, x_{\pi_i})$ is in fact the conditional probability $p(x_i \mid x_{\pi_i})$, and

$$p(x_1, \dots, x_n) = \prod_{i=1}^{n} p(x_i \mid x_{\pi_i}).$$
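
As a concrete illustration (with made-up CPTs, not an example from the text), the following minimal sketch evaluates this factorization for a three-node chain $A \to B \to C$ of binary variables:

```python
# A minimal sketch (toy CPTs): a three-node chain A -> B -> C of binary
# variables, whose joint factorizes as p(a, b, c) = p(a) p(b | a) p(c | b).
p_A = {0: 0.6, 1: 0.4}                       # p(a)
p_B_given_A = {0: {0: 0.7, 1: 0.3},          # p(b | a): outer key is a
               1: {0: 0.2, 1: 0.8}}
p_C_given_B = {0: {0: 0.9, 1: 0.1},          # p(c | b): outer key is b
               1: {0: 0.4, 1: 0.6}}

def joint(a, b, c):
    """Joint probability as the product of the local conditionals."""
    return p_A[a] * p_B_given_A[a][b] * p_C_given_B[b][c]

# Sanity check: the 2^3 joint probabilities sum to one.
total = sum(joint(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1))
print(total)   # 1.0 (up to floating-point error)
```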

1.2. Conditional Independence

If we can define a way to find the conditional independences associated with a given DAG, then any probability distribution associated with this DAG must satisfy all the conditional independences encoded in the graph. However, we say nothing about independences among subsets of variables that are not represented in the graph.

There are three canonical three-node graphs: the chain $X \to Y \to Z$, the fork $X \leftarrow Y \to Z$, and the collider $X \to Y \leftarrow Z$.

In the chain and the fork, $X \perp Z \mid Y$; in the collider, although $X \perp Z$ marginally, $X$ and $Z$ are conditionally dependent given $Y$.

Given the three canonical cases, we can extend them to a graph-search algorithm that answers questions about the conditional independence of any two variables given any subset of variables. This is known as the d-separation algorithm. The key idea behind this algorithm is that if we can find an active path from A to B, then they are dependent; a path may be blocked by the conditioning variables (chain and fork) or activated by them (collider).
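
As an illustration, here is a minimal sketch of the reachability ("Bayes-ball") formulation of d-separation, assuming the DAG is stored as a dict mapping each node to the set of its parents; the three canonical graphs serve as test cases:

```python
def d_separated(parents, x, y, observed):
    """Return True if x and y are d-separated given `observed` in a DAG.

    `parents` maps each node to the set of its parents.
    """
    children = {n: set() for n in parents}
    for n, ps in parents.items():
        for p in ps:
            children[p].add(n)
    observed = set(observed)

    # Ancestors of the observed set (observed nodes included): a collider is
    # activated iff it or one of its descendants is observed.
    anc, frontier = set(observed), list(observed)
    while frontier:
        n = frontier.pop()
        for p in parents[n]:
            if p not in anc:
                anc.add(p)
                frontier.append(p)

    # Search over (node, direction) states; "up" means the trail enters the
    # node from one of its children, "down" means it enters from a parent.
    visited, reachable = set(), set()
    frontier = [(x, "up")]
    while frontier:
        node, direction = frontier.pop()
        if (node, direction) in visited:
            continue
        visited.add((node, direction))
        if node not in observed:
            reachable.add(node)
        if direction == "up" and node not in observed:
            frontier += [(p, "up") for p in parents[node]]      # chain / fork
            frontier += [(c, "down") for c in children[node]]
        elif direction == "down":
            if node not in observed:                            # chain: pass through
                frontier += [(c, "down") for c in children[node]]
            if node in anc:                                     # activated collider
                frontier += [(p, "up") for p in parents[node]]
    return y not in reachable


# The three canonical graphs on {X, Y, Z}:
chain = {"X": set(), "Y": {"X"}, "Z": {"Y"}}           # X -> Y -> Z
fork = {"X": {"Y"}, "Y": set(), "Z": {"Y"}}            # X <- Y -> Z
collider = {"X": set(), "Y": {"X", "Z"}, "Z": set()}   # X -> Y <- Z

print(d_separated(chain, "X", "Z", {"Y"}))      # True: blocked by Y
print(d_separated(fork, "X", "Z", {"Y"}))       # True: blocked by Y
print(d_separated(collider, "X", "Z", set()))   # True: marginally independent
print(d_separated(collider, "X", "Z", {"Y"}))   # False: activated by Y
```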

1.3. Characterization

For a given directed graph, we define a family of distributions by ranging over all possible choices of local conditional probabilities $p(x_i \mid x_{\pi_i})$.

We can also define another family of distributions by gathering all the distributions that satisfy the conditional independences asserted by the graph (via d-separation). It can be shown that these two families are equivalent.

2. Undirected Graph

2.1. Parameterization

An undirected graphical model is also known as a Markov random field (MRF). We define

$$p(x_1, \dots, x_n) = \frac{1}{Z} \prod_{C \in \mathcal{C}} \psi_C(x_C), \qquad Z = \sum_{x} \prod_{C \in \mathcal{C}} \psi_C(x_C),$$

where $Z$ is the normalization constant (the partition function). Note that each clique $C$ consists of nodes that are fully connected; we call $\psi_C$ a potential function. Usually the cliques $C$ are taken to be the maximal cliques.

Since a potential function must be nonnegative, we can remove this restriction by writing it in exponential form:

$$\psi_C(x_C) = \exp\{-E_C(x_C)\},$$

where the energy $E_C$ is unconstrained.
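
For instance, a minimal sketch (toy potentials, not an example from the text) that evaluates the product over maximal cliques and computes $Z$ by brute force for a three-node chain:

```python
import itertools

# A minimal sketch (toy potentials): a 3-node chain MRF x1 - x2 - x3 with
# binary variables and potentials on the maximal cliques {x1, x2}, {x2, x3}.
def psi(a, b):
    return 4.0 if a == b else 1.0          # favours equal neighbouring values

def unnormalized(x1, x2, x3):
    return psi(x1, x2) * psi(x2, x3)

# Partition function Z: brute-force sum over all 2^3 configurations.
Z = sum(unnormalized(*x) for x in itertools.product((0, 1), repeat=3))

def p(x1, x2, x3):
    return unnormalized(x1, x2, x3) / Z

# Marginal of x1, also by brute force; the two values sum to 1.
p_x1 = [sum(p(v, x2, x3) for x2, x3 in itertools.product((0, 1), repeat=2))
        for v in (0, 1)]
print(Z, p(0, 0, 0), p_x1)
```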

2.2. Conditional Independence

The conditional independence in the undirected case is simply naive graph-theoretic separation: $X_A \perp X_B \mid X_C$ whenever every path from a node in $A$ to a node in $B$ passes through $C$.

2.3. Characterization

For a given undirected graph, we define a family of distributions by ranging over all possible choices of positive potential functions on the maximal cliques of the graph.

We can also define another family of distributions by gathering all the distributions that satisfy the conditional independences associated with the graph.

The Hammersley-Clifford theorem states that these two families are equivalent.

Notice that, in general, we cannot convert an arbitrary directed model into an undirected one that encodes exactly the same conditional independences, or vice versa.

3. Factor Graph

3.1. Parameterization

A factor graph consists of factor nodes $f_s$ and variable nodes $x_i$. Factor nodes only connect to variable nodes and vice versa (the graph is bipartite). The joint probability is defined as

$$p(x_1, \dots, x_n) = \frac{1}{Z} \prod_{s} f_s(x_{N(s)}).$$

Here $x_{N(s)}$ denotes all the neighbouring variable nodes of the factor node $f_s$.
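
A minimal sketch of one possible encoding (assumed representation, not from the text): each factor stores the variable nodes it connects to, and the joint is the product over factors.

```python
import itertools

# Factor graph over binary variables x1, x2, x3 with two factors
# f_a(x1, x2) and f_b(x2, x3); each factor lists its neighbouring variables.
factors = {
    "f_a": (("x1", "x2"), lambda x1, x2: 4.0 if x1 == x2 else 1.0),
    "f_b": (("x2", "x3"), lambda x2, x3: 4.0 if x2 == x3 else 1.0),
}
variables = ("x1", "x2", "x3")

def unnormalized(assignment):
    """Product of all factors evaluated on their neighbouring variables."""
    value = 1.0
    for neighbours, f in factors.values():
        value *= f(*(assignment[v] for v in neighbours))
    return value

# Normalization constant by brute force over all assignments.
Z = sum(unnormalized(dict(zip(variables, xs)))
        for xs in itertools.product((0, 1), repeat=len(variables)))
print(Z)
```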

3.2. Relation with BN and MRF

We can convert any BN or MRF into a factor graph. Details can be found in An Introduction to Probabilistic Graphical Models, Chapter 4.

4. Sum-Product Algorithm

We now care about marginalization, i.e.,

$$p(x_i) = \sum_{x_j,\; j \neq i} p(x_1, \dots, x_n).$$

Notice that any conditional probability is also closely connected with marginalization:

$$p(x_F \mid \bar{x}_E) \propto \sum_{x_E} p(x_F, x_E)\, \delta(x_E, \bar{x}_E),$$

here $\delta(x_E, \bar{x}_E)$ is the impulse function whose support is a one-point set located at $\bar{x}_E$.
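
A small illustration (toy joint distribution, not from the text) of conditioning as marginalization against an indicator:

```python
import itertools

# Toy normalized joint over three binary variables.
table = {(0, 0, 0): .15, (0, 0, 1): .05, (0, 1, 0): .10, (0, 1, 1): .20,
         (1, 0, 0): .05, (1, 0, 1): .15, (1, 1, 0): .10, (1, 1, 1): .20}

def joint(x1, x2, x3):
    return table[(x1, x2, x3)]

def conditional_x1_given_x3(x3_obs):
    """p(x1 | x3 = x3_obs), computed by summing against an indicator on x3."""
    indicator = lambda x3: 1.0 if x3 == x3_obs else 0.0
    unnorm = [sum(joint(x1, x2, x3) * indicator(x3)
                  for x2, x3 in itertools.product((0, 1), repeat=2))
              for x1 in (0, 1)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

print(conditional_x1_given_x3(1))   # p(x1 | x3 = 1)
```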

4.1. Sum-Product on a Tree

First notice that we can ignore the directionality of a DAG and treat it as undirected when doing marginalization. In a tree, the cliques consist only of single nodes and pairs of nodes (edges). Therefore,

$$p(x) = \frac{1}{Z} \prod_{i \in V} \psi(x_i) \prod_{(i, j) \in E} \psi(x_i, x_j)$$

on a tree with nodes $V$ and edges $E$. Therefore, if we are trying to get $p(x_r)$, we can set $x_r$ as the root and eliminate from the leaves inward. The idea comes from the calculation order below (written here for a chain $x_r - x_1 - x_2 - \cdots$):

$$p(x_r) \propto \psi(x_r) \sum_{x_1} \psi(x_r, x_1)\, \psi(x_1) \sum_{x_2} \psi(x_1, x_2)\, \psi(x_2) \cdots$$

We can add an impulse function $\delta(x_E, \bar{x}_E)$ to the product to obtain conditional probabilities as well.

Now set the message-passing protocol as follows: a node can send a message to a neighbouring node when and only when it has received messages from all of its other neighbours. Here the message is the intermediate factor obtained by summing out the variables on the sender's side of the edge.

In a tree, we simply send messages from the leaves to the root and then back from the root to the leaves; afterwards we can obtain all the marginal distributions (any node can be taken as the root).

The message is defined as

$$m_{j \to i}(x_i) = \sum_{x_j} \psi(x_j)\, \psi(x_i, x_j) \prod_{k \in N(j) \setminus \{i\}} m_{k \to j}(x_j),$$

and the marginal of node $i$ is read off as $p(x_i) \propto \psi(x_i) \prod_{j \in N(i)} m_{j \to i}(x_i)$.
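
A minimal sketch (toy potentials, not the notes' own example) of this message passing on a three-node chain, checked against brute-force marginalization:

```python
import itertools
import numpy as np

# Chain x0 - x1 - x2 with binary variables, pairwise potentials psi(x_i, x_j)
# and node potentials psi(x_i).
pairwise = np.array([[4.0, 1.0],
                     [1.0, 4.0]])                  # favours equal neighbours
node_pot = [np.array([2.0, 1.0]),                  # psi(x0) biases x0 toward 0
            np.ones(2), np.ones(2)]
edges = [(0, 1), (1, 2)]
neighbours = {0: [1], 1: [0, 2], 2: [1]}

# messages[(j, i)] is m_{j -> i}(x_i).  On a chain, sweeping leaves -> root ->
# leaves respects the protocol (send only after hearing from all other neighbours).
messages = {}
for j, i in [(0, 1), (1, 2), (2, 1), (1, 0)]:
    incoming = node_pot[j].copy()
    for k in neighbours[j]:
        if k != i:
            incoming *= messages[(k, j)]
    # m_{j->i}(x_i) = sum_{x_j} psi(x_i, x_j) psi(x_j) prod_k m_{k->j}(x_j)
    messages[(j, i)] = pairwise @ incoming

# Marginal of node i: psi(x_i) times the product of incoming messages.
for i in range(3):
    belief = node_pot[i].copy()
    for j in neighbours[i]:
        belief *= messages[(j, i)]
    print("p(x%d):" % i, belief / belief.sum())

# Brute-force check of p(x0).
def unnorm(x):
    val = np.prod([node_pot[i][x[i]] for i in range(3)])
    return val * np.prod([pairwise[x[a], x[b]] for a, b in edges])
Z = sum(unnorm(x) for x in itertools.product((0, 1), repeat=3))
print("p(x0) by brute force:",
      [sum(unnorm(x) for x in itertools.product((0, 1), repeat=3) if x[0] == v) / Z
       for v in (0, 1)])
```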

4.2. Sum-Product in Factor Tree Graph

Just one note: the message sent from a factor node to a variable node in the converted factor tree graph is the same as the message sent between the corresponding variable nodes in the pre-converted undirected graph.

4.3. Sum-Product in Tree-like Graph

We can treat such a clique as one single high-dimensional variable and then use the tree sum-product algorithm.

4.4. Sum-Product in General Case

We can use the same protocol in the general case and send messages again and again (loopy belief propagation). It can be shown that, in general, sum-product does not converge to the true marginals, but:

A set of beliefs gives a sum-product fixed point in any graph G if and only if the beliefs are stationary points of the Bethe free energy.

For the details and the definition of the Bethe free energy, refer to Joan Bruna's lecture notes.

5. Max-Product

What happens if we want to compute the maximum a posteriori (MAP) assignment instead of performing marginalization?

In fact, we can use the same algorithm with max in place of sum. The logic behind this is that

$$\max(ab, ac) = a \max(b, c) \quad \text{for } a \geq 0,$$

which plays the same distributive role as $ab + ac = a(b + c)$: the ("sum", "product") and ("max", "product") pairs are both examples of an algebraic structure known as a commutative semiring.

6. Reference

Michael I. Jordan, An Introduction to Probabilistic Graphical Models (unpublished manuscript).

Joan Bruna, lecture notes.
