Bayesian Networks

Consider a joint probability distribution over discrete multivalued random variables. If we know the actual probability for each outcome of the form P(A=a,B=b,C=c,...), then we know the complete distribution, and we could compute any marginalized or conditional probability. Unfortunately, if there are n variables, and each has b possible values, the joint probability distribution has b^n distinct outcomes; for example, 30 Boolean variables already produce 2^30, more than a billion, outcomes. For large systems, it would be prohibitive to model and store the entire distribution in that way, and there also might not be sufficient empirical data for each outcome.

If we knew that the random variables were mutually independent, then we would need to store only the n probabilities of the form P(A). All other quantities, such as P(A,B,C), could then be computed easily. Of course, many interesting distributions do not involve mutually independent variables.
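As a small illustration (a Python sketch with made-up marginal probabilities), mutual independence lets us reconstruct any joint probability as a product of the n stored marginals:

```python
# Hypothetical marginals P(X = True) for three mutually independent
# Boolean variables; the values are made up for illustration.
p = {'A': 0.3, 'B': 0.5, 'C': 0.9}

def joint(assignment):
    """P(A=a, B=b, C=c) as a product of the stored marginals."""
    result = 1.0
    for var, val in assignment.items():
        result *= p[var] if val else 1 - p[var]
    return result

print(round(joint({'A': True, 'B': False, 'C': True}), 3))  # 0.135
```

Storing three numbers here suffices to answer all 2^3 atomic queries; with dependent variables, no such shortcut exists.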

We wish to again consider how to model a joint distribution for which we have some partial information, and some presumption of conditional dependencies, while still being able to answer queries about various conditional probabilities. We look at a very prominent approach known as a Bayesian Network, which combines some basic domain knowledge about variable independence to produce a more sparse representation of a distribution.

Classic Example: Bob has a house alarm, and has neighbors John and Mary who will call him if they think they hear the alarm. The alarm might be set off by a burglar, or by a (rare) earthquake. We introduce the following Boolean variables as notation for the model:

B = a burglary occurs
E = an earthquake occurs
A = the alarm sounds
J = John calls
M = Mary calls

Let's further assume that the following probabilities are known based on experience:

P(B) = 0.001
P(E) = 0.002
P(A | B,E) = 0.95,   P(A | B,¬E) = 0.94,   P(A | ¬B,¬E) = 0.001
P(J | A) = 0.90,   P(J | ¬A) = 0.05
P(M | A) = 0.70

What if we are interested in answering queries such as P(J | B), the probability that John calls given that a burglary occurs?

Without any more knowledge about the system, we cannot give definitive answers to these queries (although we might apply techniques to estimate them).

But what if we add additional assumptions based on knowledge of the domain: that burglaries and earthquakes occur independently, and that John and Mary each decide whether to call based only on the alarm, with neither aware of the other's actions?

Conditional Independence
We originally defined A and B to be pairwise independent if
P(A | B) = P(A).
For example, we are assuming that earthquakes and burglaries are independent, thus
P(B | E) = P(B) = 0.001

However, the assumption that John and Mary do not know about each other's actions does not mean that the variables J and M are independent.

P(M) ≠ P(M | J)
(Why is that?)

To express our assumption about John and Mary, we need a new definition of conditional independence, as follows. Variables A and B are conditionally independent, given C, if
P(A | B,C) = P(A | C)
which is equivalent to each of the following statements:
P(B | A,C) = P(B | C)
P(A,B | C) = P(A | C) * P(B | C)
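These equivalent formulations can be checked numerically. The following Python sketch builds a small made-up joint distribution that factors as P(C) * P(A | C) * P(B | C), so A and B are conditionally independent given C by construction:

```python
# A tiny joint distribution over Booleans A, B, C; the numbers are
# made up for illustration.
pC = {True: 0.3, False: 0.7}     # P(C)
pA_C = {True: 0.8, False: 0.2}   # P(A=True | C)
pB_C = {True: 0.6, False: 0.1}   # P(B=True | C)

def p(a, b, c):
    """Joint P(A=a, B=b, C=c) under the factorization P(C)P(A|C)P(B|C)."""
    pa = pA_C[c] if a else 1 - pA_C[c]
    pb = pB_C[c] if b else 1 - pB_C[c]
    return pC[c] * pa * pb

def cond(a, b, c):
    """P(A=a | B=b, C=c), computed directly from the joint."""
    return p(a, b, c) / sum(p(x, b, c) for x in (True, False))

def condA(a, c):
    """P(A=a | C=c), computed directly from the joint."""
    num = sum(p(a, y, c) for y in (True, False))
    den = sum(p(x, y, c) for x in (True, False) for y in (True, False))
    return num / den

# Conditioning on B changes nothing once C is known:
assert abs(cond(True, True, True) - condA(True, True)) < 1e-12
```

Deleting the assert and conditioning without C (that is, comparing P(A | B) to P(A)) would show that A and B are not unconditionally independent in this distribution.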

We can now use this definition to describe what we want to say about John and Mary: J and M are conditionally independent given A, that is,
P(J | M,A) = P(J | A)

Bayesian Network
We will use a graph to compactly represent our knowledge about the domain, with a vertex for each variable, and a directed edge from one variable to another that it directly influences.

(Figure 7.11 in text is much prettier than I'm willing to draw)

B       E
 \     /
  \   /
   | |
   v v
    A
   / \
  /   \
 |     |
 v     v
 J     M

With each node, we specify a Conditional Probability Table (CPT), giving the distribution of that variable conditioned on each combination of values of its parents.

The beauty of a Bayesian Network is that it gives a compact representation for the entire joint probability distribution (assuming the assumptions about independence and conditional probabilities are valid). In particular, we can compute any specific value of the joint probability distribution, as:

P(x1, x2, ... xn) = Πi P(xi | parents(Xi))
This equation is effectively a result of the chain rule, and a theorem about conditional independence (given below).

Example:
P(J,M,A,¬B,¬E) = P(J | A) * P(M | A) * P(A | ¬B,¬E) * P(¬B) * P(¬E)
= 0.90 * 0.70 * 0.001 * 0.999 * 0.998 = 0.000628
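This computation can be sketched in Python. Note that the CPT entries P(A | ¬B,E) = 0.29 and P(M | ¬A) = 0.01 below are assumed standard textbook values, not stated in these notes (neither is needed for this particular example):

```python
# CPTs for the alarm network, as probabilities that each variable is True.
pB = 0.001
pE = 0.002
pA = {(True, True): 0.95, (True, False): 0.94,
      (False, True): 0.29,     # assumed value, not given in the notes
      (False, False): 0.001}
pJ = {True: 0.90, False: 0.05}
pM = {True: 0.70, False: 0.01}  # P(M | ¬A) = 0.01 is an assumed value

def prob(v, p):
    """Probability that a Boolean variable takes value v, given P(True) = p."""
    return p if v else 1 - p

def joint(b, e, a, j, m):
    """Atomic joint probability: product of each variable given its parents."""
    return (prob(b, pB) * prob(e, pE) * prob(a, pA[(b, e)])
            * prob(j, pJ[a]) * prob(m, pM[a]))

# The worked example: P(J, M, A, ¬B, ¬E)
print(round(joint(False, False, True, True, True), 6))  # 0.000628
```

Each factor in joint() corresponds to one term of the product formula above, with parents(A) = {B, E} and parents(J) = parents(M) = {A}.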

Once we have the ability to compute atomic probabilities in the joint distribution, we can marginalize and condition to compute any probability of interest. Often this can be done far more directly than by working bottom-up from the atomic probabilities in the joint distribution.

Example: Compute P(J | B).
By definition this is P(J,B) / P(B).
We are given P(B) as an unconditional probability. Marginalizing over A,
P(J,B) = P(J,B,A) + P(J,B,¬A).

We can further apply the chain rule on term P(J,B,A) as

P(J,B,A) = P(J | B,A) * P(A | B) * P(B).
We note that P(J | B,A) = P(J | A) due to conditional independence of J and B given A. Similarly, we have P(J,B,¬A) = P(J | ¬A) * P(¬A | B) * P(B). Thus far, we have that
P(J | B) = ( P(J | A) * P(A | B) * P(B) + P(J | ¬A) * P(¬A | B) * P(B) ) / P(B)
= P(J | A) * P(A | B) + P(J | ¬A) * P(¬A | B)

We still need P(A | B). We can marginalize with earthquake to have

P(A | B) = ( P(A,B,E) + P(A,B,¬E) ) / P(B)
= ( P(A | B,E)*P(B)*P(E) + P(A | B,¬E)*P(B)*P(¬E) ) / P(B)
= P(A | B,E)*P(E) + P(A | B,¬E)*P(¬E)
= 0.95 * 0.002 + 0.94 * 0.998 ≈ 0.94.
We can similarly get P(¬A | B) = 0.06, and the original conclusion that
P(J | B) = 0.9 * 0.94 + 0.05 * 0.06 = 0.849
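The hand derivation can be checked by brute-force enumeration over the joint distribution, summing atomic probabilities. In this self-contained Python sketch, the CPT entries P(A | ¬B,E) = 0.29 and P(M | ¬A) = 0.01 are assumed standard textbook values not given in the notes; neither affects this query, since B is fixed to True and M marginalizes out:

```python
from itertools import product

# P(A = True | B, E); the (False, True) entry is an assumed value.
pA = {(True, True): 0.95, (True, False): 0.94,
      (False, True): 0.29, (False, False): 0.001}

def pr(v, p):
    """Probability that a Boolean variable takes value v, given P(True) = p."""
    return p if v else 1 - p

def joint(b, e, a, j, m):
    # P(M | ¬A) = 0.01 is an assumed value, not stated in the notes.
    return (pr(b, 0.001) * pr(e, 0.002) * pr(a, pA[(b, e)])
            * pr(j, 0.90 if a else 0.05) * pr(m, 0.70 if a else 0.01))

# P(J | B) = P(J, B) / P(B), both obtained by summing atomic probabilities.
num = sum(joint(True, e, a, True, m)
          for e, a, m in product((True, False), repeat=3))
den = sum(joint(True, e, a, j, m)
          for e, a, j, m in product((True, False), repeat=4))
print(round(num / den, 3))  # 0.849
```

The enumeration agrees with the algebraic shortcut above, but touches every relevant atomic outcome; the hand derivation exploited conditional independence to avoid that work.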

Theorem: The variable xi for a node is conditionally independent of all non-descendant nodes, given its parents. That is, for any variable xj at a non-descendant node,

P(xi,xj | parents(Xi)) = P(xi | parents(Xi)) * P(xj | parents(Xi))

Markov blanket: We define the Markov blanket of a node to be a node's parents, children, and children's parents.
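Given the graph structure, a Markov blanket is easy to compute. A minimal Python sketch, using the alarm network's edges:

```python
# Parent lists encode the directed edges of the alarm network.
parents = {'B': [], 'E': [], 'A': ['B', 'E'], 'J': ['A'], 'M': ['A']}

def markov_blanket(x):
    """Parents, children, and children's other parents of node x."""
    children = [c for c, ps in parents.items() if x in ps]
    blanket = set(parents[x]) | set(children)
    for c in children:
        blanket |= set(parents[c])   # children's parents
    blanket.discard(x)               # a node is not in its own blanket
    return blanket

print(sorted(markov_blanket('A')))  # ['B', 'E', 'J', 'M']
```

For A, the blanket is the entire rest of this small network; for J, it is just {A}, since J has no children.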

Theorem: The variable for a node is conditionally independent of all other variables in the network, given its Markov blanket.


Michael Goldwasser
Last modified: Tuesday, 29 October 2013