We will begin with some basic coverage of concepts and terminology from discrete probability.
A discrete event space, Ω, is an enumerable set of mutually exclusive elementary events. For example, when rolling a standard die, the elementary events are the six faces, so Ω = {1, 2, 3, 4, 5, 6}.
More generally, we can define compound events that are
sets of elementary events. For example, when rolling a standard die,
we can consider the event A to correspond to rolling an even
value, thus A = {2, 4, 6}.
| Set Notation | Logic Notation | Description |
|---|---|---|
| A ∩ B | A ∧ B | Intersection |
| A ∪ B | A ∨ B | Union |
| Ω | true | Certain event |
| ∅ | false | Impossible event |
We let notation P(A) represent the probability that event A occurs. We then have the following facts:
P(Ω) = 1
In effect, if events A1, A2, ..., An are the elementary events of the space, then
Σi P(Ai) = 1
P(∅) = 0
P(A ∨ B) = P(A) + P(B) - P(A ∧ B)
Note that we must avoid overcounting outcomes in which both A
and B occur.
If A and B are mutually exclusive events (meaning that they
cannot both occur together), then we have the simpler formula
P(A ∨ B) = P(A) + P(B)
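For instance (a minimal Python sketch, assuming a fair six-sided die), we can verify the union formula for the events "rolling an even value" and "rolling at least 5":

```python
from fractions import Fraction

# Elementary events of a fair die, each with probability 1/6 (an assumption).
prob = {face: Fraction(1, 6) for face in range(1, 7)}

A = {2, 4, 6}   # even value
B = {5, 6}      # value of at least 5

def P(event):
    return sum(prob[e] for e in event)

# P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
print(P(A | B))                  # 2/3
print(P(A) + P(B) - P(A & B))    # 2/3, same value
```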
Joint Probability Distribution
We will often consider two or more random variables and their
outcomes. It is common to use the notation P(A,B) to mean P(A ∧ B).
For Boolean-valued events, the joint distribution space can be described using four possible outcomes: P(A,B), P(A,¬B), P(¬A,B), and P(¬A,¬B). A common notation is to denote the entire distribution as **P**(A,B), with a bold P as opposed to the usual P. A similar approach can be used for multivalued discrete random variables. For example, we could consider the 12 outcomes that arise if we combine one coin flip with one roll of a die.
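As a quick illustration (a minimal Python sketch; the fair coin, fair die, and their independence are assumptions made for the example), the 12 combined outcomes can be enumerated directly:

```python
from itertools import product

# Assume a fair coin and a fair die.
coin = {"H": 0.5, "T": 0.5}
die = {face: 1 / 6 for face in range(1, 7)}

# Joint distribution over the 12 combined outcomes, assuming the coin flip
# and the die roll are independent.
joint = {(c, d): coin[c] * die[d] for c, d in product(coin, die)}

print(len(joint))                          # 12 outcomes
print(round(sum(joint.values()), 6))       # probabilities sum to 1.0
```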
Example: Consider a population of only those who come to a doctor with acute stomach pain. Let event App represent that a patient has acute appendicitis, and event Leuko represent that a patient's leukocyte value is greater than some threshold. We can describe the joint probability distribution as
| P(App,Leuko) | App | ¬App |
|---|---|---|
| Leuko | 0.23 | 0.31 |
| ¬Leuko | 0.05 | 0.41 |
Marginalization
If given the entire joint distribution, it is easy to eliminate a
variable by taking the sum of probabilities over all possible values
of that variable.
For example, we can determine P(App) = P(App,Leuko) + P(App,¬Leuko) = 0.23 + 0.05 = 0.28
| P(App,Leuko) | App | ¬App | Total |
|---|---|---|---|
| Leuko | 0.23 | 0.31 | 0.54 |
| ¬Leuko | 0.05 | 0.41 | 0.46 |
| Total | 0.28 | 0.72 | 1 |
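A minimal Python sketch of marginalization over the joint table above (the dictionary encoding is just one convenient representation):

```python
# Joint distribution P(App, Leuko) from the table above,
# keyed by (app, leuko) truth values.
joint = {
    (True, True): 0.23,    # P(App, Leuko)
    (False, True): 0.31,   # P(¬App, Leuko)
    (True, False): 0.05,   # P(App, ¬Leuko)
    (False, False): 0.41,  # P(¬App, ¬Leuko)
}

# Marginalize out Leuko to get P(App), and App to get P(Leuko).
p_app = sum(p for (app, leuko), p in joint.items() if app)
p_leuko = sum(p for (app, leuko), p in joint.items() if leuko)

print(round(p_app, 2), round(p_leuko, 2))  # 0.28 0.54
```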
Conditional Probability
P(A) represents what is known as the
a priori probability of event A; that is, the
probability of A occurring, without any additional information.
However, it is common to consider a conditional probability
P(A | B), which is the probability that A occurs, given knowledge that
event B occurs.
That is, we restrict the event space to only those elementary events for which B is true, and then we consider the probability of A. (If B occurs with probability 0, then the conditional probability is undefined).
P(A | B) = P(A ∧ B) / P(B).
Example: The speed of 100 vehicles on a particular road is measured, as well as whether the driver is a student. Outcomes were:
| Event | Frequency | Relative Frequency |
|---|---|---|
| Vehicle observed | 100 | 1 |
| Driver is student (S) | 30 | 0.3 |
| Car is speeding (G) | 10 | 0.1 |
| Speeding student (S ∧ G) | 5 | 0.05 |
What is P(G | S)? That is, what is the probability of speeding, given that it is a student driver?
P(G | S) = P(S ∧ G) / P(S) = 0.05 / 0.3 ≈ 0.17
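The same computation can be done directly from the observed counts (a small Python sketch using the frequencies in the table above):

```python
total = 100              # vehicles observed
n_student = 30           # driver is a student (S)
n_speeding_student = 5   # speeding student (S ∧ G)

# P(G | S) = P(S ∧ G) / P(S); since both use the same denominator,
# this reduces to a ratio of counts.
p_g_given_s = (n_speeding_student / total) / (n_student / total)
print(round(p_g_given_s, 2))  # 0.17
```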
Another example: revisit appendicitis and leukocytes. The joint
distribution that the above table describes is actually
P(App,Leuko | StomachPain)
as those statistics were gathered only over that segment of the
population. But for our discussion, we will just consider StomachPain
as a presumption of our event space.
What is P(Leuko | App)?
P(Leuko | App) = P(Leuko ∧ App) / P(App) = 0.23 / 0.28 ≈ 0.82
Thus, 82% of appendicitis cases demonstrate elevated leukocytes.
A doctor would be more interested in the diagnostic direction. What is P(App | Leuko)?
P(App | Leuko) = P(Leuko ∧ App) / P(Leuko) = 0.23 / 0.54 ≈ 0.43
So even for a patient exhibiting stomach pain and with elevated leukocytes, appendicitis is still less likely than not.
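A small Python sketch (using the joint table above; the `conditional` helper is just an illustrative name) that reproduces both conditional probabilities:

```python
# Joint distribution P(App, Leuko), keyed by (app, leuko) truth values.
joint = {
    (True, True): 0.23, (False, True): 0.31,
    (True, False): 0.05, (False, False): 0.41,
}

def conditional(joint, event, given):
    """P(event | given), where event and given are predicates on outcomes."""
    p_given = sum(p for outcome, p in joint.items() if given(outcome))
    p_both = sum(p for outcome, p in joint.items()
                 if given(outcome) and event(outcome))
    return p_both / p_given

# Outcomes are (app, leuko) pairs.
print(round(conditional(joint, lambda o: o[1], lambda o: o[0]), 2))  # P(Leuko | App) ≈ 0.82
print(round(conditional(joint, lambda o: o[0], lambda o: o[1]), 2))  # P(App | Leuko) ≈ 0.43
```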
Independence
Note well that a conditional probability does not describe a causal
effect between two events; it is purely a statistical measurement.
With that said, we say that two events A and B are independent
(again, in a statistical sense), if
P(A | B) = P(A)
Note that in that case
P(A) = P(A | B) = P(A ∧ B) / P(B), and therefore,
P(A ∧ B) = P(A) * P(B).
We can now compute
P(B | A) = P(A ∧ B) / P(A) = P(A) * P(B) / P(A) = P(B).
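As a quick check (a minimal Python sketch using the appendicitis joint table), App and Leuko are clearly not independent:

```python
# Joint distribution P(App, Leuko) from the appendicitis example.
joint = {
    (True, True): 0.23, (False, True): 0.31,
    (True, False): 0.05, (False, False): 0.41,
}

p_app = sum(p for (app, _), p in joint.items() if app)        # 0.28
p_leuko = sum(p for (_, leuko), p in joint.items() if leuko)  # 0.54
p_both = joint[(True, True)]                                  # 0.23

# Independent only if P(App ∧ Leuko) equals P(App) * P(Leuko).
print(p_both, round(p_app * p_leuko, 4))  # 0.23 vs. 0.1512 -> not independent
```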
Chain Rule
For any events A1, A2, ..., An,
P(A1, A2, ..., An) = P(A1) * P(A2 | A1) * P(A3 | A1,A2) * ... * P(An | A1, ..., An-1)
The chain can be computed in any order. Of course, if all the events
are mutually independent, then
P(A1, A2, ..., An) = P(A1) * P(A2) * ... * P(An)
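A tiny numeric check of the two-event chain rule against the appendicitis table (values taken from the marginal and conditional computations above):

```python
# Two-event chain rule: P(App, Leuko) = P(App) * P(Leuko | App).
p_app = 0.28                      # marginal from the table
p_leuko_given_app = 0.23 / 0.28   # conditional computed earlier
print(round(p_app * p_leuko_given_app, 2))  # 0.23, the joint table entry
```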
Bayes Theorem
For any A and B,
P(A | B) = P(B | A) * P(A) / P(B)
This turns out to be a very useful fact for computing conditional probabilities in practice, when P(B | A) is well understood, but P(A | B) is not.
Example: consider again the diagnosis question, P(App | Leuko).
In the real world, we expect that there is no causation by which a high
leukocyte level causes appendicitis. It is instead likely that
appendicitis often causes a high leukocyte level. But there are
also many other things beyond appendicitis that could cause a high
leukocyte level. Furthermore, those factors could vary greatly across local
communities and over time. So a value of P(App | Leuko) measured
empirically in one setting may not be reliable or transferable.
On the other hand, there may be a more universal understanding of
P(Leuko | App), since it reflects the underlying physiology, and of the base rate P(App).
Locally, a more general patient population can be used to produce current statistics for P(Leuko) in a community. A doctor can then combine these values to compute
P(App | Leuko) = P(Leuko | App) * P(App) / P(Leuko)
Bayes theorem is also used to compute a more reliable probability estimate for events that may be relatively rare (and thus have less reliable empirical evidence), based on better-evidenced values.
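A minimal Python sketch of this computation (the three input values are read off the joint table above; in practice P(Leuko | App) and P(App) might come from published studies and P(Leuko) from local statistics):

```python
def bayes(p_b_given_a, p_a, p_b):
    """P(A | B) computed via Bayes theorem."""
    return p_b_given_a * p_a / p_b

# Values recoverable from the appendicitis joint table.
p_leuko_given_app = 0.23 / 0.28   # ≈ 0.82
p_app = 0.28
p_leuko = 0.54

print(round(bayes(p_leuko_given_app, p_app, p_leuko), 2))  # 0.43
```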
Probabilistic Reasoning
Question: If you know that P(A) = 0.3 and you know that P(B | A) = 0.6, what would you predict for P(B)?
Answer: There is not enough information to determine P(B).
More generally, assume that
P(A) = α
P(B | A) = β
Consider the full joint distribution
P(A,B) = p1
P(A,¬B) = p2
P(¬A,B) = p3
P(¬A,¬B) = p4
What do we know?
| Constraint | Justification |
|---|---|
| p1 + p2 = α | as P(A) = P(A,B) + P(A,¬B) |
| p1 = αβ | as P(A,B) = P(B \| A) * P(A) |
| p1 + p2 + p3 + p4 = 1 | as P(Ω) = 1 |
Strategy: Maximize Entropy
When a system of equations involving a joint probability distribution is under-constrained, yet there is a desire to estimate that distribution, one strategy is to pick a consistent solution that maximizes an information theoretic measure known as entropy.
Informally, the entropy of a probability distribution is a measure of the amount of randomness in the system. For a discrete probability distribution p, entropy H(p) is defined as
H(p) = Σi -pi ln pi
Since each 0 ≤ pi ≤ 1, the term ln pi is non-positive, so each contribution -pi ln pi is non-negative (using the convention that 0 ln 0 = 0), and thus H(p) ≥ 0.
The idea is that if some probabilities are known but others aren't, a practical technique for estimating the unknowns is to select them so as to maximize the entropy of the system (informally, hoping to insert as little artificial information as possible into the system).
In general, it is a non-trivial mathematical problem to determine the solution that maximizes entropy (requiring constrained optimization and other techniques that we will not consider in this class). But there are systems that perform such computations.
Returning to the above example, we know the correct values p1 = αβ and p2 = α(1-β), but not p3 and p4. It turns out that the maximum entropy prediction is achieved (due to symmetry) with p3 = p4 = (1-α)/2. Returning to our original goal of estimating P(B), we have
P(B) = p1 + p3 = αβ + (1-α)/2 = α(β - 0.5) + 0.5 = 0.3(0.1) + 0.5 = 0.53
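As a numeric sanity check (a minimal Python sketch; a brute-force grid search, used only for illustration rather than as a general solution method), we can fix p1 and p2, split the remaining mass between p3 and p4, and confirm that entropy peaks at the even split:

```python
import math

alpha, beta = 0.3, 0.6
p1 = alpha * beta            # P(A, B), known
p2 = alpha * (1 - beta)      # P(A, ¬B), known
remaining = 1 - p1 - p2      # mass to split between p3 and p4

def entropy(ps):
    return -sum(p * math.log(p) for p in ps if p > 0)

# Sweep p3 over a grid; p4 is determined by the normalization constraint.
best_p3 = max((i / 1000 * remaining for i in range(1001)),
              key=lambda p3: entropy([p1, p2, p3, remaining - p3]))

print(round(best_p3, 3))        # 0.35 = (1 - alpha) / 2
print(round(p1 + best_p3, 3))   # P(B) = 0.53
```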