You know what, let's draw Venn diagrams. Venn diagrams always work. Well, nothing is ever going to work with you of course, but they tend to work in general so perhaps for other readers.
All we'll be needing here is the definition of a probability space, so open your textbook to page 1, or open a tab in your browser
here.
1. The universe
Behold, the universe! As you might have guessed, the term universe here doesn't mean the physical universe, it means the set of all possibilities we choose to consider. You can think of each pixel in that ellipse as representing a possibility. We'll call the universe U.
For example, suppose we are going to toss a coin. Then we might consider the possibilities of it landing either heads or tails, and some possibilities would include:
- "coin tumbles on the table, hits a glass on the middle of the table and ends up landing tails."
- "coin flips over a couple of times and ends up landing tails on the edge of the table."
- "coin tumbling about on the table with tails up, but then right before it stops someone bumps into the table, the coin flies off, bounces against the wall, and eventually comes to rest on the floor landing heads instead."
... and so on and so forth.
Note that we didn't consider, for example, "putting the coin back in your pocket without ever tossing it in the first place", but that's totally fine: the universe is freely chosen. Also note that even though, for simplicity and intuition, we're using a coin toss as our example, the universe can be entirely abstract, without any relation to the real world.
Some small notational asides:
- Statements in quotes (like the above) define subsets of the universe by being read as predicates: "Q" = {x in U | Q(x)}.
- In some textbooks (and on the wiki) the term sample space is used instead of universe; that usage is common in statistics. However, we're doing the general case here, so we'll stick with universe.
- Usually the term is outcomes rather than possibilities, but we'll only switch to that later, since the term outcome might lead to a misunderstanding at this stage.
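The quote-as-predicate convention is easy to mirror in code. Here's a minimal sketch in Python; the universe and its labels are invented for illustration, not standard terminology:

```python
# A toy finite universe: each element is one possibility we choose to consider.
U = {
    "coin hits a glass and lands tails",
    "coin lands tails on the table edge",
    "table is bumped and coin lands heads on the floor",
}

def event(predicate, universe):
    """ "Q" = {x in U | Q(x)}: a quoted statement read as a predicate."""
    return {x for x in universe if predicate(x)}

# The statement "the coin landed tails" picks out a subset of U.
tails = event(lambda x: "tails" in x, U)
```

Note that the universe being freely chosen corresponds to us freely choosing which strings to put in `U` in the first place.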
2. Events
The term event here, as you might expect, doesn't mean a physical event. Events are subsets of the universe, ie sets of possibilities.
Suppose we are considering the possibility that a coin may or may not have a soul. Every possibility we had earlier now gets split in two: for every "x" above we now have "x and coins have a soul" and "x and coins do not have a soul". Think of it as the U we had earlier becoming ~S and getting mirrored as S, with the new U being the combination of the two.
So in this case S is "coins have a soul" and ~S is "coins do not have a soul". For example in S we would have:
- "coin tumbles on the table, hits a glass on the middle of the table and ends up landing tails, and coins have souls"
- and so on...
and in ~S we would have:
- "coin tumbles on the table, hits a glass on the middle of the table and ends up landing tails, and coins do not have souls"
- and so on...
Now we can start using the proper term outcomes instead of possibilities. Remember that the two examples above would be indistinguishable to us, yet they are distinct outcomes. Outcomes here means possibilities we choose to consider; it doesn't necessarily mean singular, indistinguishable outcomes in an observational sense.
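The splitting above is just a product construction, and it's short enough to sketch. The names here are made up for illustration:

```python
from itertools import product

# Each old possibility x becomes "x and coins have souls" and
# "x and coins do not have souls": the new universe is a product.
old_U = {"coin lands heads", "coin lands tails"}
answers = ("coins have souls", "coins do not have souls")

new_U = {f"{x} and {a}" for x, a in product(old_U, answers)}

# S and ~S partition the new universe: set complements of each other.
S = {x for x in new_U if x.endswith("and coins have souls")}
not_S = new_U - S
```

Note that `S` and `not_S` together cover `new_U` and share no outcomes, which is exactly what defining one event and its complement by negating a predicate amounts to.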
3. Probability
Let's first define two more events, H and ~H. H is "the coin landed heads" and ~H is "the coin did not land heads" (ie landed tails). As you might have noticed by now, the notation ~ denotes set complement: for any event X, there is an event ~X which consists of all the outcomes in U but not in X. Put differently, the pair can be defined by negating the predicate, ie "coin landed heads" vs "coin did not land heads", or "coins have souls" vs "coins do not have souls". Defining one implicitly defines the other.
Now, for probability. A probability function is a function from events to real numbers between 0 and 1. The probability of the universe is by definition 1: P(U) = 1, and if two events share no outcomes, the probability of their union is the sum of their probabilities. In the Venn diagrams the relative area denotes relative probabilities, so P(S) = P(~S) = 0.5 and so on.
Note particularly that the domain of P is a set of events, called a sigma-algebra (but ignore that for now). Hypotheses do not have probabilities, data do not have probabilities, models do not have probabilities,
only events have probabilities.
However, as per point 2 above we do have a way to "translate" hypotheses or data into events. Consider the hypothesis "coins have a soul" which translates into the event S = {x in U | "coins have a soul"}. The same goes for data such as "the coin landed heads" which translates into the event H = {x in U | "the coin landed heads"}. With this you can see why there's no real distinction between hypotheses and data, because what has probabilities isn't hypotheses/data but sets of outcomes defined by asserting the hypothesis/data as a predicate on the universe.
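On a finite universe, a probability function is easy to sketch: assign each outcome a weight, with the weights summing to 1, and let P of an event be the sum over its outcomes. The weights below are invented for illustration, chosen so that P(S) = P(~S) = 0.5 as in the diagrams:

```python
# Outcomes are (landing, soul-status) pairs; the weights are made up.
weights = {
    ("heads", "soul"): 0.15,
    ("tails", "soul"): 0.35,
    ("heads", "no soul"): 0.25,
    ("tails", "no soul"): 0.25,
}
U = set(weights)

def P(event):
    """P maps events (subsets of U) to [0, 1], with P(U) = 1."""
    return sum(weights[x] for x in event)

# Hypotheses/data translated into events, as in the text:
S = {x for x in U if x[1] == "soul"}   # "coins have a soul"
H = {x for x in U if x[0] == "heads"}  # "the coin landed heads"
```

Note that `P` is only ever called on sets of outcomes; the strings "soul" and "heads" get their probabilities indirectly, via the events they define.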
4. Conditional Probability
Strictly speaking, a conditional probability isn't a probability at all; it's a bit of notation abuse. You might wonder why mathematicians would abuse notation for this. One answer is that the problem goes away in more general measure spaces, but there should still be a really good reason for notation abuse in probability spaces. The reason is that everything that came before is all well and good, but what we really want to do here is inference: we want to be able to use all this to learn from information we receive.
This is probably going to be best understood in terms of exclusion of possibilities, so here goes:
Suppose we tossed the coin and it landed heads. Remember that H is "the coin landed heads" and ~H is "the coin did not land heads", so given that we now consider "the coin landed heads" to be true, we also consider "the coin did not land heads" to be false.
So we take our eraser and go through U, erasing every possibility that is no longer a possibility. After all, if we now consider "the coin did not land heads" to be false, then for example "coin tumbles on the table, hits a glass on the middle of the table and ends up landing tails" isn't a possibility anymore. After we've done that, we end up with this new U:
We used to consider P(S) = P(~S) = 0.5, but now we'd like to know P(S') and P(~S'). That's the basis of inference, asserting a certain predicate to be true over the universe (or equivalently, excluding/erasing the possibilities for which the predicate is false) and finding the new P(X') for all X we were choosing to consider. Note that the basis for our assertion of said predicate doesn't have to be observation of data, if for example we had access to an oracle and it says "coins have souls" then we'd have erased ~S and ended up with S as the new U.
This is what conditional probabilities are for. That activity of erasing all possibilities that are not in an event is called conditioning on an event. For the case above we would have:
P(S') = P(S|H)
P(~S') = P(~S|H)
and so on. Here you can see the notation abuse: remember that only events have probabilities, yet S|H and ~S|H aren't events. This is a core feature of the whole framework that usually just gets passed over, but it's important to understand nevertheless. And this is where Bayes' theorem comes in, since it gives us an expression for those conditional probabilities:
P(S') = P(S|H) = P(S) * P(H|S) / P(H)
and so on. In our case above I've set P(H|S) < P(H|~S), basically the assumption that having a soul makes a coin more likely to land tails, and as such, when we assert "the coin landed heads" and pick up our eraser, we finally conclude P(~S') > P(S'). We've inferred that coins likely don't have souls by observing one landing heads.
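The whole erase-and-renormalize story can be checked numerically. A minimal sketch with invented weights, chosen so that P(H|S) < P(H|~S) as assumed above:

```python
# Made-up weights: souls make tails more likely, so P(H|S) < P(H|~S).
weights = {
    ("heads", "soul"): 0.15,
    ("tails", "soul"): 0.35,
    ("heads", "no soul"): 0.25,
    ("tails", "no soul"): 0.25,
}
U = set(weights)

def P(event):
    return sum(weights[x] for x in event)

S = {x for x in U if x[1] == "soul"}   # "coins have a soul"
H = {x for x in U if x[0] == "heads"}  # "the coin landed heads"

# Conditioning on H: erase everything outside H and renormalize,
# so the new universe is H and P(S') = P(S & H) / P(H).
P_S_new = P(S & H) / P(H)

# Bayes' theorem gives the same number: P(S|H) = P(S) * P(H|S) / P(H).
P_H_given_S = P(H & S) / P(S)
assert abs(P_S_new - P(S) * P_H_given_S / P(H)) < 1e-12

# P(S') = 0.15 / 0.40 = 0.375 < 0.5: observing heads lowered P(S).
```

The eraser metaphor is the division by P(H): throwing away the outcomes outside H shrinks the total mass to P(H), and renormalizing stretches what's left back up to a total of 1.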
5. Conclusion
Bayesian inference is simple. If you can think of having a bunch of possibilities, then starting to consider some predicate about them as true, taking your eraser, erasing all possibilities which have now become impossible, and going "This is my new universe now!" with whatever is left, then you understand the gist of it.
What to take home from this:
- The universe is freely chosen. (which by itself debunks that P(E) = 1 stuff)
- The universe does not need to be outcomes of an experiment, that's just one application but the theory is much more general and abstract. For example the universe could be some function space.
- Inference consists of asserting a predicate to be true over the universe, getting a new universe from that and updating all the probabilities of events you're choosing to consider. The basis for considering the predicate to be true does not need to be observational data, again that's just one application but the theory is much more general and abstract. For example the assertion could be a theorem we've proven in said function space.
- Don't let mish-mashed and confused expositions lead you to think that this is difficult or something, the core ideas here are perfectly simple.