Bayesian Decision Theory With Discrete Inputs

In Bayesian decision theory (BDT), we try to model some statistical phenomenon that may or may not be deterministic, depending on your ontological position [1]. Either way, we cannot model it precisely because we do not have all the information (there is at least one unobserved variable), so we treat it as random and apply probability theory to the information we do have (the observed variables). We gather data, count co-occurrences in our observations, and use the resulting ratios as probability estimates.

Bayes’ Rule

| Notation | Meaning |
|---|---|
| \(P(x)\) | Probability of \(x\) occurring. |
| \(P(x \mid y)\) | Probability of \(x\) occurring given that \(y\) has already occurred. |

So, say we have some data on observed variables and matching classifications – e.g. people’s income level (low/medium/high), savings (small/medium/large), and whether they belong to a class of high- or low-risk customers [2]. Some relevant terms for the probabilities we could derive from this data are:

| Term | Notation | Meaning |
|---|---|---|
| Prior | \(P(C_i)\) | Probability of a person belonging to class \(C_i\) before knowing anything about them. |
| Evidence/marginal | \(P(x)\) | Probability of \(x\) being a property of any person in general. |
| Likelihood | \(P(x \mid C_i)\) | Probability of \(x\) being a property of a person of class \(C_i\). |
| Posterior | \(P(C_i \mid x)\) | Probability of a person belonging to class \(C_i\) given that they have property \(x\). |

The first three of these we can learn from the data simply by counting how many times they occur. The last one, the posterior probability, is what we are interested in predicting for potential new customers. We can derive it using Bayes’ rule:

\[P(C_i|x) = \frac{P(C_i) P(x|C_i)}{P(x)} \]

Beyond giving this algorithm its name, Bayes’ rule is fundamental to statistics in general. It’s a good idea to look into it if you haven’t already.

Tip

Grant Sanderson of 3Blue1Brown made a great video on Bayes’ Theorem: watch it here.
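
To make the counting concrete, here is a minimal Python sketch (the data, names, and numbers are entirely hypothetical, just a stand-in for the income/savings example above): it estimates the prior, likelihood, and evidence by counting, then combines them with Bayes’ rule.

```python
from collections import Counter

# Hypothetical toy data: ((income, savings), risk label) pairs.
# Every probability below is estimated by counting, as described above.
data = [
    (("low", "small"), "high-risk"),
    (("low", "small"), "high-risk"),
    (("medium", "small"), "high-risk"),
    (("medium", "medium"), "low-risk"),
    (("high", "medium"), "low-risk"),
    (("high", "large"), "low-risk"),
]

n = len(data)
class_counts = Counter(label for _, label in data)    # for the priors P(C_i)
joint_counts = Counter(data)                          # for the likelihoods
feature_counts = Counter(x for x, _ in data)          # for the evidence P(x)

def posterior(c, x):
    """P(C_i | x) via Bayes' rule, with every term estimated by counting."""
    prior = class_counts[c] / n                          # P(C_i)
    likelihood = joint_counts[(x, c)] / class_counts[c]  # P(x | C_i)
    evidence = feature_counts[x] / n                     # P(x)
    return prior * likelihood / evidence

x = ("low", "small")
for c in class_counts:
    print(c, posterior(c, x))  # high-risk 1.0, low-risk 0.0
```

With only six records this is obviously crude – counting estimates only become trustworthy once there are many observations per feature combination.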

Now that we know the posterior for every class given a person’s features, we can just pick the class with the highest probability and make our decision based on that, right? Wrong! We are working with probabilities here, so even when all the arrows point in one direction, the end result can subvert our expectations!
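
For reference, that tempting rule – pick the class with the largest posterior, the so-called maximum a posteriori (MAP) decision – would be:

\[\text{choose } C_i \text{ if } P(C_i | x) = \max_k P(C_k | x)\]

The next section adds the missing ingredient: what each decision costs when it goes wrong.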

Cost and risk

Decisions can be costly, even if you believe you are making the right call. If \(\alpha_i\) is the decision to classify some input as \(C_i\) and \(\lambda_{ik}\) is the loss incurred when it turns out it was actually \(C_k\), we define the expected risk \(R\) for this action as:

\[R(\alpha_i | x) = \sum_{k=1}^K \lambda_{ik} P(C_k | x)\]

We may then choose the action \(\alpha_i\) with the minimum expected risk out of all possible actions.

The mapping of losses to decisions (\(\lambda_{ik}\)) is called a loss function. If there is no loss when you pick the right class – great! \(\lambda_{ik}\) will be \(0\) when \(k = i\), so the term \(\lambda_{ii} P(C_i | x)\) simply drops out of the expected risk sum. If there is still some cost associated with picking the right class, also great! That cost is factored into the expected risk assessment; either way, we pick the action with the minimum expected risk.
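
As a sketch of how the risk computation plays out in code (continuing the hypothetical example; the loss matrix and posteriors below are made up for illustration):

```python
# Hypothetical loss matrix: row i is the action "classify as classes[i]",
# column k is the true class. Zero on the diagonal (no loss when correct);
# letting a high-risk customer pass as low-risk is the expensive mistake.
classes = ["high-risk", "low-risk"]
loss = [
    [0.0, 1.0],  # decide "high-risk": costs 1 if the customer is low-risk
    [5.0, 0.0],  # decide "low-risk": costs 5 if the customer is high-risk
]

posteriors = [0.3, 0.7]  # assumed P(C_k | x), indexed like `classes`

def expected_risk(i):
    """R(alpha_i | x) = sum over k of lambda_ik * P(C_k | x)."""
    return sum(loss[i][k] * posteriors[k] for k in range(len(classes)))

risks = [expected_risk(i) for i in range(len(classes))]
print(risks)                            # [0.7, 1.5]
print(classes[risks.index(min(risks))])  # high-risk
```

Note how the asymmetric loss overrules the raw posteriors: \(P(C_\text{low-risk} | x) = 0.7\) is higher, yet misclassifying a high-risk customer is expensive enough that the minimum-risk action is still to flag them as high-risk.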

Reject: noping out

The cost of a wrong decision may sometimes be so high that you only want the classifier to make that decision if the calculated posterior probability is above a certain level. But what if none of the posteriors exceeds the desired certainty? You instruct the algorithm to nope out – the _reject_ case, defined as an additional action \(\alpha_{K+1}\). The classification job will then have to be carried out some other way, perhaps by a human.

If this option is in play, the expected risk for rejecting is defined simply as a constant: \(R(\alpha_{K+1} | x) = \lambda\). Under a 0/1 loss (no loss for a correct decision, a loss of \(1\) for a wrong one), the expected risk of every other action is \(R(\alpha_i | x) = 1 - P(C_i | x)\), which lies between \(0\) and \(1\). The meaningful tuning range of \(\lambda\) therefore lies between \(0\) – where there is no risk in rejecting, so we always reject – and \(1\) – where rejecting is so risky we never reject.
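
A minimal sketch of the reject rule, assuming the 0/1 loss just described (names and thresholds hypothetical):

```python
# Reject option under 0/1 loss: the risk of picking class c is 1 - P(c | x),
# and the risk of rejecting is the constant lam. Pick whichever is smaller.
def decide(posteriors, lam):
    """posteriors: dict mapping class -> P(C | x); lam: reject loss in (0, 1)."""
    best_class = max(posteriors, key=posteriors.get)
    best_risk = 1.0 - posteriors[best_class]  # minimum achievable classify risk
    return best_class if best_risk < lam else "reject"

print(decide({"high-risk": 0.55, "low-risk": 0.45}, lam=0.3))  # reject
print(decide({"high-risk": 0.90, "low-risk": 0.10}, lam=0.3))  # high-risk
```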

There’s more

So far I have only discussed the case where the inputs are discrete. But what if you have a continuous input, like the temperature or exact level of income? To be continued (possibly).

Footnotes

[1]

“Things are just those things in a stable way. Reality is governed by laws.” versus “There is no such thing as things being things. There are no laws that things have to be logical or consistent. It just seems that way for reasons \(x, y, z\).”

[2]

This is going to be based on some other calculation of their history that we are not thinking about too much right now. It’s a good/bad variable. Look, you are being educated to be a money-making office monkey, not to think about ethics, yugh.