# A polynomial account of Bayesian update

One of the things I do at Topos is make sense of some aspect of the world by articulating it in mathematics. Follow along as I make sense of Bayesian update using the mathematics of polynomial functors.

Mathematical accounting systems are widely used by practitioners to negotiate. For example, stock and flow diagrams account for epidemics and stakeholders use them to negotiate public health policy. Differential equations account for physical laws and scientists use them to hypothesize and run experiments. While these are robust and formal systems, they are siloed. One part of our work at Topos is designing meta-accounting systems — i.e. grammars which house many domain specific grammars. An early example of a meta-grammar is the framework of decorated cospan categories (Fong 2015; J. C. Baez, Courser, and Vasilakopoulou 2022), which has been used to define grammars for composing stock and flow diagrams (J. Baez et al. 2023), Hamiltonian and Lagrangian systems (J. C. Baez, Weisbart, and Yassine 2021), and much more.

In the last few years, David has been exploring *polynomial functors* as an exceedingly rich meta-language that can express mathematics for domains like databases, path-finding, and deep learning. In this post, we will show you how polynomial functors account for probabilistic conditioning and Bayesian update. This formal account clarifies the joints of Bayesian update by foregrounding its key ingredients and its assumptions. These joints are important for my own understanding — *what do I care about?* — and communication — *what can we negotiate about?*.

The practicality of a mathematical accounting system is limited by its ergonomics. While a particular branch of mathematics may be powerful as a meta-language, if it is abstruse and requires years of training to be able to write grammatical symbols on the page, then it is not going to help our negotiations more widely. One beautiful thing about polynomial functors is that they are notationally friendly. Grammatical symbols and diagrams are interpretable without a deep understanding of the mathematics that underwrites them. Yet it’s critical that mathematics *does* underwrite them. It means that any symbol can be (metaphorically) *double clicked on* and its formal meaning elaborated. Formality affords precision and takes errors seriously, both of which are necessary for a grounded negotiation.

I’ve double clicked on some of the ideas in the footnotes. If you’d like to double click on anything else, ask in the comments!

## 1 Polynomial functors

A **polynomial functor** consists of a set of **positions**, and for every position a set of **directions**.^{1} We often notate polynomials using a corolla forest, where each position is a corolla whose leaves correspond to the position’s directions.

This is a polynomial with seven positions. The positions have 4, 2, 2, 2, 1, 0, and 0 directions, respectively. We commonly use algebraic notation to express a polynomial.^{2}

This polynomial looks like the kind of polynomial we would see in algebra class, but there are lots of other polynomials! Like this one:^{3}

Because polynomial functors are a meta-grammar, positions and directions have different interpretations depending on the grammar that they are used to describe. Depending on the context a polynomial may represent:

- *Questions*, and a set of *answers* for every question.
- *Problems*, and a set of *solutions* for every problem.
- *Points*, and a space of *tangent vectors* for every point.
- *Arguments*, and a set of *return values* for every argument.
- *Game states*, and a set of *moves* for every game state.

What are maps between polynomials? They are *forwards on positions and backwards on directions*.^{4} Here’s an example of a polynomial map \mathcal{y}^4 + \mathcal{y}^2 \to \mathcal{y}^3 + \mathcal{y}^2 + \mathcal{y}.
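Here is one way to sketch such a map in Python. The representation (a polynomial as a list of direction-sets, a map as a forward position assignment plus backward direction assignments) and the particular choices of assignments are my own illustrative assumptions, not taken from the post.

```python
# A polynomial as a list of direction-sets, one per position.
p = [set(range(4)), set(range(2))]               # y^4 + y^2
q = [set(range(3)), set(range(2)), set(range(1))]  # y^3 + y^2 + y

# Forwards on positions: position i of p is sent to on_positions[i] of q.
# (This particular assignment is made up for illustration.)
on_positions = [1, 2]

# Backwards on directions: for each position i of p, a function sending every
# direction of q at on_positions[i] back to a direction of p at i.
on_directions = [
    {0: 3, 1: 0},  # q-position 1 has directions {0, 1}; send them into p-position 0
    {0: 1},        # q-position 2 has direction {0}; send it into p-position 1
]

def check_map(p, q, on_positions, on_directions):
    """Verify that (on_positions, on_directions) is a well-formed map p -> q."""
    for i, dirs in enumerate(p):
        back = on_directions[i]
        assert set(back) == q[on_positions[i]]  # defined on all of q's directions
        assert set(back.values()) <= dirs       # lands in p's directions
    return True

print(check_map(p, q, on_positions, on_directions))  # True
```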

We can also create new polynomials from old ones, using the *\mathbin{\triangleleft}-monoidal product*,^{5} aka by stacking trees.

It is not a coincidence that (\mathcal{y}^2 + \mathcal{y}+1) \mathbin{\triangleleft}(\mathcal{y}^2 + 2\mathcal{y}) = (\mathcal{y}^2 + 2\mathcal{y})^2 + (\mathcal{y}^2 + 2\mathcal{y}) + 1.^{6} Hence we often call \mathbin{\triangleleft} the *substitution monoidal product* because we’re substituting one polynomial into another.

We can also use the \mathbin{\triangleleft}-monoidal product several times to create new polynomials. For example,
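To make the substitution formula concrete, here is a small Python sketch (my own, not from the post) that computes the direction counts of p \mathbin{\triangleleft}q from the formula in the footnote and checks the identity (\mathcal{y}^2 + \mathcal{y}+1) \mathbin{\triangleleft}(\mathcal{y}^2 + 2\mathcal{y}) = (\mathcal{y}^2 + 2\mathcal{y})^2 + (\mathcal{y}^2 + 2\mathcal{y}) + 1.

```python
from itertools import product

def compose(p, q):
    """Direction counts of p ◁ q. A polynomial is given as a list of
    direction counts, one entry per position."""
    result = []
    for dirs_p in p:
        # A composite position is a position P of p together with a function f
        # from the directions of P to the positions of q.
        for f in product(range(len(q)), repeat=dirs_p):
            # Its directions: a direction d of P paired with a direction of q at f(d).
            result.append(sum(q[fd] for fd in f))
    return sorted(result, reverse=True)

p = [2, 1, 0]  # y^2 + y + 1
q = [2, 1, 1]  # y^2 + 2y

# Matches (y^2 + 2y)^2 + (y^2 + 2y) + 1 = y^4 + 4y^3 + 5y^2 + 2y + 1:
print(compose(p, q))  # [4, 3, 3, 3, 3, 2, 2, 2, 2, 2, 1, 1, 0]
```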

## 2 Bayesian update

I’d like to introduce you to a few polynomials that form the cast of the story we’ll tell about Bayesian update.

- The **maybe monad** is the polynomial \mathcal{y}+ 1. It has two positions, one with a single direction (which we call *continue*) and one with no directions (which we call *error*).^{7}
- For any set A, we can construct polynomials A\mathcal{y} and \mathcal{y}^A. For example, for A = \{a,b,c\} we have:

- The **lottery monad** is the polynomial \mathrm{lott}\coloneqq \sum_{N: \mathbb{N}} \sum_{P: \Delta_N} \mathcal{y}^N, where \Delta_N is the set of probability distributions on an N-element set.
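In code, a position of \mathrm{lott} can be sketched as a pair of a ticket count and a distribution over the tickets; its directions are the tickets themselves. This representation is my own.

```python
# A position of lott: a ticket count N together with a probability
# distribution over the N tickets; its directions are the tickets 0..N-1.
lottery = (3, [0.5, 0.3, 0.2])

def is_lottery(position):
    """Check that a (count, distribution) pair is a valid position of lott."""
    n, dist = position
    return (len(dist) == n
            and all(p >= 0 for p in dist)
            and abs(sum(dist) - 1.0) < 1e-9)

print(is_lottery(lottery))          # True
print(is_lottery((2, [0.7, 0.7])))  # False: mass doesn't sum to 1
```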

The hero of our story is a conditioning map \kappa\colon \mathrm{lott}\mathbin{\triangleleft}(\mathcal{y}+ 1) \to (\mathcal{y}+1) \mathbin{\triangleleft}\mathrm{lott} that is defined by renormalization. On positions, it maps a lottery where each ticket is assigned either *continue* or *error* to:

- The *error* position, if there is zero probability mass on the tickets that do not error.
- And otherwise to *continue* followed by the lottery whose:
  - tickets are the tickets assigned a *continue* token, and
  - distribution on tickets is the renormalized distribution on the continuing tickets.

For example,
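As a sketch in Python (representing a lottery as a dictionary from tickets to probabilities; the representation and names are my own assumptions):

```python
def condition(dist, tags):
    """κ on positions: dist maps tickets to probabilities; tags marks each
    ticket 'continue' or 'error'. Renormalize onto the continuing tickets."""
    mass = sum(p for t, p in dist.items() if tags[t] == "continue")
    if mass == 0:
        return "error"  # no probability mass on continuing tickets
    return {t: p / mass for t, p in dist.items() if tags[t] == "continue"}

dist = {"a": 0.5, "b": 0.3, "c": 0.2}
print(condition(dist, {"a": "continue", "b": "error", "c": "continue"}))
# a and c keep their 5:2 ratio -> {'a': 0.714..., 'c': 0.285...}
print(condition(dist, {"a": "error", "b": "error", "c": "error"}))  # 'error'
```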

The conditioning map is the building block of a polynomial map that conditions on symbol matching. We start with a map A\mathcal{y}\mathbin{\triangleleft}\mathrm{lott}\mathbin{\triangleleft}A\mathcal{y}\to (\mathcal{y}+ 1)\mathbin{\triangleleft}\mathrm{lott} that is defined to be the following composite.

This map induces^{8} a map

\mathrm{lott}\mathbin{\triangleleft}A\mathcal{y}\to \mathcal{y}^A \mathbin{\triangleleft}(\mathcal{y}+ 1) \mathbin{\triangleleft}\mathrm{lott}. \tag{1}

This map takes a lottery in which every ticket is assigned an element of A. Then for each direction a of \mathcal{y}^A, it errors if the tickets assigned a have zero probability mass, and otherwise returns the renormalized lottery containing only those tickets assigned a.
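A Python sketch of this map on positions, built from the same renormalization as the conditioning map (the dictionary representation is my own):

```python
def condition_on_symbol(dist, labels, a):
    """Sketch of Equation 1 on positions: dist maps tickets to probabilities,
    labels maps tickets to elements of A, and a is the observed symbol."""
    # First tag each ticket: continue iff its label matches the observed symbol...
    tags = {t: ("continue" if labels[t] == a else "error") for t in dist}
    # ...then renormalize onto the continuing tickets, erroring on zero mass.
    mass = sum(p for t, p in dist.items() if tags[t] == "continue")
    if mass == 0:
        return "error"
    return {t: p / mass for t, p in dist.items() if tags[t] == "continue"}

dist = {"t1": 0.6, "t2": 0.4}
print(condition_on_symbol(dist, {"t1": "x", "t2": "y"}, "x"))  # {'t1': 1.0}
print(condition_on_symbol(dist, {"t1": "x", "t2": "x"}, "y"))  # 'error'
```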

Here’s how I understand a classic example of Bayesian update.

My friend is tossing a coin. Each coin toss returns the *data* of either heads or tails. Let D = \{\text{heads}, \text{tails}\}. I have two *hypotheses*: it is either a fair coin or a biased, all-heads coin. Let H = \{\text{fair}, \text{biased}\}. For each hypothesis, I have a distribution over data. These are summarized by a *likelihood* P(D | H), such as^{9}

\begin{aligned} P(D = \text{heads}| H = \text{fair}) = 0.5, & \quad P(D = \text{heads}| H = \text{biased}) = 1.0 \\ P(D = \text{tails}| H = \text{fair}) = 0.5, & \quad P(D = \text{tails}| H = \text{biased}) = 0.0 \end{aligned}

I start with a guess about the distribution on hypotheses, aka a *prior* P(H). Say I’m fairly confident that my friend has a fair coin:

P(H = \text{biased}) = 0.1, \quad P(H = \text{fair}) = 0.9.

My friend tosses the coin, and I receive a piece of data d \in D. Then I update the prior to be the *posterior* for each hypothesis h \in H:

P(H = h | D = d) = \frac{P(D = d | H = h)\, P(H = h)}{P(D = d)}.

As we repeat this process, my prior will converge to the truth of whether the coin is biased or not.
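We can check these numbers with a few lines of Python; this is a direct transcription of the formula above, not the polynomial machinery.

```python
# Likelihood P(D | H) and prior P(H) from the coin example.
likelihood = {("heads", "fair"): 0.5, ("heads", "biased"): 1.0,
              ("tails", "fair"): 0.5, ("tails", "biased"): 0.0}
prior = {"fair": 0.9, "biased": 0.1}

def posterior(prior, likelihood, d):
    """Bayes' rule: P(H = h | D = d) = P(D = d | H = h) P(H = h) / P(D = d)."""
    evidence = sum(likelihood[(d, h)] * prior[h] for h in prior)  # P(D = d)
    return {h: likelihood[(d, h)] * prior[h] / evidence for h in prior}

after_heads = posterior(prior, likelihood, "heads")
print(after_heads)  # P(biased) rises from 0.1 to 0.1/0.55 ≈ 0.18

after_tails = posterior(after_heads, likelihood, "tails")
print(after_tails)  # a tails rules out the all-heads coin: P(fair) = 1.0
```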

Let’s give the same example but with polynomials. First, we will interpret the data, hypotheses, likelihood, and prior using our established language.

Then, we define a polynomial map whose key ingredient is the conditioning on symbols map from Equation 1.

In the first example of this map, seeing a heads increases the prior of a biased coin from 0.1 to 0.18. Conversely, seeing a tails guarantees that the coin is fair.

In the second example, we start with a prior that guarantees that the coin is biased. Then seeing a tails produces an error. This behavior is an example of **Bayesian hell**, in which computing the posterior involves division by zero.

This polynomial map is a \mathcal{y}^D \mathbin{\triangleleft}(\mathcal{y}+ 1)-coalgebra with states \mathrm{lott}\mathbin{\triangleleft}H. In other words, it is a dynamical system whose states are the priors, \mathrm{lott}\mathbin{\triangleleft}H, and whose interface is described by \mathcal{y}^D \mathbin{\triangleleft}(\mathcal{y}+ 1). The interface of the system receives a piece of data in D and then either updates the state or errors.
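A Python sketch of this coalgebra for the coin example: the state is the prior, and each step either conditions on the received data or lands in the *error* of the maybe monad. The representation is my own assumption.

```python
likelihood = {"fair":   {"heads": 0.5, "tails": 0.5},
              "biased": {"heads": 1.0, "tails": 0.0}}

def step(state, d):
    """One tick of the y^D ◁ (y + 1)-coalgebra: condition the prior on data d."""
    evidence = sum(likelihood[h][d] * p for h, p in state.items())
    if evidence == 0:
        return "error"  # Bayesian hell: no live hypothesis gives d positive probability
    return {h: likelihood[h][d] * p / evidence for h, p in state.items()}

state = {"fair": 0.9, "biased": 0.1}
for d in ["heads", "heads", "heads"]:
    state = step(state, d)   # P(biased) climbs with each observed heads
print(state)

print(step({"fair": 0.0, "biased": 1.0}, "tails"))  # 'error'
```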

## 3 What else?

I hope this post gives you a sense of what I mean when I say “I use math (in this case, polynomial functors) to understand what I care about (in this case, learning via Bayesian update).” Importantly, the math highlights a process for adjusting or extending this articulation. For example, if I want to talk about hierarchical Bayesian update, then:

- Instead of hypotheses and data, I start with a hierarchy of hypotheses H_0, \cdots, H_n where H_{k} is the data to update the hypothesis H_{k+1}.
- Instead of a single likelihood map, I start with likelihoods H_{k+1} \to \mathrm{lott}\mathbin{\triangleleft}H_k for k = 0, \cdots, n - 1. And my prior is an element of \mathrm{lott}\mathbin{\triangleleft}H_n.
- I can apply Equation 1 several times to define a map \mathrm{lott}\mathbin{\triangleleft}H_n \to \mathcal{y}^{H_0} \mathbin{\triangleleft}(\mathcal{y}+ 1) \mathbin{\triangleleft}\mathrm{lott}\mathbin{\triangleleft}H_n.

Or suppose I want to talk about active inference, where actions can update the hypothesis. We can promote the likelihood to the models of action, sensory data, and hypothesis dynamics that traditionally describe active inference.^{10} Then, using an appropriate conditioning map, we can turn those into a coalgebra that receives data and outputs actions.

## References

Baez, John C., Kenny Courser, and Christina Vasilakopoulou. 2022. “Structured versus Decorated Cospans.” *Compositionality* 4 (3): 39. https://doi.org/10.32408/compositionality-4-3.

Baez, John C., David Weisbart, and Adam Yassine. 2021. “Open Systems in Classical Mechanics.” *Journal of Mathematical Physics* 62 (4). https://doi.org/10.1063/5.0029885.

Baez, John, Xiaoyan Li, Sophie Libkind, Nathaniel Osgood, and Evan Patterson. 2023. “Compositional Modeling with Stock and Flow Diagrams.” In *Proceedings—Fifth International Conference on Applied Category Theory*, 380:77–96. Electron. Proc. Theor. Comput. Sci. EPTCS.

Fong, Brendan. 2015. “Decorated Cospans.” *Theory Appl. Categ.* 30: Paper No. 33, 1096–1120.

## Footnotes

We write a polynomial as p = \sum_{P: p(1)} \mathcal{y}^{p[P]}. The set p(1) are the positions of p. For a position P: p(1), the set p[P] are the directions at position P.↩︎

The notation \mathcal{y}^4 + 3\mathcal{y}^2 + \mathcal{y}+ 1 explains why we call these things *polynomials*. But why polynomial *functors*? Each polynomial expresses a functor from \mathbf{Set} to itself. The polynomial p = \sum_{P: p(1)}\mathcal{y}^{p[P]} takes a set X to the set p(X) \coloneqq \sum_{P: p(1)}X^{p[P]}. Elements of p(X) are pairs (P: p(1), f: p[P] \to X).↩︎

There are lots of ways of writing the same polynomial. In particular, \mathbb{N} \mathcal{y}^{\mathrm{Bool}} = \sum_{N: \mathbb{N}} \mathcal{y}^{\mathrm{Bool}} = \sum_{N: \mathbb{N}} \{N\} \mathcal{y}^{\mathrm{Bool}}. Since \{N\} is a one-element set, it replaces the implicit 1 in front of each \mathcal{y}^\mathrm{Bool}. Naming the positions like this helps distinguish them in the corolla forest notation.↩︎

A map \phi: p \to q is a natural transformation between p and q as functors. Explicitly it consists of a map of positions \phi_1: p(1) \to q(1) and then for each position P: p(1), a map of directions \phi^\#_P: q[\phi_1(P)] \to p[P].↩︎

The polynomial p \mathbin{\triangleleft}q is defined to be \sum_{P:p(1)} \prod_{p[P]} \sum_{Q: q(1)} \prod_{q[Q]}\mathcal{y}.↩︎

If you consider p and q as endofunctors on \mathbf{Set}, then the functor underlying p \mathbin{\triangleleft}q is their composite.↩︎

Here is some intuition from programming languages about why this polynomial is called *maybe*. Suppose that the positions of p represent arguments and the directions represent return values. Then a map p \to \mathcal{y}+ 1 takes every argument to either *continue* or *error*, and for arguments mapped to *continue* selects a return value.↩︎

In general, a polynomial map A\mathcal{y}\mathbin{\triangleleft}p \to q induces a polynomial map p \to \mathcal{y}^A \mathbin{\triangleleft}q.↩︎

In the example given here, the biased coin in fact only turns up heads. We’re choosing this example so that we can show off how our polynomial story of Bayesian inference accounts for Bayesian hell.↩︎