Data Operations are Functorial Semantics

double categories
databases
knowledge representation
Author

Michael Lambert

Published

2022-09-06

Abstract

Double categories illustrate that data munging is functorial semantics. This post reviews the functional and relational approaches to ologs and database schemas. Double-categorical ologs are meant to incorporate aspects of each. These should also be viewed as double-categorical database schemas and a data instance is a structure-preserving double functor valued in relations. Double-categorical ologs will be cartesian equipments which may or may not be asked to have tabulators. Viewing a double category as a theory and an instance as a model, usual data operations such as select, filter and join are seen to be captured as a result of the fact that the data instance is structure-preserving. This is illustrated via a number of examples.

My goal in this post is to convince you that double categories are good for knowledge representation, databases, and data manipulation. I’ll explain how using double categories and looking at things from the standpoint of model theory allows us to combine the best aspects of some existing approaches. The take-away is that although understanding the double category machinery requires a bit of overhead investement, the payoff is worth it.

For me, a double category \mathbb D has objects, arrows, proarrows and cells. The proarrows compose but associativity holds only up to coherent isomorphism. A typical cell is of the form

and has m and n as its (internal) domain and codomain; and f and g as its (external) source and target. It’s OK if you want to think of generic double categories as being strict since the implementations I’m discussing will be strict. This post has a bit of theory, but I’ve tried to illustrate it with examples from the Stargate franchise. This is more fun than the standard employee-employer examples you see around. You can browse SGCommand for some background.

1 What is Knowledge Representation?

I’m understanding “knowledge representation” in terms of “ontology logs”, or “ologs” for short, which are represented mathematically as diagrams relating concepts and their properties or attributes (Kent and Spivak 2012). Abstract noun phrases are typically represented as labelled boxes and then arrows can represent attributes, properties or relations between them. A simple olog is one from (Kent and Spivak 2012) about the amino acid arginine:

These ologs are thought of as expressing entities, functional relationships between entities, and facts about those. But not every relationship between concepts that we might like to model is functional. “Friend of” or “enemy of,” for example, are both honest relations on the type of persons. Relations form a different structure that ends up being something like a bicategory. Evan Patterson, my supervisor at Topos, wrote a paper formalizing this approach as “relational ologs” (Patterson 2017) using the notion of a bicategory of relations (Carboni and Walters 1987).

Functional ologs are defined as (finitely presented) categories, whereas relational ologs involve more machinery. The latter end up being (finitely presented) compact closed symmetric monoidal categories with extra diagonal and codiagonal structure built into being a bicategory of relations. Functional ologs are thought of as database schemas (Spivak 2012) and are “instanced” by set-valued functors, which populates the schema with actual data. Relational ologs are thought of as bicategorical database schemas and are instanced by homomorphisms to the bicategory of relations. Data migration for functional ologs is handled by left and right Kan extensions; the analogue for relational ologs isn’t clear. Most of the examples of functional data migration are operations such as the “select” and “join” operations.

My thesis is that double categories combine these two approaches, retain their strengths, increase expressiveness, and make some things easier. Partly this is just because double categories generalize categories and bicategories simultaneously. The underlying category \mathbb D_0 of a double category \mathbb D is just that: an ordinary category. I’m thinking of it as the “functional part” of \mathbb D. On the other hand, every double category has an underlying bicategory consisting of its proarrows and those cells whose source and target are identity arrows. So, I don’t think it’s too much of a stretch to say that a suitably structured finitely-presented double category could be an olog or database schema. A data instance would then be a double functor \mathbb D \to \mathbb{R}\mathbf{el} valued in the double category of relations.

2 Model Theory

What I’d like to add to this picture is some perspective from categorical model theory. The phrase “functorial semantics” casts a broad net over a pattern of results at least in the development of algebraic theories, finite-limit theories, and the semantics of type theories. This pattern is that a given structured category \mathcal C can be thought of a theory and that a structure-preserving functor \mathcal C\to\mathcal D is a model in a suitably structured category \mathcal D. Fibrational semantics takes this one step further by viewing fibrations as theories and morphisms of fibrations as models (Jacobs 1999).

Recall that an equipment \mathbb D is a double category for which the source-target projection \mathbb D_1\to\mathbb D_0\times\mathbb D_0 is a fibration (Shulman 2008). I want to view an equipment \mathbb D as a theory: the base category \mathbb D_0 has the types and terms and \mathbb D_1 has the propositions and deductions. A structure-preserving double functor \mathbb D\to\mathbb{R}\mathbf{el} is thus a model in the double category of relations. So, if we’re willing to ask that data instances are models in this sense, then we’re led to an amazing result: data operations are functorial semantics. I’ll try to illustrate this fact throughout the rest of the post.

3 Cartesian Equipments

A nice characterization of an equipment \mathbb D is that every ordinary arrow f\colon A\to B has a proarrow companion and a proarrow conjoint. These are proarrows f_! and f^* equipped with special cells:

whose composites satisfy some equations. Here I’m denoting the external identity proarrows by the same letter as the corresponding object. In \mathbb{R}\mathbf{el}, the companion and conjoint of an ordinary function f \colon A \to B are its graph \langle 1, f\rangle \colon A\to A\times B and cograph \langle f,1\rangle\colon A\to B\times A.

More complicated “restriction” and “extension” cells can be constructed from companions and conjoints. Every niche (the figure on the left immediately below) completes to a chosen restriction cell:

and every coniche completes to a chosen extension cell:

In \mathbb{R}\mathbf{el}, a generic cell is a commutative square of the form on the right:

Here “\langle s,t\rangle” in each case denotes the source and target of the corresponding relation, and the function \theta: R \to S is unique whenever it exists. You can see from this what niches and coniches look like in \mathbb{R}\mathbf{el}:

So, given a niche as above, you can get the retriction by forming a pullback and extensions by taking an image:

It turns out that restrictions are cartesian and extensions are opcartesian with respect to the source-target projection \mathbb D_1\to\mathbb D_0\times\mathbb D_0 and make it into a bifibration. Since I haven’t written out exactly what it means for a cell to be (op)cartesian, I hope the examples in relations provide a sense. Restrictions are like pullbacks: they are limit-like things having a “mapping into” universal property whereas extensions are like colimits (images are coequalizers in e.g. toposes) having a “mapping out of” universal property.

A cartesian double category (Aleiferi 2018) is a double category \mathbb D for which the double functors \Delta\colon \mathbb D\to\mathbb D\times\mathbb D and !\colon \mathbb D\to 1 both have right adjoints in the 2-category of double categories, pseudo double functors and their transformations. The double categories of relations, of spans, and of profunctors are all cartesian. In general, to be cartesian is just that both categories \mathbb D_0 and \mathbb D_1 have finite products in a coherent way. If a cartesian double category is also an equipment and locally a poset, there’s a lot more to say. For example, the bicategory underlying any locally posetal cartesian equipment is a “cartesian bicategory” (Carboni and Walters 1987). Likewise, the underlying bicategory of a “double category of relations” is a bicategory of relations (Lambert 2021). A double-categorical olog could thus be defined as a finitely-presented “double category of relations.”

Just having a cartesian equipment lets you do a lot. For example, it turns out that each hom category \mathbb D(A,B) has finite products. These categories \mathbb D(A,B) consist of all proarrows from A to B and all “globular” cells, whose source and target morphisms are identities on A and B. Both the terminal object and products are given by restriction:

I like to refer to p\wedge q as the local product of p and q. If you use the formula for restrictions displayed above, the local product turns out to be the composite \Delta_!\otimes (p\times q)\otimes \Delta^* with companions and conjoints of the diagonals, which is exactly the formula for local products in cartesian bicategories (Carboni and Walters 1987).

4 A Couple Amazing Facts

So, now that the high-level stuff is out of the way, I’m going to start to show you how cartesian equipments can be used to do olog-type reasoning and perform operations on data. Along the way you may be wondering why all these constructions work. Basically, there are two underlying facts. When I present a double-categorical olog by some types, arrows, proarrows and cells, it should be thought of as presenting a cartesian equipment. But how does this work? Basically the explanation is that we’ve programmed a generalized algebraic theory in Catlab for cartesian equipments. Tabulators are done too. The first amazing fact is that this implementation allows you to specify objects, arrows, proarrows and cells, and then work with all the constructions available in a cartesian equipment.

So, when we give generators and relations for an olog, we get a cartesian equipment \mathbb D, which is strict as a double category. I’ll define a data instance to be a strict double functor M\colon\mathbb D\to\mathbb{R}\mathbf{el} that preserves the cartesian structure. But wait, you might think, isn’t this supposed to be a “model?” It looks like a model, but is it a morphism of fibrations? Well, the second amazing fact is that it is one. This is because pseudo, hence strict, double functors preserve all restriction and extension cells (Shulman 2008, Proposition 6.4). So, when we take a restriction or extension in our olog, the corresponding computation on actual data in \mathbb{R}\mathbf{el} will just be computing a pullback or an image. I’ll use this fact freely below.

5 Tabulators

Companions and conjoints take ordinary arrows to proarrows. What makes it possible to go back is tabulators. These are not part of the theory of functional ologs and although bicategories of relations can be “functionally complete” you get “maps” as a special class of relations, not just as ordinary arrows. In a double category \mathbb D, tabulators are given by a right adjoint \top\colon \mathbb D_1\to\mathbb D_0 to the external identity functor \mathrm{id}: \mathbb D_0\to\mathbb D_1. In more detail, the tabulator of a proarrow p is a special cell

that is universal among all such cells of that form. The tabulator of a relation viewed as a monic arrow R\to A\times B is the domain set R. For a span it is the apex set. In profunctors, tabulators are given by the category-of-elements construction.

Under the fibrational semantics interpretation, tabulators take propositions to types. You can think of the construction as associating to each proposition the type of thing for which the proposition is true. So, it’s like a generalized comprehension scheme (Jacobs 1999, chap. 9 and 10). Say for example you have some notion of what it means to be a Goa’uld and you include this in an olog as a proposition

Perhaps you’d like to reason about symbiotes of that particular type, or at least relate symbiotes of that type to other types of some olog. In this case you could take a tabulator

and now you have a type to work with. Tabulators are strong when the extension l^*\otimes r_! is (isomorphic to) the original proarrow p. You can always take such an extension, but the “strong” assumption ensures that you return to the proarrow you started with. This is like taking the elements construction of a set-valued functor and then the fibers construction of the resulting fibration to get back the original functor; or taking the characteristic map of the subobject classified by an \Omega-valued morphism in a topos to get back the original morphism.

When designing ologs, it’s usually the case that you have to make some choices: which things are types, which are propositions/relations etc. But when you ask for tabulators that are at least strong, this allows you to a certain extent to just make any choice and then go back and forth between types and propositions as needed. I say “to a certain extent” only because with strong tabulators, the operations are inverse only in one direction. You need a bit more to start, say, with a relation, take an extension along the legs and then tabulate to get back to the original relation. This is basically what I’ve called a subobject comprehension schema elsewhere (Lambert 2021). It’s a powerful assumption that I won’t adopt here since it basically makes your double category into one of the form \mathbb{R}\mathbf{el}(\mathcal E) for some regular category \mathcal E.

6 Facts

In a functional olog, a fact is a commutative diagram. That is, viewing arrows as attributes, a fact expresses that two complex (i.e. composite) attributes are the same. Here’s an example lifted directly from (Kent and Spivak 2012):

I’d like to explore facts a bit. Since cartesian double categories in particular have terminals, we can talk about constants. Such constants can be added to an olog when we have particular individuals or things in mind. This is consistent with what you’d do in a classical first-order signature. So, let’s posit two individuals, namely, Teal’c and Apophis whom we include in the olog with constants

In the functional olog style, the fact that Teal’c is (or rather was) the First Prime of Apophis can be expressed as a commutative diagram

The above phrasing seems to assume that to each System Lord we can assign a unique First Prime (maybe we’re looking at a fixed moment in time). But what if at that moment in time a given System Lord is between First Primes (so we have in fact have a partial function)? Or perhaps we would like our data to include all the First Primes a particular System Lord has had (one-to-many), or we’d like to consider all the System Lords a given Jaffa warrior has served as First Prime. In any case, we’ll posit a single relation and two constants

But then how do we express the fact that Teal’c is (or was at some point) a first prime of Apophis? We can work internally in the double category. So, let \mathbb D denote the cartesian equipment generated by this presentation. Recall from above that restrictions are the double categorical way of giving an internal pullback. So, we’ll ask that the accompanying restriction cell factors through (or rather is) the terminal object in the external hom category on 1. That is, form the restriction

Now, say that \top denotes the terminal object of \mathbb D(1,1). The fact that Apophis indeed has (or had) Teal’c as a First Prime is just asking that the proarrow restriction above is the terminal:

in \mathbb D(1,1). Alternatively, to minimize equations, we could ask that there is a cell between these two proarrows with domain \top. You can then imagine a model \mathbb D\to\mathbb{R}\mathbf{el} (essentially just lists of System Lords and their Jaffa) and do all the computations in \mathbb{R}\mathbf{el}. You’d get a pullback of set functions and the two points are equal if, and only if, they’re related under the “first prime” relation. This gives some simplification of the bicategorical development of (Patterson 2017). Functional relationships can be handled there using “maps” (those morphisms in the bicategory having a right adjoint) but this involves imposing some equations that we don’t need here.

7 Operations

By “operations on data,” what I mean is queries that return information such as (1) selecting columns from a table, (2) filtering a table for rows with certain attributes, and (3) joining two tables along a common column. These are all well-known SQL operations and appear in one form or another in the foundational work on relational databases (Codd 1972).

Functional ologs handle these operations with adjoint functors (Spivak 2012). That is, given database scehmas \mathcal C and \mathcal D, together with a translation F\colon \mathcal C\to\mathcal D, there is a “substitution” functor F^*\colon [\mathcal D,\mathbf{Set}] \to [\mathcal C,\mathbf{Set}] and its adjoints, namely,

An example in the reference shows that the substitution functor can do select/projection queries; the right adjoint does inner joins; and the left adjoint produces the somewhat mysterious outer join. I’m going to show here that cartesian equipments and relation-valued double functors can do all this without invoking translations of schemas or left and right Kan extensions. Rather, the structure-preserving double functor does all the munging for us.

7.1 Select

Select queries are handled by taking an extension along projection morphisms. Suppose we have an olog describing various attributes of a mission concept. This is a 4-ary relation:

Of course we could have chosen to associate the factors differently, but it would have amounted mathematically to essentially the same thing. We’ll keep things simple and instance this olog by a conventional multicolumn datatable:

date location purpose team
1998-02-06 P41-771 search & rescue SG3
1998-07-31 Cimmeria assist Cimmerians SG1
1999-01-02 P3R-272 investigate inscriptions SG1
1999-10-22 Netu search & rescue SG1
2004-08-06 Tegalus negotiation SG9

I’ll assume that in the background the types have been instanced, but it’s not really necessary here to say at this point what the data actually is. In any case, selecting columns is simply a matter of extending along a projection. For example, say we want only to remember the dates that teams went on off-world missions but we care about neither the location nor purpose of the missions. In this case, we take an extension along the projection morphisms

Notice that we really need some of the cartesian structure. We’ve already seen that in \mathbb{R}\mathbf{el} the extension is computed by taking an image. That is, a date and team are in the extension relation if, and only if, they are related (of course with two other elements of some 4-tuple) in the original relation. In this case, we end up with a binary relation instanced by the table

date team
1998-02-06 SG3
1998-07-31 SG1
1999-01-02 SG1
1999-10-22 SG1
2004-08-06 SG9

since the model M\colon \mathbb D\to\mathbb{R}\mathbf{el} preserves extensions. There are of course lots of other things you might do. Perhaps you want to see the dates on which locations were visited but don’t care who did it or why. That would just be a different pair of projections. In any case, I’ll take this as sufficient evidence that the select operation can be handled by extension along projection morphisms. The point is to simply use whichever projections return the desired columns.

7.2 Filter

The other operation is filtering data by asking for rows that satisfy a certain property. Usually this is that a particular attribute has a specific value. This is one of the last topics of (Spivak 2012), treated in Section 5.3 in an example using adjoint functors arising from slices of copresheaf toposes. In our double-categorical set up, filtering can be done using restrictions provided we allow ourselves types in the olog that isolate whatever attribute value we’re looking for. Say, for example, we wish to filter the previous data for just the missions assigned to the team SG1. In this case we’d have to add to our olog the name of a global element of the type, namely, an arrow from the terminal

We define the relation “SG1 missions” to be the restriction

The model M\colon \mathbb D\to\mathbb{R}\mathbf{el} then returns the table SG1 Missions corresponding to the domain of the restriction cell:

date location purpose team
1998-07-31 Cimmeria assist Cimmerians SG1
1999-01-02 P3R-272 investigate inscriptions SG1
1999-10-22 Ne’tu search & rescue SG1

At this point if we wanted to select certain columns (say we were interested only in the dates on which SG1 visited certain places) we could use the projection-extension method above to produce a table with just the appropriate columns. This combination would give some version of a select-filter operation of SQL.

7.3 (Inner) Join

This post is already getting long, but I want at least to show what an inner join looks like. This is neat because it really showcases the interplay between the cartesian structure and the equipment structure. Here’s a general procedure for how this works. The idea is basically just to match columns by restricting along a diagonal. That is, given two proarrows, for example,

we can define the inner join of p and q along B to be the domain of the restriction cell

Think of this as giving two tables, namely, p and q with entries from sets A, B, C and D. The tables have entries in one column in common, namely, those from the set B. We take the cartesian product of the relations p\times q and then pull back along the diagonal morphism \Delta on B asking that those projection values are the same while leaving the others alone. An example should help to make this clearer.

These simply express the fact that persons may have expertise consisting of a particular training or facility in a particular skill; and that each person may or may not belong to a particular SGC team. Assume that this data is assigned by a strict double functor M\colon\mathbb D \to \mathbb{R}\mathbf{el} amounting to two tables:

person skill
Hammond command
Kovacek law
Maybourne chicanery
Morrison combat
Rothman archaeology

and

person team
Kovacek SG9
Morrison SG3
Rothman SG11

Assuming that M is cartesian, it preserves local products and more generally inner joins, as defined above. In our example, the inner join along the “person” column is given by the following restriction, namely,

So, to compute the image of the domain relation, compute the cartesian product of the two relations, then take the pullback matching the “person”-argument. We’ll get the following table:

person skill team
Kovacek law SG9
Morrison combat SG3
Rothman archaeology SG11

Notice that there are no duplicated columns and no null entries. Of course, we could have made the example a bit more interesting by taking the tables to be genuine relations (the multifaceted Harold Maybourne has many skills after all), but at this point I just want to illustrate the process.

8 Future Work

So, at this point, I’m left with a couple of items I’d like to mention about further developments. The operations above covered a number of SQL queries. But I’d like to understand better how to do the left adjoint in the functional set-up. This produces something like the outer join of two tables, which has the effect of keeping all the rows of both tables but also inputting Skolemized nulls whenever there isn’t a matching entry. I think this could be done by asking that the cartesian equipments the ologs generate have some colimits, but I’m not sure which ones or how precisely the construction would go.

Finally, I’d like to mention that since Evan and I have begun implementing these ideas in Catlab, I have this vaguely conceived ambition that there might be a way to actually write a query language for data structures built on the pattern of these cartesian equipments. There could be an improvement in efficiency since the current implementation of migration operations in terms of set-valued functors involves limits and colimits over comma categories. There are lots of equations in this set up and, as far as I can tell, what I’ve presented so far in terms of double categories eliminates many of those.

References

Aleiferi, Evangelia. 2018. “Cartesian Double Categories with an Emphasis on Characterizing Spans.” PhD thesis, Dalhousie University. https://arxiv.org/abs/1809.06940.
Carboni, A., and R. C. F. Walters. 1987. “Cartesian Bicategories I.” Journal of Pure and Applied Algebra 49: 11–32. https://doi.org/10.1016/0022-4049(87)90121-6.
Codd, E. F. 1972. “Relational Completeness of Data Base Sublanguages.” Research Report / RJ / IBM / San Jose, California RJ987.
Jacobs, Bart. 1999. Categorical Logic and Type Theory. Studies in Logic and the Foundations of Mathematics. Elsevier.
Kent, Robert E., and David I. Spivak. 2012. “Ologs: A Categorical Framework for Knowledge Representation.” PLoS ONE 7 (1). https://doi.org/10.1371/journal.pone.0024274.
Lambert, Michael. 2021. ‘Double Categories of Relations’.” https://arxiv.org/abs/2107.07621.
Patterson, Evan. 2017. “Knowledge Representation in Bicategories of Relations.” https://arxiv.org/abs/1706.00526v2.
Shulman, Michael. 2008. “Framed Bicategories and Monoidal Fibrations.” Theory and Applications of Categories 20 (18): 650–738.
Spivak, David I. 2012. “Functorial Data Migration.” Information and Computation 217: 31–51. https://doi.org/10.1016/j.ic.2012.05.001.