Declarative Models and Collaborative Modeling
1 Models vs. Analysis Recipes
In the early days of computing, programs were written to automate mathematical calculations that required a great deal of arithmetic. For example:
- numerical differential equation solving (application: simulations in systems biology)
- Fourier transforms (application: analyzing and manipulating audio signals)
- linear regression (application: estimating parameters in an economic model)
- constrained optimization (application: supply chain management)
These kinds of programs answer questions that arise from mathematical models of the world. Moreover, the results of those programs must be interpreted in light of the model that was used to formulate the questions asked of them.
One quite basic example of how this plays out is that typically computers don’t have a built-in notion of units. Computer programs work in terms of numbers. This means that at the end of a computation, when the computer spits out “5.7”, you have to be ready to interpret that as, say, 5.7 meters.
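To make this concrete, here is a minimal Julia sketch; the function, constants, and numbers are made up for illustration, and the point is only that the units live outside the program:

# Hypothetical example: the computer only ever sees bare numbers.
g = 9.81                              # by convention m/s^2, but nothing enforces this
fall_distance(t) = 0.5 * g * t^2

println(fall_distance(1.08))          # prints roughly 5.7; the "meters" lives in our heads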
A more subtle example: in rocketry, one might assume that a fin is perfectly rigid when formulating the differential equations. There might be good reasons for making this assumption: it simplifies things, and it is mostly true. But it is important to remember that you made this assumption when you interpret the numerical results of your computation. Namely, you would want to redo the computation if that assumption is later found not to hold, that is, if the floppiness of the fin turns out to be significant.
We might call computer programs used in this way “analysis recipes.” That is, such a program is a machine-executable recipe for performing some kind of analysis related to a model. This is reflected in how Jupyter notebooks work, for instance: a Jupyter notebook is a sequence of blocks of code which are (supposed to be) executed in order.
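For instance, here is a toy analysis recipe in Julia; the model (exponential decay with rate k) and all the numbers are invented for illustration:

# Step A: set up parameters. The model itself (exponential decay)
# lives in our heads and in this comment, not in the code.
k = 0.3          # decay rate
x = 100.0        # initial quantity
dt = 0.01        # time step

# Step B: numerically integrate dx/dt = -k*x with Euler steps.
for _ in 1:1000
    global x += dt * (-k * x)
end

# Step C: report a number. Interpreting it is up to the reader.
println(x)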
However, and this is a key point, the analysis recipe is not the model. Generally, the model is not communicated formally: it is communicated via documentation (diagrams, equations, prose, etc.) distributed along with the analysis recipe in the form of papers or source code comments. There is a growing trend in science that “analyses should be reproducible,” meaning that scientific papers provide links to the data and the analysis recipe used to reach their conclusions. But better still would be for the model itself to be formally reusable. What exactly this means is the subject of the next section.
2 Declarative Modeling
Over the last half-century of progress in computer science, we have developed algorithms that can reason logically and symbolically rather than just numerically. This unlocks the capability to work with a computer in a different way than via “analysis recipes.”
Specifically, instead of saying “do A, then do B, then do C; what’s the result?”, we can say “X is true, Y is true, Z is true; according to the knowledge I’ve given you, is W also true?”
There have been many programming languages which work in this second paradigm. One is Modelica, another is Prolog.
Modelica is a language that allows one to specify a cyber-physical system by writing down a bunch of variables and then writing equations that involve the variables and their time derivatives. One can then ask the system to solve the implicit differential equations that result from all of these variables and equations. Crucially, one does not have to write equations in the form
\frac{dx}{dt} = v(x,y,z)
Instead, one can write things like V = IR (the equation for a resistor), and then later down the line it might turn out that I = \frac{dQ}{dt} (current is the derivative of charge). But the resistor model can stand alone, consisting of just that one equation.
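To give a flavor of this in Julia rather than Modelica, here is a minimal sketch using ModelingToolkit.jl, a Julia library in the same equation-based spirit; the variable names are ours, and we stop at writing down the equations, leaving it to the library to assemble and solve the resulting implicit system:

using ModelingToolkit

@variables t V(t) I(t) Q(t)
@parameters R
D = Differential(t)

# The resistor is just its own law; nothing here mentions derivatives.
resistor_eq = V ~ I * R

# Elsewhere, it may turn out that current is the derivative of charge.
charge_eq = I ~ D(Q)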
Similarly, Prolog is a language which allows one to write down facts and rules for deriving facts from other facts, and then to ask which facts can be derived. For example:
parent(alice, bob).
parent(bob, carol).
ancestor(X, Y) :- parent(X, Y).
ancestor(X, Z) :- parent(X, Y), ancestor(Y, Z).
?- ancestor(X, carol).
This should produce X = bob and then X = alice, because both Bob and Alice are ancestors of Carol.
Declarative modeling allows one to make more assumptions explicit, because essentially each “declaration” is an assumption. However, it is important to emphasize that there are still plenty of assumptions that are left implicit. For instance, interpreting the Prolog program that I just wrote involves assuming that there are actually people in the real world named Alice, Bob, and Carol (or at least code-named Alice, Bob, and Carol), and that Alice actually is Bob’s parent, etc. As long as humans are trying to get answers about the world out of computers, there will always be a layer of informal reasoning needed to interpret the flashing lights that the computer produces. However, declarative models can allow some assumptions to be captured explicitly, rather than implicitly as would be the case in an “analysis recipe.”
At Topos and with our collaborators, we have built a lot of tools and theory for declarative models using category theory and Julia in the AlgebraicJulia project. In AlgebraicJulia, we focused a lot on compositionality via graphical languages such as directed and undirected wiring diagrams.
For instance, in this documentation for the AlgebraicPetri project we show how to build Petri nets for variants of the Lotka-Volterra model compositionally using wiring diagrams. In this documentation for AlgebraicDynamics, we do compositional malaria modeling. The Decapodes documentation has many neat examples of compositional PDE modeling, including some fun gifs.
In order to make compositional modeling with graphical languages more accessible, we built Semagrams, which is a generic toolkit for building diagrammatic UIs for models. You can try out Semagrams at the dwd demo, or the Petri net demo (keybinds for the Petri net editor).
However, in the process of building Semagrams, I noticed that there were a lot of building blocks that needed to be put in place before it could really work. Specifically, version control is essential once things scale to a certain size. But version controlling the JSON output of Semagrams via git would be a non-starter, because one would have to be intimately familiar with the Semagrams internals in order to resolve merge conflicts. That leads us to our next section.
3 Collaborative Modeling
In the real world, scientific and engineering projects are undertaken as collaborations between many people. In recognition of this, we at Topos are trying to build tools for people to collaborate on declarative scientific models: see CatColab.
CatColab allows several people to work together in a Google-Docs-style structure editor for category-theoretic models; we started there because it was the easiest thing to build first. However, the Google Docs model is slightly troublesome, because different people have different assumptions and may not realize it. Thus, when collaborating on a scientific model, it is important to track who is making what assumptions, and ideally to capture why they think those assumptions are valid in a certain context. Furthermore, it is important to track how people change their assumptions over time.
One should be able to look at an analysis produced by a model and ask “what are the specific assumptions that went into this particular result, who made them, what did they write in justification, and how much would it change the analysis if those assumptions were different?” This is an incredibly important series of questions to ask when scientific models are used in settings that could impact the lives of many people.
In order to do this, we have to rethink how version control works. Currently, version control operates on text files, tracking the insertion and deletion of lines. What we need is to track changes to declarative models in a more structured way: for instance, “adding a variable to a model” or “changing a constant from 3.0 to 5.0.” Then we can look at the history and see something like “Eve changed this constant from 3.0 to 5.0 on Thursday because she got new measurements in from the lab (link to the lab log in which the measurements were recorded).”
Of course, in order to represent changes to a model in a structured way, we also have to represent the models themselves in a structured way.
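As a very rough sketch of what such structured patches might look like, here is some Julia with made-up names and data structures (this is not CatColab’s actual representation, just an illustration of the idea):

# One kind of change: adjusting a named constant.
struct ChangeConstant
    variable::Symbol
    old::Float64
    new::Float64
end

# Another kind of change: adding a variable to the model.
struct AddVariable
    variable::Symbol
end

const ModelPatch = Union{ChangeConstant, AddVariable}

# A commit records who changed what, and why.
struct Commit
    author::String
    date::String
    justification::String        # e.g. a link to a lab log
    changes::Vector{ModelPatch}
end

eve = Commit("Eve", "Thursday",
             "new measurements in from the lab",
             [ChangeConstant(:k, 3.0, 5.0)])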
We at Topos are working on all of the puzzle pieces to make this work, from mathematical theories of declarative modeling down to data structures for the version control.
4 What’s next
There has been some research on version control from a mathematical perspective (Mimram and Di Giusto (2013), Angiuli et al. (2016)), and there has also been research on patches in the context of incremental computing (Alvarez-Picallo (2020)). What I’m currently working on is getting an overview of the field, including both the mathematics and the history of implementation in version control and incremental computing, so that my next steps are well-rooted in historical understanding (as, after all, this is not a new problem, even if it is somewhat overlooked).
Then, I’m excited to work with David Jaz on a variety of type theoretic approaches to version control.
First of all, in order to version control elements of a type, they must be serializable, and they should be serializable across different programming languages. This rules out function types. However, I believe that if we introduce a concept of “finite type,” then functions out of finite types are just arrays, and thus serializable! The theory here has to be worked out in full, but the objective would be an expressive type theory whose types always have serializable elements, and which can express both “database-like” data (ACSets) and algebraic data types.
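As a small illustration of the idea, in Julia with made-up names: a function out of a finite type can be stored as a plain array, which serializes easily, while a function out of an infinite type cannot:

# The finite type Bool has two elements; index them 1 => false, 2 => true.
# The function `not` then becomes a lookup table:
not_table = [true, false]          # not(false) = true, not(true) = false

apply(table, b::Bool) = table[b ? 2 : 1]

apply(not_table, false)            # returns true

# By contrast, a function out of Float64 has no finite table, which is
# one reason general function types resist serialization.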
Then, we want to try to interpret the bridge types from Higher Observational Type Theory as representing patches. If this is successful, we should be able to write down types in Narya and derive their “diff structure” automatically. This would provide a way to test out ideas about diffs quickly, without having to write our own compiler (though we will probably end up writing our own compiler eventually!).
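To give a flavor of what “diff structure” might mean, here is a speculative Julia sketch, hand-written rather than derived (the hope is that the type theory would produce this kind of thing automatically): the patch type of a record could be a record of optional per-field patches:

struct Point
    x::Float64
    y::Float64
end

# Hypothetical derived patch type: one optional change per field.
struct PointPatch
    x::Union{Nothing, Float64}     # nothing means "leave x alone"
    y::Union{Nothing, Float64}
end

apply_patch(p::Point, d::PointPatch) =
    Point(isnothing(d.x) ? p.x : d.x,
          isnothing(d.y) ? p.y : d.y)

apply_patch(Point(1.0, 2.0), PointPatch(nothing, 5.0))   # Point(1.0, 5.0)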
If any of this sounds interesting to you, we are hiring for a postdoc to work in this area! If you have previous experience in any of:
- distributed version control
- database theory
- type theory
- homotopy type theory
- programming language theory
- incremental computing
- cryptographic protocols and hash-based data structures
you might be a good candidate: see here for more details and instructions on how to apply.