The many facets of Networked Mathematics
Last year we started a discussion on the production of mathematics, the pre-industrial kind of process that we all still follow of solving problems and taking the solutions to the market of ideas (conferences, seminars, blog posts, twitter, coffee breaks in the department, etc.) as drafts/preprints/submissions and all the difficulties that this process involves. In particular we discussed the difficulties of checking whether the work had been done before or not, and the difficulties of searching for other, similar work that might help us with a given question. In this blog post we give an update on our project.
Last year we started a discussion on the production of mathematics, the pre-industrial kind of process that we all still follow of solving problems and taking the solutions to the market of ideas (conferences, seminars, blog posts, twitter, coffee breaks in the department, etc.) as drafts/preprints/submissions and all the difficulties that this process involves. In particular we discussed the difficulties of checking whether the work had been done before or not, and the difficulties of searching for other, similar work that might help us with a given question.
As a group we settle on a path of trying to apply the new tools of NLP and knowledge representation (neural nets, word embeddings, transformers, ontologies, semantic web tools, etc.) to what might be called mathematical English. We reasoned that, yes eventually we will need to get to the semantics of equations and diagrams, but it would be useful to know how much we can learn simply from using the tools of NLP to extract mathematical concepts from abstracts of articles.
We also decided that Category Theory would be our prototype domain. First because category theory is about the organization of Mathematics in general, discovering the backbones of the subject and the hidden similarities between, in principle, unrelated subareas. But secondly because we are category theorists, after all. So this is the domain in which we are domain experts.
We realized early on that corpora of mathematical text were needed and we hoped this wouldn’t be a problem. Category theory was, after all, one of the first areas of Mathematics to have a free open-source journal, namely TAC (Theory and Applications of Categories) in 1995. But we need much more data than simply one journal. While we also have the nLab as a beautiful mathematical research open resource and the whole of the Wikipedia Math Portal, even the licensing of the arXiv is more problematic than we knew.
As we explained in the earlier blog post, we have chosen to work with a corpus of TAC abstracts and one (snapshot) of the nLab from the beginning of 2021. The TAC abstracts are clean, with not so much LaTex thrown in, and form a small corpus of 3.2K sentences. The nLab is a bit more noisy, one needs to remove the wiki markup and sections and headings get lost, a traditional nightmare of processing scientific text.
We now have spaCy processing of all this data, as well as a collection of experiments on term extraction, using different tools: TextRank, OpenTapioca, DyGIE++ and Parmenides, a special tool of our NIST collaborators. There are plenty of interesting observations to make about the processing and the comparison between these tools and we are preparing a write-up about it. However the major problem with domain specific NLP strikes again: we have no golden standard for mathematical concepts in mathematical texts, much less for category theory concepts. Our predicament is not only ours: many others have the same issue and have tried to make do with not guaranteed good results. Meanwhile we have been working on the idea that formalized mathematics will eventually be the way of producing mathematics.
To discuss the issues of formalization we have had in the Topos Colloquium talks by several people involved with automated deduction: Jeremy Avigad, Kevin Buzzard, Larry Paulson, Florian Rabe, for example. We hope to have more talks this year. Finally, we’re happy to be helping to organize a Hausdorff Institute of Mathematics trimester on “Prospects of Formal Mathematics” in 2024.