Introducing the MathFoldr Project

Published: 2021-07-11

Abstract

At Topos we believe knowledge empowers people, and that our community’s expertise should be available to all who seek it out. But simple availability is not enough. True access is more than an open door: it’s clear, legible street signs, elevators, and gently sloping on-ramps. And with modern AI and natural language processing tools, we believe it’s beyond time to build these accessibility tools for science and mathematics. This blog post provides an overview of our nascent MathFoldr project, sharing our dreams and our approach so far, and even, at the end, a chance for you to get involved!

A telegram tape

How do we encode, connect, and share knowledge in the 21st century?

At Topos we believe knowledge empowers people, and that our community’s expertise should be available to all who seek it out. This is why, for example, we broadcast most of our seminars live on YouTube, and actively support numerous open publishing projects, such as the journal Compositionality, the nLab community wiki, or simply making our books freely available online.

But simple availability is not enough. True access is more than an open door: it’s clear, legible street signs, elevators, and gently sloping on-ramps. And with modern AI and natural language processing tools, we believe it’s beyond time to build these accessibility tools for science and mathematics.

This blog post provides an overview of our nascent MathFoldr project, sharing our dreams and our approach so far and, at the end, a way for you to help with a concrete, five-minute activity!

1 How do we organise mathematics?

A cornerstone of accessibility is search, and math is not easy to search. A striking, recent example comes from Quanta Magazine, November 2019. A group of physicists discovered a useful identity relating eigenvectors and eigenvalues, and did not know if it was novel. To check, they emailed a number of mathematicians, including Fields Medallist Terence Tao. Despite believing the result was “so short and simple—it should have been in textbooks already”, Tao had not previously heard of it. This led to a paper submitted for publication and, soon after, the article in Quanta. In the weeks after the story emerged, more than three dozen previously published instances of the result were reported, dating back to 1934. How can it be that even eminent mathematicians cannot find a widely published, basic result within their field of expertise?

The simple answer is that the mathematical literature has grown far too vast for even an expert to keep track of it all. A recent analysis finds over 120,000 math papers published in 2017 alone, with annual output growing exponentially at 3% a year.

E. Dunne, “Looking at the mathematics literature”, Notices of the American Mathematical Society, vol. 66, no. 2, pp. 227–230, 2019. ams.org/journals/notices/201902/rnoti-p227.pdf
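To get a feel for what 3% compound growth means, here is a small back-of-the-envelope sketch. It assumes the growth rate stays constant, which is our simplification for illustration and not a claim made in the analysis cited above; the function names are ours.

```python
import math

PAPERS_2017 = 120_000   # figure quoted above
ANNUAL_GROWTH = 0.03    # 3% a year, as quoted above


def projected_papers(years_after_2017: int) -> float:
    """Papers per year after compounding growth (assumes a constant rate)."""
    return PAPERS_2017 * (1 + ANNUAL_GROWTH) ** years_after_2017


def doubling_time_years(rate: float = ANNUAL_GROWTH) -> float:
    """Years for annual output to double at a constant growth rate."""
    return math.log(2) / math.log(1 + rate)


if __name__ == "__main__":
    print(f"Doubling time: {doubling_time_years():.1f} years")
    print(f"Projected papers ten years on: {projected_papers(10):,.0f}")
```

Under that assumption, the annual output doubles roughly every 23 years.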

Our infrastructure for organizing and communicating these results has not kept up. The ramifications are significant: wasted search time, duplication of research, and missed connections between fields.

2 The MathFoldr vision and path so far

We seek to address this with our MathFoldr project, part of our Networked Mathematics theme. MathFoldr will provide search and literature curation tools that will make mathematics more accessible, with the ultimate goal of transforming the way mathematics is created and navigated.

Today mathematics is rather artisanal: mathematicians craft PDFs of new knowledge, and share them by posting them on websites and advertising them in talks. Many recent technologies, from Google, to GitHub, to new materials discovered by applying NLP to the materials science literature, show the potential for something much more efficient and effective.

Our strategy begins with improving the organization and dissemination of math via NLP-powered tools, such as search engines, knowledge graphs, and glossaries, as an entryway to shifting publication practices towards ever more formal representations. So the first task is to build ontologies, and to get the community excited and involved through good UI and visualizations.

Right now, we’re doing pilot studies with two corpora, both available on our GitHub: the nLab wiki and the abstracts of the journal Theory and Applications of Categories (TAC).

These pilot studies aim to create a synthesis of machine- and community-driven methods of extracting and curating an ontology of categorical concepts, which will then be maintained via WikiData. From this ontology, we will then build tools to do concept recognition and other semantic processing tasks. A prototype tool, led by Antonin Delpeuch, is nLab.OpenTapioca.

A screen capture from nLab.OpenTapioca. nLab.OpenTapioca takes text and annotates it by identifying categorical concepts. The concepts are WikiData entities, and each is linked to an nLab article.
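To illustrate what "annotating text with concepts" means at its simplest, here is a toy dictionary-based tagger. It is only a sketch: it is not how nLab.OpenTapioca works internally, the mini-ontology is hypothetical, and the entity IDs are placeholders rather than real WikiData identifiers.

```python
import re
from typing import NamedTuple


class Annotation(NamedTuple):
    start: int       # character offset where the mention begins
    end: int         # character offset just past the mention
    text: str        # the matched surface form
    entity_id: str   # placeholder ID standing in for an ontology entry


# Hypothetical mini-ontology: lower-case label -> placeholder entity ID.
TOY_ONTOLOGY = {
    "functor": "concept:functor",
    "adjoint functor": "concept:adjoint-functor",
    "natural transformation": "concept:natural-transformation",
    "category": "concept:category",
}


def annotate(text: str, ontology: dict[str, str] = TOY_ONTOLOGY) -> list[Annotation]:
    """Find whole-word, case-insensitive mentions of ontology labels."""
    # Longer labels go first so "adjoint functor" wins over "functor".
    labels = sorted(ontology, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(label) for label in labels) + r")\b",
        flags=re.IGNORECASE,
    )
    return [
        Annotation(m.start(), m.end(), m.group(0), ontology[m.group(0).lower()])
        for m in pattern.finditer(text)
    ]


if __name__ == "__main__":
    sample = "A left adjoint functor is a functor, and a natural transformation relates two of them."
    for annotation in annotate(sample):
        print(annotation)
```

Exact string matching like this misses plurals, synonyms, and ambiguous terms, which is part of why a curated ontology and a proper entity linker are worth the effort.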

To extract ontologies, we’ve been collaborating with Jacob Collard and Eswaran Subrahmanian at NIST (the US National Institute of Standards and Technology), who have built an exciting pipeline that preprocesses the text with spaCy, and then uses a root- and rule-based linguistics method (R&R) to extract concepts. You can navigate the results with their Parmesan interface:

A screen capture from Parmesan TAC, which displays subject-verb-object triples automatically extracted from the TAC corpus using the R&R method.
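For readers unfamiliar with subject-verb-object extraction, the following sketch shows one very simple rule applied to a dependency parse. It is not the R&R method itself. The `Token` class and `extract_svo` function are our own illustrative names, and the parse is written by hand (using `nsubj` and `dobj`, the dependency labels found in spaCy's English models) so that the example runs without any model download.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    text: str
    dep: str    # dependency label, e.g. "nsubj", "dobj", "ROOT"
    head: int   # index of the governing token within the sentence


def extract_svo(tokens: list[Token]) -> list[tuple[str, str, str]]:
    """Pair each nominal subject with every direct object of the same verb."""
    triples = []
    for subj in tokens:
        if subj.dep != "nsubj":
            continue
        verb_index = subj.head
        for obj in tokens:
            if obj.dep == "dobj" and obj.head == verb_index:
                triples.append((subj.text, tokens[verb_index].text, obj.text))
    return triples


# Hand-written parse of "Every functor preserves identity morphisms."
SENTENCE = [
    Token("Every", "det", 1),
    Token("functor", "nsubj", 2),
    Token("preserves", "ROOT", 2),
    Token("identity", "compound", 4),
    Token("morphisms", "dobj", 2),
    Token(".", "punct", 2),
]

if __name__ == "__main__":
    print(extract_svo(SENTENCE))
```

A real pipeline would obtain the parse from a tool such as spaCy and would need further rules to recover multi-word concepts like "identity morphisms" rather than just the head noun.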

3 What’s next?

At present, we’re thinking about how to further clean these corpora, and refine the R&R methods to extract more accurate and precise concepts.

Simultaneously, we’re also thinking about the word embedding models that these toolkits rely on, and that we could also use on their own to refine search and other methods. A central problem is that mathematical text is quite different from standard newswire English, and so, as with any domain-specific text, we’re seeing a number of processing errors. Can we tune these models better for mathematical text?

A visualisation of a word-embedding semantic model for category theory concepts, trained on the nLab corpus. The model extracts, from analysis of statistical patterns in text alone, the strong similarity between schemes, varieties, and manifolds.
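"Similarity" in a word-embedding model is usually measured by the cosine of the angle between two word vectors. The sketch below shows that computation on made-up three-dimensional vectors; the numbers are invented for illustration and are not taken from the model trained on the nLab corpus.

```python
import math


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Invented toy vectors; real embeddings have hundreds of dimensions.
TOY_EMBEDDINGS = {
    "scheme": [0.9, 0.8, 0.1],
    "variety": [0.8, 0.9, 0.2],
    "manifold": [0.7, 0.6, 0.3],
    "monad": [0.1, 0.2, 0.9],
}


def nearest(word: str, embeddings: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return the k other words whose vectors are most similar to `word`."""
    others = [w for w in embeddings if w != word]
    others.sort(
        key=lambda w: cosine_similarity(embeddings[word], embeddings[w]),
        reverse=True,
    )
    return others[:k]


if __name__ == "__main__":
    print(nearest("scheme", TOY_EMBEDDINGS))
```

With these toy vectors, "variety" and "manifold" rank closest to "scheme", mirroring the kind of cluster visible in the figure.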

But ultimately, as with any data-driven enterprise, the quality of the output depends crucially on the quality of the input. And so throughout all of this, we’re working to improve our corpora, to more accurately capture the expertise and intuitions of mathematicians. Here, we have a request for you: please contribute your expertise and intuitions.

More precisely, we’d love some help identifying concepts in abstracts from Theory and Applications of Categories, to produce what’s known as an “annotated corpus”, which we will share openly for NLP experiments to benefit the scientific community.

Contributing

To contribute, just choose your favorite TAC abstract, and click this link!