Parser combinators for the "modern" web

The problem of parsing is a staple in computer science. Wikipedia gives the following definition:

Parsing, syntax analysis, or syntactic analysis is the process of analyzing a string of symbols, either in natural language, computer languages or data structures, conforming to the rules of a formal grammar. The term parsing comes from Latin pars (orationis), meaning part (of speech).

So in a way parsers are programs that impose or discover structure on generally unstructured text data. This comes in handy for researching, formatting or even producing representations of data. In general the input of parser is some textual data, and its output is an Abstract Syntax Tree (AST), a hierarchical representation of the data structured along the rules.

In this post I will research a most elegant way to solve this problem: Parser combinators. The people of formal computer science education must be well acquitted with LL, LR or other traditional parsers. However parser combinators are a novel way to quickly put together somewhat efficient parsers. A parser combinator is a function that accepts several parsers and returns a new one. This will become much more evident when we dive into the code, but the main idea is that we write simple parsers for the basic elements of our "grammar" and then combine them to parse complex structures.

No method is of importance if it doesn't solve interesting problems. So our problem for this post is to define a parser that will create an AST for image-board posts. To outline the problem better we want to parse the following elements in a user submitted post:

Bold: words between "*".
Replies: series of digits that start with ">>".
Quotes: lines that start with ">".
Links: in the markdown form of "[description](link)".
Simple text: everything that's not anything of the above.

Given that this is our grammar and also that we want to parse the input from a web form, we will use js, in order to send nice structured elements to our back-end. Since the parser is quite efficient to work on the front-end we can save the compiled html in our database.

The parser outlined here is based on the research to solve the above problem for the comments-imageboard for static blogs I'm developing under the name discordia-chan. So while it is a mostly experimental code, I hope that at the same time will be good enough for small scale production code.

To help with development we will use the arcsecond js library, that provides most of the utils we need to define and combine plain parsers into the final one. The development of a minimal library is also possible if we want to maximize performance or eliminate external dependencies. The structure of our parser will be similar to the one used by pandoc, the "killer app" of the Haskell world and one of the best software showcasing the power of parser combinators. For more info on the idea of parser combinators you can check out parsec, the library behind pandoc.

starting out simple

To use arcsecond we first have to install it, canonically with:

npm install arcsecond

Let's use ES6 imports to get some helper functions to start working with. These functions are mostly parsers that take other parsers as inputs. If this sounds complicated, I hope it will become more clear later on.

import { sequenceOf, many1, choice, letters, digits, optionalWhitespace,
         str, between, char, many, regex, sepBy, anythingExcept } from "arcsecond";

Remember that to run your program with nodejs, using ES6 imports, you have to use the esm module like so (you may have to install it first):

node -r esm src/utils/parseTextarea.js

So let's start working on our grammar requirements. Bold text is to be enclosed in * so we want to parse text inside of *. This can be easily achieved by using the parser combinator between.

const betweenStars = between (char('*')) (char('*'));
const boldParser = betweenStars (many(wordParser));

console.log(boldParser.run("*this is some bold text*"));

Running this script in nodejs will give us the following result:

{
  isError: false,
  result: [ 'this ', 'is ', 'some ', 'bold ', 'text' ],
  index: 24,
  data: null
}

So we have succesfully parsed the words that need to be bold into an array of results. The results are in an array because each of them is parsed by wordParser and then boldParser checks for many of them between "*". This way we combine the wordParser and the betweenStars parser into a more powerful one. Still in the greater concept this is not yet that useful since we need to add some information about what we parsed, in order to build our AST.

const boldParser = betweenStars (many(wordParser)).map (results => ({
    type: 'boldText',
    value: results.join ('')
}));

In the second implementation of boldParser we are now giving a type to our data. This way we keep the imprortant values, as well as the structure information needed for representing it later. Note that map works on the results array by default. There are also ways to manipulate the data field for stateful parsers, but these will be out of the context of this post.

{
  isError: false,
  result: { type: 'boldText', value: 'this is some bold text' },
  index: 24,
  data: null
}

Those with keen eyes will have noticed that I have used the parser wordParser without introduction or definition. I will introduce it later, but as of now think of it as a parser for everything except special symbols in our text.

With these in mind and the use of of the various helpers from arcsecond we can define the rest cases of our grammar. We can also define two helper functions join and output, that we can use to map over our results.

const join = array => array.join ('');

// notice the use of currying for easier application with map
const output = type => array => ({
  type: type,
  value: array
  });

const replyParser = sequenceOf([
    str('>>'),
    digits
]).map (results => output ('reply') (results));

const quoteParser = sequenceOf([
    char('>'),
    optionalWhitespace,
    wordParser
]).map (output ('quote'));

const betweenParens = between (char('(')) (char(')'));
const betweenBrackets = between (char('[')) (char(']'));
const urlParser = sequenceOf([
    choice([
        str('https://'),
        str('http://'),
        str('/')
    ]),
    letters,
    char('.'),
    letters
]).map(join);
const linkParser = sequenceOf([
    betweenBrackets(wordParser),
    betweenParens(urlParser)
]).map (results => ({
    type: 'link',
    url: results [1],
    desc: results [0]
}));

const plainParser = wordParser.map (output ('plain'));

the elusive wordParser

The (probalby misnamed) wordParser that is the backbone of many of the afformentioned parsers is assumed that it works likeso: it consumes anything, except the special symbols of our grammar. This is an issue that I haven't completely solved, since I've not yet decided if I want to allow the out of context use of these symbols. Naively the wordParser can be implemented as below:

const wordParser = many1 (anythingExcept (regex (/^(\*|\]|\)|\n|>|\(|\[)/))).map(join);

Two things are of notice here. Fist the very powerful regex parser. Second the problems that arise by the implementation. If the symbols { *, [, ], >, (, )} happen to appear in other uses apart from their structural meaning, the rest of the text will be lost and not parsed.

This can be solved by implementing escapes, though I'm not sure this would be useful for inputing comments on a web application.

the power of combination

With all our basic parsers in place we can now define what a paragraph is. For short texts, like the ones we deal with hear, it's enough to assume that a paragraph is denoted by a newline. Our work is mostly done, we just have to combine our work:

const paragraphParser = many (choice ([
    replyParser,
    boldParser,
    urlParser,
    quoteParser,
    linkParser,
    plainParser
])).map (output ('paragraph'));

export const inputParser = sepBy (char ('\n')) (paragraphParser);

So a paragraph is defined as many runs of a choice of parsers. We run the paragraphParser on every body of text that is separated from any other with a newline and done! The result is a tree as the following example:

[{ type: 'paragraph', value: [
    { type: 'plain', value: "This is some text" },
    { type: 'bold', value: "and some bold text." }
]},
 { type: 'paragraph', value: [
     { type: 'quote' value: "> We quote someone."}
 ]}
]

Our AST is ready for any transformation we want to impose on it!

conclusion

The parser presented here is surely a naive implementation, and barely touches the surface of what parser combinators can do. The case of recursive parsers as left as an exploration for the reader! However with few lines of code and a quite transparent logic we can impose our structure to any kinds of text.

That's all for now! See you next time.