Memories of writing a parser for man pages
March 23, 2018
I generally enjoy being bored, but sometimes enough is enough. That was the case one Sunday afternoon in 2015, when I decided to start an open source project to overcome my boredom.
Back then, I was familiar with manual pages as a concept and had used them a fair number of times, but that was all I knew. I had no idea how they were generated or whether there was a standard in place. Two years later, here are some thoughts on the matter.
How man pages are written
The first thing that surprised me at the time was the notion that man pages are, at their core, just plain text files stored somewhere in the system (you can check this directory using the
These files contain not only the documentation but also formatting information, using a typesetting system from the 1970s called
troff, and its GNU implementation groff, are programs that process a textual description of a document to produce typeset versions suitable for printing. It's more 'What you describe is what you get' rather than WYSIWYG.
— extracted from troff.org
If you are totally unfamiliar with typesetting formats, you can think of them as Markdown on steroids, but in exchange for the flexibility you have a more complex syntax:
groff files can be written manually, or generated from other formats such as Markdown, LaTeX, HTML, and so on, with many different tools.
Why groff and man pages are tied together has to do with history: the format has mutated over time, and its lineage is a chain of similarly named programs: RUNOFF > roff > nroff > troff > groff.
Moreover, it's worth noting that groff can also call a postprocessor to convert its intermediate output to a final format, which is not necessarily ASCII for terminal display! Some of the supported formats are TeX DVI, HTML, Canon, HP LaserJet4 compatible, PostScript, UTF-8, and many more.
Another cool feature of the format is its extensibility: you can write macros that enhance the basic functionality.
Given the vast history of *nix systems, there are several macro packages that group useful macros together for specific purposes, according to the output you want to generate. Examples of macro packages are man, mdoc, mm, and the list goes on.
Manual pages are conventionally written using man and mdoc.
You can easily distinguish native groff commands from macros by the way standard groff packages capitalize their macro names. For man, each macro's name is uppercased, like .PP, .TH, .SH, etc. For mdoc, only the first letter is uppercased: .Pp, .Dt, .Sh.
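The capitalization rule above can be turned into a rough classifier. This is only a sketch: the function name `classify`, the regular expression, and the assumption that names are one or two characters long are mine, and the rule is a heuristic rather than anything groff itself enforces.

```python
import re

# Assumption: a control line starts with "." followed by a one- or
# two-character request or macro name, as in the examples from the text.
CONTROL_LINE = re.compile(r"^\.\s*([A-Za-z][A-Za-z0-9]?)")


def classify(line: str) -> str:
    """Guess the kind of a roff line from the capitalization of its name."""
    if line.startswith('.\\"'):
        # A line starting with .\" is a comment.
        return "comment"
    match = CONTROL_LINE.match(line)
    if match is None:
        return "text"
    name = match.group(1)
    if name.isupper():
        # Fully uppercased names follow the man convention (.PP, .TH, .SH).
        return "man macro"
    if name[0].isupper():
        # Only the first letter uppercased follows mdoc (.Pp, .Dt, .Sh).
        return "mdoc macro"
    # Lowercase names are assumed to be native groff commands.
    return "groff request"


for sample in [".PP", ".Pp", ".sp 1", "Plain text."]:
    print(sample, "->", classify(sample))
```

A real parser would need the macro tables of each package rather than this guess, since nothing stops a custom macro from breaking the convention.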
Whether you are considering writing your own groff parser or are just curious, these are some of the problems that I found most challenging.
groff has a context-free grammar. Unfortunately, since macros describe opaque bodies of tokens, the set of macros in a package may not itself implement a context-free grammar.
This kept me away (for better or worse) from the parser generators that were available at the time.
Most of the macros in mdoc are callable. This roughly means that macros can be used as arguments of other macros. For example, consider this:
- The macro Fl (Flag) adds a dash to its argument, so Fl s produces -s
- The macro Ar (Argument) provides facilities to define arguments
- The macro Op (Optional) wraps its argument in brackets, as this is the standard idiom to define something as optional.
- The following combination .Op Fl s Ar file produces [-s file] because Op macros can be nested.
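To make the idea of callable macros concrete, here is a minimal sketch of an evaluator that understands only Op, Fl and Ar as described above. The helper names (`take_args`, `render`, `render_line`) are hypothetical, and the real mdoc rules are richer than this (for instance, defaults when a macro gets no arguments are ignored here).

```python
# Assumption: only these three macros are callable in this toy version.
CALLABLE = {"Op", "Fl", "Ar"}


def take_args(tokens, start):
    """Collect plain words from `start` until the next callable macro."""
    end = start
    while end < len(tokens) and tokens[end] not in CALLABLE:
        end += 1
    return tokens[start:end], end


def render(tokens):
    """Render a list of tokens, treating known macro names as calls."""
    parts = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok == "Op":
            # Op takes the rest of the line, which may contain more macros
            # (including another Op), and wraps the result in brackets.
            parts.append("[" + render(tokens[i + 1:]) + "]")
            i = len(tokens)
        elif tok == "Fl":
            # Fl adds a dash to each of its arguments.
            args, i = take_args(tokens, i + 1)
            parts.extend("-" + arg for arg in args)
        elif tok == "Ar":
            # Ar passes its arguments through unchanged.
            args, i = take_args(tokens, i + 1)
            parts.extend(args)
        else:
            parts.append(tok)
            i += 1
    return " ".join(parts)


def render_line(line):
    """Render one mdoc control line such as '.Op Fl s Ar file'."""
    return render(line.lstrip(".").split())


print(render_line(".Op Fl s Ar file"))  # [-s file]
print(render_line(".Op Fl a Op Ar b"))  # [-a [b]]
```

The recursion in the Op branch is what makes nesting work: whatever follows Op is rendered first, macros included, and only then wrapped in brackets.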
Lack of beginner-friendly resources
Something that really confused me was the lack of a canonical, well-defined, and clear source to look at. There's a lot of information on the web, but it assumes so much about the reader that it takes time to grasp.
To wrap up, I will offer you a very short list of macros that I found interesting while developing jroff:
- TH: when writing manual pages with man macros, your first line that is not a comment must be this macro. It accepts five parameters: title section date source manual
- BI: bold alternating with italics (especially useful for function specifications)
- BR: bold alternating with Roman (especially useful for referring to other manual pages)
- .Dd, .Dt, .Os: similar to how man macros require the TH macro, mdoc macros require these three macros, in that particular order. Their initials stand for: Document date, Document title and Operating system.
- .Bl, .It, .El: these three macros are used to create lists. Their names are self-explanatory: Begin list, Item and End list.
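As an illustration of the two prologue rules in the list above, the following sketch parses a TH line into its five parameters and checks that an mdoc page opens with .Dd, .Dt and .Os in order. The names `TitleHeading`, `parse_th` and `has_mdoc_prologue` are made up for this example, and using `shlex` for quoted arguments is an approximation: roff quoting is not the same as shell quoting.

```python
import shlex
from typing import NamedTuple


class TitleHeading(NamedTuple):
    """The five TH parameters; only the title is treated as required here."""
    title: str
    section: str = ""
    date: str = ""
    source: str = ""
    manual: str = ""


def parse_th(line: str) -> TitleHeading:
    """Split a .TH line into its parameters (shell-style quoting assumed)."""
    words = shlex.split(line)
    if len(words) < 2 or words[0] != ".TH":
        raise ValueError("not a .TH line with at least a title")
    return TitleHeading(*words[1:6])


def has_mdoc_prologue(lines) -> bool:
    """Check that the first three non-comment macros are .Dd, .Dt, .Os."""
    macros = [
        line.split()[0]
        for line in lines
        if line.startswith(".") and not line.startswith('.\\"')
    ]
    return macros[:3] == [".Dd", ".Dt", ".Os"]


th = parse_th('.TH LS 1 "March 2018" "GNU coreutils" "User Commands"')
print(th.title, th.section, th.date)

page = ['.\\" comment', ".Dd March 23, 2018", ".Dt LS 1", ".Os", ".Sh NAME"]
print(has_mdoc_prologue(page))
```

Both helpers only look at the shape of the lines. They do not validate the values themselves, such as whether the section is a known manual section.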