Missing Ligatures and How to Find Them
In typography, most glyphs represent a single letter. An exception to this is the ligature, which is a single glyph representing multiple letters. For example, the ampersand (&) is actually a ligature of the letters e and t, spelling et, the Latin word for and.
Ligatures are often used to improve the appearance of text. Consider the following examples:
In some fonts, placing an f in front of an i or l results in near-overlap between the letters, which may not look good. Instead, the two individual letters are replaced by a single glyph that is specifically crafted to connect the letters while looking nice: a ligature. Some common ligatures, in addition to fi and fl, are fj, ff, ffi, and ffl, which I will use for analysis in the remainder of the article.
My interest in ligatures was piqued when I was trying out Python libraries for extracting the raw text from PDF files. I encountered some PDFs that contained ligatures that were not decodable by the text extraction libraries. A minimal example of such a PDF can be found here, and the decoding problem can be easily visualized by copying and pasting the text of the PDF.
I wanted to address the problem somehow, but PDFs are complicated and this thread discouraged me from approaching the problem by trying to improve PDF parsing. Instead, I was inspired by a comment in another forum suggesting that there aren't that many words in English that contain the letters of the common ligatures listed above. Further, those that do exist are generally pretty easy to recover unambiguously, even when the ligature is missing. I wanted to investigate.
To aid in this investigation, I created this library, written in Python. Its main purpose is to take a corpus of words, find the ones containing ligatures, and create a mapping that allows for the reconstruction of the complete word when the ligatures are missing. For example, if I have a word with a missing ligature, such as di?erent, where ? represents a missing ligature, I want to be able to reconstruct the word different. The library I wrote allows one to do this, replacing the missing ligatures in either a single word or an entire body of text (as one may extract from a PDF file).
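To make the idea concrete, here is a minimal sketch of such a mapping. It is not the library's actual code: the names mask_ligatures, build_ligature_map, and restore_word are hypothetical, and I am assuming that each ligature is matched greedily (so ffi counts as one ligature rather than ff followed by i) and that ? stands in for the missing glyph.

```python
import re
from collections import defaultdict

# The common ligatures named above; longer ones come first so that
# "ffi" is matched as one ligature rather than "ff" followed by "i".
LIGATURES = ["ffi", "ffl", "ff", "fi", "fl", "fj"]
LIGATURE_RE = re.compile("|".join(LIGATURES))


def mask_ligatures(word, placeholder="?"):
    """Replace each ligature in `word` with `placeholder`."""
    return LIGATURE_RE.sub(placeholder, word)


def build_ligature_map(words):
    """Map each masked form to the set of words that produce it."""
    mapping = defaultdict(set)
    for word in words:
        masked = mask_ligatures(word)
        if masked != word:  # keep only words that contain a ligature
            mapping[masked].add(word)
    return mapping


def restore_word(masked, mapping):
    """Return the sorted candidate words for a masked form."""
    return sorted(mapping.get(masked, ()))


# A toy corpus, for illustration only.
corpus = ["different", "flat", "fiat", "office", "cat"]
ligature_map = build_ligature_map(corpus)
print(restore_word("di?erent", ligature_map))  # ['different']
print(restore_word("?at", ligature_map))       # ['fiat', 'flat']
```

Returning every candidate, rather than picking one, leaves the decision about ambiguous forms to the caller.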
The library also allowed me to take a look at some statistics related to words containing ligatures. I combined the words from the Natural Language Toolkit's "words" corpus (which is just the word list found on Unix systems), and the list of words in this repository to create a corpus of English words to examine. This gave me a grand total of 473,350 words (including a lot of very uncommon or niche acronyms, abbreviations, and the like). Of those, 16,199 (3.42%) contain one or two of the common ligatures listed above (no words contain three or more).
Crucially, of the 16,199 words containing ligatures, only 244 (1.51% of words with ligatures; 0.05% of all words) became ambiguous when their ligatures were removed. What I mean by ambiguous is that, given the parts of the word that aren't a ligature, and some identifier for where a ligature is supposed to be, more than one word could fit the description. An example is ?at, which can be reconstructed to either flat or fiat. Luckily, almost all such ambiguous words have only one candidate that is at all common, if that. For example, ?ord can match either fjord or fiord, but I can't say I've ever come across the word fiord (until I did this ligature work; it's just an alternate spelling of fjord). These statistics were generated by this Python script.
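The notion of ambiguity described here can be sketched in a few lines. This is an illustration under my own assumptions, not the script mentioned above: ambiguity_stats is a hypothetical name, ligatures are matched greedily (longest first), and a word counts as ambiguous when at least one other word in the corpus shares its masked form.

```python
import re
from collections import defaultdict

# Longest alternatives first, so "ffi" wins over "ff" and "fi".
LIGATURE_RE = re.compile("ffi|ffl|ff|fi|fl|fj")


def ambiguity_stats(words):
    """Return (words containing a ligature, words that become ambiguous)."""
    groups = defaultdict(set)
    for word in set(words):
        masked = LIGATURE_RE.sub("?", word)
        if masked != word:  # the word contains at least one ligature
            groups[masked].add(word)
    with_ligatures = sum(len(group) for group in groups.values())
    # A word is ambiguous if it shares its masked form with another word.
    ambiguous = sum(len(group) for group in groups.values() if len(group) > 1)
    return with_ligatures, ambiguous


# A toy corpus, for illustration only.
toy_corpus = ["flat", "fiat", "fjord", "fiord", "different", "office", "cat"]
n_ligature, n_ambiguous = ambiguity_stats(toy_corpus)
print(n_ligature, n_ambiguous)  # 6 4
```

In the toy corpus, flat/fiat and fjord/fiord collide, while different and office each have a unique masked form.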
The takeaway from this investigation is that it is possible to reconstruct text that is missing ligatures with very high accuracy. In the case of extracting text from PDFs, which motivated me initially, this could be used as a post-processing step if I find myself dealing with a lot of PDFs with missing ligatures.