
PDF cannot be tokenized (fixing rsc/pdf)

This is the first article in a series on Personal Expenses With Go.

TL;DR

A Modern Go PDF library

A few days after I wrote this article, someone pointed me to pdfcpu. If you landed here looking for a solution to parsing PDF in Go, then maybe give that a try.

I forked & patched rsc/pdf, the main Go PDF library, to support right-delimited base85 literals, which are found in many modern PDF files. Below, I give a brief description of the PDF format, then walk through debugging and patching the Go PDF library. I conclude that it probably isn’t possible to tokenize PDF, and so we can’t really cleanly fix Go’s library without a proper rewrite.

Edited on 2022-03-30 to clarify the bug in rsc/pdf and to mention pdfcpu - a modern PDF library that someone pointed me to a few days after this article was written.

Motivations

Recently, I had the modest goal of graphing my personal expenses. Europe in 2022 is virtually cashless. (Except for Germany, where technologies invented after internal combustion are viewed as witchcraft. Happily, I live in Switzerland.) Doing this should be easy, right? Just parse my credit card statements and make a big spreadsheet, conducive to analysis.

Turns out: no. The credit card statements are PDFs, and trying to parse the PDFs crashed the PDF libraries I tried to throw at them. I did eventually find success with some of the more battle-tested implementations, such as pdftotext, but by then I wanted to see if I could figure out and fix the venerable Go PDF library by rsc.

While the (hacky, btw) fix I ended up with is just a few lines, I learned a few things about PDFs that surprised me, so I decided to write them down for future reference.

A (Very) Brief Overview of PDF

First, a quick overview of what a PDF file looks like. Informally, the PDF spec consists of three layers:

  1. A binary file format for storing arbitrary blobs, optimized for fast lookup.
  2. A general purpose data format sometimes called COS, similar to JSON or XML. All the text, images, fonts and other elements in a PDF are stored as COS.
  3. A simplified variant of the PostScript language used to render the document, including interactive elements like forms.

I find that thinking of PDF in these layers is a useful abstraction, because the first two layers are simple, while the third layer is complex and, as it turns out, mostly irrelevant if you just want to extract text. But it’s important to keep in mind that the specification itself does not make a similar distinction. In fact, the spec makes only minimal references to PostScript and mentions COS not at all.

I won’t get into the last layer, which is where PDF really gets its reputation for complexity. The first two layers, on the other hand, are both simple and important to understand, so let’s quickly look at them.

I. File Structure

A PDF file is a generic container for blobs. It’s optimized for random access and iterative updates. It’s made up of four sections.

  1. The Header, which consists of the string %PDF-X.Y with X.Y being the version number.
  2. A flat buffer called Body, which contains arbitrary objects (“object” here means blob, like “heap object”).
  3. The Cross-reference table, which lists the file offsets in the Body where objects begin, along with their IDs and whether each object is alive or free for reuse.
  4. The Trailer, which specifies the offset of the Cross-reference table, along with the ID of the root object and some other metadata. (A minimal example of both follows this list.)
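
To make items 3 and 4 concrete, here is a minimal, hand-written sketch of what the end of a PDF file looks like. (The offsets are invented; a real Cross-reference table has one fixed-width entry per object.)

xref
0 3
0000000000 65535 f
0000000017 00000 n
0000000081 00000 n
trailer
<<
/Size 3
/Root 1 0 R
>>
startxref
123
%%EOF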

To parse a PDF, the reader finds the Trailer at the end of the file, reads the Cross-reference table, then reads the root object, which contains direct or indirect references to all the other objects.
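
As a concrete illustration of that first step, here is a rough sketch of how a reader could locate the Cross-reference table. This is not code from rsc/pdf; findStartXref and the pdfsketch package name are made up, and a real reader would only inspect the last kilobyte or so of the file rather than loading all of it.

package pdfsketch

import (
	"bytes"
	"fmt"
	"os"
	"strconv"
)

// findStartXref returns the byte offset of the Cross-reference table by
// scanning for the "startxref" keyword near the end of the file.
func findStartXref(path string) (int64, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return 0, err
	}
	i := bytes.LastIndex(data, []byte("startxref"))
	if i < 0 {
		return 0, fmt.Errorf("no startxref keyword found")
	}
	// The offset is the decimal number on the line following "startxref".
	fields := bytes.Fields(data[i+len("startxref"):])
	if len(fields) == 0 {
		return 0, fmt.Errorf("startxref not followed by an offset")
	}
	return strconv.ParseInt(string(fields[0]), 10, 64)
}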

II. The COS Data Format

All objects in a PDF are stored serialized in the COS format.

COS is an old serialization format. It’s similar to JSON, but actually predates it by decades. The PDF spec doesn’t officially mention COS, but I would bet that any conforming COS parser could parse 99% of PDF objects without problems.

The language consists of the usual scalar types like integers and strings, composite types like arrays and dictionaries, and also some special types. It’s best understood from an example. (Explanations in comments.)

% This is a comment, starting from the % sign and through the end of the line.

<<  % This line begins a dictionary: an associative container.

% The /Size token is a name, and 114 is an integer.
% A name is a unique identifier with no inherent meaning. It is similar to
% symbols in Ruby, except styled as /Name instead of :symbol.
%
% Together, the next line is a key-value pair, setting the key /Size to the value 114.
/Size 114

% The [ ] on the next line denote an array. This array contains two hexadecimal-encoded
% strings, each delimited by the < > brackets.
%
% While this array contains two strings, objects in arrays are not required to all be the
% same type. An array can contain anything, including other arrays.
/ID [<4E1CA7EA753568DAB678EE2819ECE61E> <4E1CA7EA753568DAB678EE2819ECE61E>]

% Names are values and can occur not just as keys, but also as dictionary values.
/Type /XRef

% The value in the key-value pair on the next line is "1 0 R", which should be read as a
% single token. This is the first special type we'll see in PDF: an indirect reference. The
% first number is the unique ID of an object and the second number is its generation
% (version). The letter R stands for "Reference". The reader is expected to substitute "1 0 R"
% for the actual value, which is looked up from the global xref table we discussed in layer 1.
/Root 1 0 R

>>

Aside from the comments, I took the above directly from a PDF file I had lying around. Specifically, this is the dictionary in the Trailer.

Aside from supporting references between objects (the 1 0 R token), COS has one more special trick up its sleeve: it can encode and compress strings in a variety of efficient formats. This is called a stream, and an example is shown below. The entire object with the ID 7 is binary data compressed using the flate algorithm.

% Declaration of PDF object ID 7, version (generation) 0
7 0 obj
<<
% A COS dictionary comes before the stream keyword.
/Length 2596 % The stream contains 2596 bytes.
/Filter /FlateDecode % The stream is decoded using flate (the deflate algorithm used by zlib and gzip).
>>
stream
% 2596 bytes of binary data here
endstream
endobj

As you may have guessed, other PDF objects in the same document can refer to the data in this object (which is a JPEG image) using the token 7 0 R, which the PDF reader will substitute for the decoded binary data.
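
Resolving such a reference is conceptually simple. The sketch below (not rsc/pdf code; ref, resolve and parseObject are hypothetical names) shows the idea: look up the object's offset in the xref table from layer 1, seek there, and parse whatever object is declared at that position.

package pdfsketch

import (
	"fmt"
	"io"
)

// ref is an indirect reference such as "7 0 R": an object ID plus a generation.
type ref struct {
	id, gen int
}

// resolve looks up the referenced object's byte offset in the xref table and
// hands the bytes at that offset to the caller's object parser.
func resolve(body io.ReadSeeker, xref map[int]int64, r ref,
	parseObject func(io.Reader) (interface{}, error)) (interface{}, error) {
	off, ok := xref[r.id]
	if !ok {
		return nil, fmt.Errorf("object %d %d R not found in xref table", r.id, r.gen)
	}
	if _, err := body.Seek(off, io.SeekStart); err != nil {
		return nil, err
	}
	return parseObject(body)
}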

In addition to FlateDecode, another common filter is ASCII85Decode, in which case the stream data is stored as base85 instead of compressed bytes. It looks like this:

29 0 obj
<<
/Length 47510
/Filter /ASCII85Decode
>>
stream
,p?)`/O;oc@V&#ID
% a few more thousand lines of base85
YTVEql9.U7@k#NQq]49tbL-~> % note the ~> which always delimits the end of a base85 stream
endstream
endobj

And that’s everything you need to know about COS. It’s a simple format at its core, and its extensions (streams and references) are built on top of its more primitive types: dictionaries and names.

However, the base85 format represents a gotcha for parsers, because it includes characters that look a lot like array, string and dictionary delimiters, but are supposed to be interpreted as part of the stream. Recall that parsers for JSON and XML are commonly written as two layers:

  1. The tokenizer (lexer), which splits the bytes into a list of logical tokens, like left-bracket, name or number. (Example from the JSON spec.)
  2. The parser, which rebuilds the logical document structure from the tokens.

A parser for COS, however, cannot use a tokenizer, because there is no way to tell, at the byte layer, whether you are reading a base85 section or regular COS tokens. This is because the base85 sections are not introduced by a clear delimiter, like the CDATA marker in XML or the quote (") token in JSON.

Instead, a base85 section in COS just begins after the word stream, but its encoding (and therefore character set) is dependent on a metadata dictionary specifying the filter, whose value can only be decided at the next layer: the parser. For this reason, parsing COS requires a more complicated algorithm that interprets the language as it’s read and adjusts which tokenizer is used based on higher-level constructs.
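
To make that concrete, here is a simplified sketch of the parser-level decision. It is not the rsc/pdf code; readStreamData is a made-up helper, and it assumes the parser has already consumed the stream keyword and extracted the /Length value from the dictionary it parsed just before.

package pdfsketch

import (
	"bufio"
	"fmt"
	"io"
)

// readStreamData reads the raw contents of a stream object. The length comes
// from the /Length entry of the preceding dictionary - a value the tokenizer
// alone could never know about.
func readStreamData(r *bufio.Reader, length int64) ([]byte, error) {
	// The spec allows CRLF or LF after the "stream" keyword; skip it.
	b, err := r.ReadByte()
	if err != nil {
		return nil, err
	}
	if b == '\r' {
		if b, err = r.ReadByte(); err != nil {
			return nil, err
		}
	}
	if b != '\n' {
		return nil, fmt.Errorf("malformed end-of-line after stream keyword")
	}
	// Read exactly /Length raw bytes, bypassing the tokenizer entirely: no
	// delimiters, names or brackets are interpreted inside the stream.
	data := make([]byte, length)
	if _, err := io.ReadFull(r, data); err != nil {
		return nil, err
	}
	return data, nil
}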

In the next section, I’ll look at how the Go implementation falls for this gotcha and introduce a hacky workaround.

Parsing PDFs with Go

Trying to point the Go PDF library at a document that contains base85 streams ends with a panic in lex.go:82.

panic: malformed PDF: reading at offset 0: stream not present

This isn’t really where the error occurs, though. The panic happens because of a call to errorReadCloser.Read(), which already contains the error from earlier. Quickly searching the codebase for where that type is instantiated finds just one call site: read.go:796.

// Reader returns the data contained in the stream v.
// If v.Kind() != Stream, Reader returns a ReadCloser that
// responds to all reads with a ``stream not present'' error.
func (v Value) Reader() io.ReadCloser {
	x, ok := v.data.(stream)
	if !ok {
		return &errorReadCloser{fmt.Errorf("stream not present")}
	}
	// ...
}

The call stack when we hit the return leads back to what the library (inaccurately) calls the PostScript interpreter:

func Interpret(strm Value, do func(stk *Stack, op string)) {
	rd := strm.Reader()
	// ...
}

In fact, attaching a debugger at the line that calls Reader() reveals that the strm variable isn’t even an invalid stream; it’s just empty. What gives?

If the stream had been invalid instead of empty, the library should have panicked where it tries to decode the stream, since it only supports /FlateDecode, but not /ASCII85Decode. If that were the case, we could just add another case statement and implement base85 decode below.

func applyFilter(rd io.Reader, name string, param Value) io.Reader {
	switch name {
	default:
		panic("unknown filter " + name)
	case "FlateDecode":
		// ...
	}
}
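
For what it’s worth, the decoding itself would be the easy part. A minimal standalone sketch (not part of rsc/pdf; decodeASCII85Stream is a hypothetical helper) using the standard encoding/ascii85 package might look like this:

package pdfsketch

import (
	"bytes"
	"encoding/ascii85"
	"io"
)

// decodeASCII85Stream decodes the contents of an /ASCII85Decode stream. PDF
// terminates base85 data with "~>", which the standard decoder does not
// accept, so the marker is stripped before decoding.
func decodeASCII85Stream(rd io.Reader) (io.Reader, error) {
	raw, err := io.ReadAll(rd)
	if err != nil {
		return nil, err
	}
	if i := bytes.Index(raw, []byte("~>")); i >= 0 {
		raw = raw[:i]
	}
	return ascii85.NewDecoder(bytes.NewReader(raw)), nil
}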

But the code never gets to this point. In fact, the entire PDF graph seems to be messed up. If you’ve been paying attention, you might already suspect that the problem is in the parser: it failed to find the stream object in the first place.

Putting a breakpoint earlier in the program, we eventually end up in a function called readKeyword() in lex.go. (All comments mine - the original code has no comments.)

// reads the next token from the file.
func (b *buffer) readKeyword() token {
	// This is just here to avoid allocating memory. The same scratch array is
	// reused across different calls. Otherwise, this is the same as creating
	// a new byte slice.
	tmp := b.tmp[:0]
	// Read the buffer one byte at a time. When you find a byte that ends the
	// token (a space or a delimiter), then you're done.
	for {
		c := b.readByte()
		if isDelim(c) || isSpace(c) {
			b.unreadByte()
			break
		}
		tmp = append(tmp, c)
	}

	b.tmp = tmp
	// `s` now contains the token.
	s := string(tmp)

	// The rest of the function is a switch statement that decides what to
	// do with the token.

	// ...
}

As a reminder, here’s what the PDF source looks like that this function is trying to read.

% Declare object ID 28
28 0 obj
<<
/Length 47510
/Filter /ASCII85Decode
>>
stream
,p?)`/O<oc@V&#IDKIHb/hf
% Lots more lines of base85
YTVEql9.U7@k#NQq]49tbL-~> % note the ~> which always delimits the end of a base85 stream
endstream
endobj

It all goes wrong when the parser tries to read the next token after stream. readKeyword() tries to read the base85 chunk as a single token (which could later be deserialized), but it gets confused by the < byte, which is a valid delimiter (isDelim('<') returns true). The lexer ends up producing a stream of nonsensical tokens, and eventually something notices and panics. We experienced one possible panic, but I am pretty sure there are other ways the library can crash after this happens.
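
For reference, the PDF spec defines exactly ten delimiter characters: ( ) < > [ ] { } / and %. A check in the spirit of the lexer's (this is a sketch based on the spec, not code copied from rsc/pdf) looks like this:

func isDelim(c byte) bool {
	// Any of these delimiter characters ends the current token; they show up
	// constantly in base85 data, which is why the lexer gets confused.
	switch c {
	case '(', ')', '<', '>', '[', ']', '{', '}', '/', '%':
		return true
	}
	return false
}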

Edited on 2022-03-28: The original text didn’t explicitly say this, but the library tries to use the /Length field to figure out the length of the stream. It does not rely on the endstream token. The library gets confused into trying to tokenize base85, and my hotfix simply lets it recover from that situation.

The Hotfix

Fixing this in an elegant way is not happening. The library has no tests and I’m worried that making non-local changes would lead to other problems. We could just decode the base85 chunk when we encounter it and return it as a token of decoded bytes, but this requires detecting that we’re reading base85 while in readKeyword(), which is hard, because the base85 content just starts with no delimiter. The only indication that base85 is about to begin is found 5-20 tokens earlier, where /Filter is set to /ASCII85Decode.

In the end, the only local fix I could come up with is looking for the ~> right delimiter at the end of the base85 block. This means the lexer has to frequently scan far ahead, just in case a ~> occurs, but it does, surprisingly, make everything work. The rest of the library seems happy to accept a stream with decoded contents.

And so we arrive at the fixed readKeyword().

func (b *buffer) readKeyword() token {
	tmp := b.tmp[:0]
	// Keep track of the first delimiter encountered. We have to scan all the way
	// to either a space or a `~>`. The latter indicates that everything we've
	// just read has been a base85 blob (surprise!). If the base85 delim is not
	// encountered, then we must rewind to firstDelim and resume as normal.
	//
	// (Base85 blobs in Postscript are only delimited from the RIGHT. There is
	// no way to know whether we're parsing a base85 literal without scanning
	// all the way to the next whitespace.)
	firstDelim := -1
	pos := 0
	for {
		c := b.readByte()
		if isSpace(c) {
			b.unreadByte()
			break
		}
		if isDelim(c) {
			firstDelim = pos
		}
		tmp = append(tmp, c)
		pos++
	}

	if len(tmp) > 2 && bytes.Equal(tmp[len(tmp)-2:], []byte("~>")) {
		// Base85-encoded string.
		return keyword(string(tmp))
	}

	if firstDelim >= 0 {
		b.unreadBytes(len(tmp) - firstDelim)
		tmp = tmp[:firstDelim]
	}

	b.tmp = tmp
	s := string(tmp)

	// The rest of the function is the same.
}

Conclusion

I thought this was an interesting excursion into a file format that’s very common, but most people never bother to look into. (It probably doesn’t help that the official spec is 1600 pages and Adobe keeps moving the URL.)

As for rsc/pdf, I am impressed that the library has held up as well as it has, but it should be replaced. There are no tests and the code is hard to navigate, because it uses terminology that doesn’t match how Adobe describes PDF these days. We are now unfortunately in a situation where Go has no good open source PDF library. (Rust is a lot better here, by the way, but it’s not as useful for quick scripting.) Somebody should really do something.

For now, the fixed fork supports base85 streams using a hack that I am not proud of. (I am a bit.)
