The Hardest Program I've Ever Written

September 08, 2015 code dart

The hardest program I’ve ever written, once you strip out the whitespace, is 3,835 lines long. That handful of code took me almost a year to write. Granted, that doesn’t take into account the code that didn’t make it. The commit history shows that I deleted 20,704 lines of code over that time. Every surviving line has about three fallen comrades.

If it took that much thrashing to get it right, you’d expect it to do something pretty deep right? Maybe a low-level hardware interface or some wicked graphics demo with tons of math and pumping early-90s-style techno? A likely-to-turn-evil machine learning AI Skynet thing?

Nope. It reads in a string and writes out a string. The only difference between the input and output strings is that it modifies some of the whitespace characters. I’m talking, of course, about an automated code formatter.

Introducing dartfmt

I work on the Dart programming language. Part of my job is helping make more Dart code, readable, idiomatic, and consistent, which is why I ended up writing our style guide. That was a good first step, but any style guide written in English is either so brief that it’s ambiguous, or so long that no one reads it.

Go’s “gofmt” tool showed a better solution: automatically format everything. Code is easier to read and contribute to because it’s already in the style you’re used to. Even if the output of the formatter isn’t great, it ends those interminable soul-crushing arguments on code reviews about formatting.

Of course, I still have to sell users on running the formatter in the first place. For that, having great output really does matter. Also, I’m pretty picky with the formatting in my own code, and I didn’t want to tell users to use a tool that I didn’t use myself.

Getting that kind of quality means applying pretty sophisticated formatting rules. That in turn makes performance difficult. I knew balancing quality and speed would be hard, but I didn’t realize just how deep the rabbit hole went.

I have finally emerged back into the sun, and I’m pleased with what I brought back. I like the output, and the performance is solid. On my laptop, it can blow through over two million lines of code in about 45 seconds, using a single core.

Why is formatting hard?

At this point, you’re probably thinking, “Wait. What’s so hard about formatting?” After you’ve parsed, can’t you just walk the AST and pretty-print it with some whitespace?

If every statement fit within the column limit of the page, yup. It’s a piece of cake. (I think that’s what gofmt does.) But our formatter also keeps your code within the line length limit. That means adding line breaks (or “splits” as the formatter calls them), and determining the best place to add those is famously hard.

Check out this guy:

experimentalBootstrap = document.querySelectorAll('link').any((link) =>
    link.attributes['rel'] == 'import' &&
        link.attributes['href'] == POLYMER_EXPERIMENTAL_HTML);

There are thirteen places where a line break is possible here according to our style rules. That’s 8,192 different combinations if we brute force them all 1. The search space we have to cover is exponentially large, and even ranking different solutions is a subtle problem. Is it better to split before the .any()? Why or why not?

What is up with the skulls?

I had two goals with this article: to explain how dartfmt works, and to show a realistic picture of how a real programmer solves a difficult problem with all of the messiness that entails. Alas, the first is more than long enough to try your patience, so I shunted all of the dead ends and failed attempts to footnotes. Click the skulls to laugh at my misfortune.

In Dart, we made things harder on ourselves. We have anonymous functions, lots of higher-order functions, and—until we added async and await—used futures for concurrency. That means lots of callbacks and lots of long method chains. Some Dart users really dig a functional style and appear to be playing a game where whoever crams the most work before a single semicolon wins.

Here’s real code from an amateur player:

_bindAssignablePropsOn.forEach((String eventName) => node
    .addEventListener(eventName, (_) => => bindAssignableProps
        .forEach((propAndExp) => propAndExp[1].assign(
            scope.context, jsNode[propAndExp[0]])))));

Yeah, that’s four nested functions. 1,048,576 ways to split that one. Here’s one of the best that I’ve found. This is what a pro player brings to the game:

return doughnutFryer
    .then((_) => _frostingGlazer.start())
    .then((_) => Future.wait([
    .then((_) => tellEveryoneDonutsAreJustAboutDone())
    .then((_) => Future.wait([
            .timeout(scriptLoadingTimeout, onTimeout: _handleBakingFailures)
    .then((_) {"Let's eat!");

(The funny names are because this was sanitized from internal code.) That’s a single statement, all 565 characters of it. There are about 549 billion ways we could line break it.

Ultimately, this is what the formatter does. It applies some fairly sophisticated ranking rules to find the best set of line breaks from an exponential solution space. Note that “best” is a property of the entire statement being formatted. A line break changes the indentation of the remainder of the statement, which in turn affects which other line breaks are needed. Sorry, Knuth. No dynamic programming this time 2.

I think the formatter does a good job, but how it does it is a mystery to users. People get spooked when robots surprise them, so I thought I would trace the inner workings of its metal mind. And maybe try to justify to myself why it took me a year to write a program whose behavior in many ways is indistinguishable from cat.

How the formatter sees your code

As you’d expect from a program that works on source code, the formatter is structured much like a compiler. It has a front end that parses your code and converts that to an intermediate representation 3. It does some optimization and clean up on that 4, and then the IR goes to a back end 5 that produces the final output. The main objects here are chunks, rules, and spans.


A chunk is an atomic unit of formatting. It’s a contiguous region of characters that we know will not contain any line breaks. Given this code:

format /* comment */ this;

We break it into these chunks: format /* comment */ this;.

Chunks are similar to a token in a conventional compiler, but they tend to be, well, chunkier. Often, the text for several tokens ends up in the same chunk, like this and ; here. If a line break can never occur between two tokens, they end up in the same chunk 6.

Chunks are mostly linear. For example, given an expression like:

some(nested, function(ca + ll))

We chunk it to the flat list: some( nested, function( ca + ll)).

We could treat an entire source file like a single flat sequence of chunks, but it would take forever and a day to line break the whole thing 7. With things like long chains of asynchronous code, a single “statement” may be hundreds of lines of code containing several nested functions or collections that each contain their own piles of code.

We can’t treat those nested functions or collection literals entirely independently because the surrounding expression affects how they are indented. That in turn affects how long their lines are. Indent a function body two more spaces and now its statements have two fewer spaces before they hit the end of the line.

Instead, we treat nested block bodies as a separate little list of chunks to be formatted mostly on their own but subordinate to where they appear. The chunk that begins one of these literals, like the { preceding a function or map, contains a list of child block chunks for the contained block. In other words, chunks do form a tree, but one that only reflects block nesting, not expressions.

The end of a chunk marks the point where a split may occur in the final output, and the chunk has some data describing it 8. It keeps track of whether a blank line should be added between the chunks (like between two class definitions), how much the next line should be indented, and the expression nesting depth at that point in the code.

The most important bit of data about the split is the rule that controls it 9.


Each potential split in the program is owned by a rule. A single rule may own the splits of several chunks. For example, a series of binary operators of the same kind like a + b + c + d uses a single rule for the splits after each + operator.

A rule controls which of its splits break and which don’t. It determines this based on the state that the rule is in, which it calls its value. You can think of a rule like a dial and the value is what you’ve turned it to. Given a value, the rule will tell you which of its chunks get split.

The simplest rule is a “hard split” rule. It says that its chunk always splits, so it only has one value: 0. This is useful for things like line comments where you always need to split after it, even in the middle of an expression 10.

Then there is a “simple” split rule. It allows two values: 0 means none of its chunks split and 1 means they all do. Since most splits are independent of the others, this gets used for most of the splits in the program.

Beyond that, there are a handful of special-case rules. These are used in places where we want to more precisely control the configuration of a set of splits. For example, the positional argument list in a function list is controlled by a single rule. A function call like:

function(first, second, third)

Will have splits after function(, first,, second,, and third). They are all owned by a single rule that only allows the following configurations:

// 0: Don't split at all.
function(first, second, third)

// 1: Split before the first.
    first, second, third)

// 2: Split before only the last argument.
function(first, second,

// 3: Split before only the middle argument.
    second, third)

// 4: Split before all of them.

Having a single rule for this instead of individual rules for each argument lets us prohibit things like:

    first, second,


Grouping a range of splits under a single rule helps us prevent split configurations we want to avoid like this, but it’s not enough. There are more complex constraints we want to enforce like: “if a split occurs inside a list element, the list should split too”. That avoids output like this:

[first, second +
    third, fourth]

Here, the list and the + expression have their own rules, but those rules need to interact. If the + takes value 1, the list rule needs to as well. To support this, rules can constrain each other. Any rule can limit the values another rule is allowed to take based on its own value. Typically, this is used to make a rule inside a nested expression force the rules surrounding itself to split when it does 11.

Finally, each rule has a cost. This is a numeric penalty that applies when any of that rule’s chunks are split. This helps us determine which sets of splits are better or worse than others 12.

Rule costs are only part of how overall fitness is calculated. Most of the cost calculation comes from spans.


A span marks a series of contiguous chunks that we want to avoid splitting. I picture it like a rubber band stretching around them. If a split happens in any of those chunks, the span is broken. When that happens, the solution is penalized based on the cost of the span.

Spans can nest arbitrarily deeply. In an expression like:

function(first(a, b), second(c, d))

There will be spans around a, b and c, d to try to keep those argument lists from splitting, but also another span around first(a, b), second(c, d) to keep the outer argument list from splitting.

If a split occurs between a, and b, the a, b span splits, but so does the first(a, b), second(c, d) one. However, if a split occurs after first(a, b), then the a, b span is still fine. In this way, spans teach the formatter to prefer splitting at a higher level of nesting when possible since it breaks fewer nested spans.

Parsing source to chunks

Converting your raw source code to this representation is fairly straightforward. The formatter uses the wonderful analyzer package to parse your code to an AST. This gives us a tree structure that represents every single byte of your program. Unlike many ASTs, it even includes comments.

Once we have that, the formatter does a top-down traversal of the tree. As it walks, it writes out chunks, rules, and spans for the various grammar productions. This is where the formatting “style” is determined.

There’s no rocket science here, but there are a lot of hairy corner cases. Comments can appear in weird places. We have to handle weird things like:

function(argument, // comment

Here, we normally would have a split after the first argument owned by an argument list rule. But the line comment adheres to the , and has a hard split after it, so we need to make sure the argument list rule handles that.

Whitespace is only implicitly tracked by the AST so we have to reconstitute it in the few places where your original whitespace affects the output. Having a detailed test suite really helps here.

Once we’ve visited the entire tree, the AST has been converted to a tree of chunks and a bunch of spans wrapped around pieces of it.

Formatting chunks

We’ve got ourselves a big tree of chunks owned by a slew of rules. Earlier, I said a rule is like a knob. Now we get to dial them in.

Doing this naïvely is infeasible. Even a small source file contains hundreds of individual rules and the set of possible solutions is exponential in the number of rules.

The first thing we do is divide the chunk list into regions we know can’t interfere with each other. These are roughly “lines” of code. So with:


We know that how we split the first statement has no effect on the second one. So we run through the list of chunks and break them into shorter lists whenever we hit a hard split that isn’t nested inside an expression.

Each of these shorter chunk lists is fed to the line splitter. Its job is to pick the best set of values for all the rules used by the chunks in the line. In most cases, this is trivial: if the whole line fits on the page, every rule gets set to zero—no splits—and we’re done.

When a line doesn’t fit, the splitter has to figure out which combination of rule values produces the best result. That is:

  1. The one with the fewest characters that go over the column limit.

  2. The one with the lowest cost, based on which rules and spans were split.

Calculating the cost for a set of rule values is pretty easy, but there are still way too many permutations to brute force it. If we can’t brute force it, how do we do it?

How line splitting works

I’m a college dropout so my knowledge of algorithms was fairly, um, rudimentary. So before I interviewed at Google, I spent two days in a hotel room cramming as many of them—mostly graph traversal—in my head as I could. At the time, I thought graphs would never come up in the interviews…

Then I had multiple interview questions that reduced down to doing the right kind of traversal over a graph. At the time, I thought this stuff would never be relevant to my actual job…

Then I spent the past few years at Google discovering that damn near every program I have to write can be reduced down to some kind of graph search. I wrote a package manager where dependencies are a transitive closure and version constraint solving is graph based. My hobby roguelike uses graphs for pathfinding. Graphs out the wazoo. I can do BFS in my sleep now.

Naturally, after several other failed approaches, I found that line splitting can be handled like a graph search problem 13. Each node in the graph represents a solution—a set of values for each rule. Solutions can be partial: some rules may be left with their values unbound.

From a given partial solution (including the initial “no rules bound” one), there are edge to new partial solutions. Each binds one additional rule to a value. By starting from an empty solution and walking this graph, we eventually reach complete solutions where all of the rules have been bound to values.

Graph search is great if you know where your destination is and you’re trying to find the best path. But we don’t actually know that. We don’t know what the best complete solution is. (If we did, we’ve be done already!)

Given this, no textbook graph search algorithm is sufficient. We need to apply some domain knowledge—we need to take advantage of rules and conditions implicit in the specific problem we’re solving.

After a dozen dead ends, I found three (sort of four) that are enough to get it finding the right solution quickly:

Bailing early

We are trying to minimize two soft constraints at the same time:

  1. We want to minimize the number of characters that overflow the line length limit. We can’t make this a hard constraint that there is no overflow because it’s possible for a long identifier or string literal to overflow in every solution. In that case, we still need to find the one that’s closest to fitting.

  2. We want to find the lowest cost—the fewest split rules and broken spans.

The first constraint dominates the second: we’ll prefer a solution with any cost if it fits one more character in. In practice, there is almost always a solution that does fit, so it usually comes down to picking the lowest cost solution 14.

We don’t know a priori what the cost of the winning solution will be, but we do know one useful piece of information: forcing a rule to split always increases the cost.

If we treat any unbound rule as being implicitly unsplit 15, that means the starting solution with everything unbound always has the lowest cost (zero). We can then explore outward from there in order of increasing cost by adding one rule at a time.

This is a basic best-first search: we keep a running queue of all of the partial solutions we’ve haven’t explored yet, sorted from lowest cost to highest. Each iteration, we pop a solution off.

If the solution completely fits in the page width, then we know we’ve won the overflow constraint. Since we’re exploring in order of increasing cost, we also know it’s the lowest cost. So, ta-da!, we found the winner and can stop exploring. Otherwise, if it has any unbound rules, we enqueue new solutions, each of which binds one of those to a value.

We basically explore the entire solution space in order of increasing cost. As soon as we find a solution that fits in the page, we stop.

Avoiding dead ends

The above sounds pretty promising, but it turns out that there can be an imperial ton of “low-cost but overflowing” solutions. When you’re trying to format a really long line, there are plenty of ways it can not fit, and this algorithm will try basically all of them. After all, they’re low cost since they don’t have many splits.

We need to avoid wasting time tweaking rules that aren’t part of the problem. For example, say we’re looking at a partial solution like this:

// Blog-friendly 40-char line limit:    |
    firstCall(a, b, c, d, e, f, g, h),
    secondCall("very long argument string here"));

There are a bunch of ways we can split the arguments to firstCall(), but we don’t need to. Its line already fits. The only line we need to worry about is the secondCall() one.

So, when we are expanding a partial solution, we only bind rules that have chunks on overflowing lines. If all of a rule’s chunks are on lines that already fit, we don’t mess with it. In fact, we don’t even worry about rules on any overflowing line but the first. Since tweaking the first line will affect the others, there’s no reason to worry about them yet 16.

This dramatically cuts down “branchiness” of the graph. Even though a partial solution may have dozens of unbound rules, usually only a couple are on long lines and only those get explored.

Pruning redundant branches

This gets us pretty far, but the splitter can still go off the deep end in some cases. The problem is that within large statements, you still run into cases where how you format part of the statement is mostly independent of later parts.

Take something like:

// Blog-friendly 40-char line limit:    |
new Compiler(
    assertions: options.has('checked-mode'),
    annotations: options.has('annotations'),
    primitives: options.has('primitives'),
    minify: options.has('minify'),
    preserve: options.has('preserve'),
    liveAnalysis: check(options.has('live'), options.has('analysis')),
    multi: options.has('multi'),
    sourceMap: options.has('source-map'));

Each of those named arguments can be split in a few different ways. And, since those are less nested—which means fewer split spans—than that nasty liveAnalysis: line, it will try every combination of all of them before it finally gets down to the business of splitting that check() call.

The best way to split the liveAnalysis: line is the best way to split it regardless of how we split assertions: or annotations:. In other words, there are big branches of the solution space that initially differ in irrelevant ways, but eventually reconvene to roughly the same solution. We traverse every single one of them.

What we need is a way to prune entire branches of the solution space. Given two partial solutions A and B, if we could say not just “A is better than B” but “every solution we can get to from A will be better than every solution we can get to from B” then we can discard B and the entire branch of solutions stemming from it.

It took some work, but I finally figured out that you can do this in many cases. Given two partial solutions, if one has a lower cost than the other and:

If all of those are true, then the one with a lower cost will always lead to solutions that also have a lower cost. Its entire branch wins. We can discard the other solution and everything that it leads to. Once I got this working, the formatter could line split damn near anything in record time 17.

An escape hatch

Alas, that “damn near” is significant. There are still a few cases where the formatter takes a long time. I’ve only ever seen this on machine generated code. Stuff like:

class ResolutionCopier {
  bool visitClassDeclaration(ClassDeclaration node) {
    ClassDeclaration toNode = this._toNode as ClassDeclaration;
    return javaBooleanAnd(
                                        node.metadata, toNode.metadata)),
                                    toNode.abstractKeyword)), _isEqualTokens(
                                node.classKeyword, toNode.classKeyword)),
                      ,, _isEqualNodes(
                            node.typeParameters, toNode.typeParameters)),
                            node.extendsClause, toNode.extendsClause)),
                        node.withClause, toNode.withClause)), _isEqualNodes(
                    node.implementsClause, toNode.implementsClause)),
                _isEqualTokens(node.leftBracket, toNode.leftBracket)),

Yeah, welcome to my waking nightmare. Unsurprisingly, code like this bogs down the formatter. I want dartfmt to be usable in things like presubmit scripts where it will have a ton of weird code thrown at it and it must complete in a reliable amount of time.

So there is one final escape hatch. If the line splitter tries, like, 5,000 solutions and still hasn’t found a winner yet, it just picks the best it found so far and bails.

In practice, I only see it hit this case on generated code. Thank God.

Finally, output

Once the line splitter has picked values for all of the rules, the rest is easy. The formatter walks the tree of chunks, printing their text. When a rule forces a chunk to split, it outputs a newline (or two), updates the indentation appropriately and keeps trucking.

The end result is a string of (I hope!) beautifully formatted Dart code. So much work just to add or remove a few spaces!


1 Yes, I really did brute force all of the combinations at first. It let me focus on getting the output correct before I worried about performance. Speed was fine for most statements. The other few wouldn’t finish until after the heat death of the universe.

2 For most of the time, the formatter did use dynamic programming and memoization. I felt like a wizard when I first figured out how to do it. It worked fairly well, but was a nightmare to debug.

It was highly recursive, and ensuring that the keys to the memoization table were precise enough to not cause bugs but not so precise that the cache lookups always fail was a very delicate balancing act. Over time, the amount of data needed to uniquely identify the state of a subproblem grew, including things like the entire expression nesting stack at a point in the line, and the memoization table performed worse and worse.

3 The IR evolved constantly. Spans and rules were later additions. Even the way chunks tracked indentation changed frequently. Indentation used to be stored in levels, where each level was two spaces. Then directly in spaces. Expression nesting went through a number of representations.

In all of this, the IR’s job is to balance being easy for the front-end to produce while being efficient for the back end to consume. The back end really drives this. The IR is structured to be the right data structure for the algorithm the back end wants to use.

4 Comments were the one of the biggest challenges. The formatter initially assumed there would be no newlines in some places. Who would expect a newline, say, between the keywords in abstract class? Alas, there’s nothing preventing a user from doing:

abstract // Oh, crap. A line comment.
class Foo {}

So I had to do a ton of work to make it resilient in the face of comments and newlines appearing in all sorts of weird places. There’s no single clean solution for this, just lots of edge cases and special handling.

5 The back end is where all of the performance challenges come from, and it went through two almost complete rewrites before it ended up where it is today.

6 I started from a simpler formatter written by a teammate that treated text, whitespace, and splits all as separate chunks. I unified those so that each chunk included non-whitespace text, line split information, and whitespace information if it didn’t split. That simplified a lot.

7 When I added support for better indentation of nested functions, that broke the code that split source into separately splittable regions. For a while, a single top-level statement would be split as a single unit, even if it contained nested functions with hundreds of lines of code. It was… not fast.

8 Ideally, the split information in a chunk would describe the split before the chunk’s text. This would avoid the pointless split information on the last chunk, and also solve annoying special-case handling of the indentation before the very first chunk.

I’ve tried to correct this mistake a number of times, but it causes a near-infinite number of off-by-one bugs and I just haven’t had the time to push it all the way through and fix everything.

9 Rules are a relatively recent addition. Originally each chunk’s split was handled independently. You could specify some relations between them like “if this chunk splits then this other one has to as well”, but you could not express things like “only one of these three chunks may split”

Eventually, I realized the latter is what I really needed to get argument lists formatting well, so I conceived of rules as a separate concept and rewrote the front and line splitter to work using those.

10 At first, I thought hard splits weren’t needed. Any place a mandatory newline appears (like between two statements) is a place where you could just break the list of chunks in two and line split each half independently. From the line splitter’s perspective, there would be no hard splits.

Which would work… except for line comments:

   // with a line comment

This has to be split as a single unit to get the expression nesting and indentation correct, but it also contains a mandatory newline after the line comment.

11 There used to be a separate class for a “multisplit” to directly handle forcing outer expressions to split when inner ones did. Once rules came along, they also needed to express constraints between them, and eventually those constraints were expressive enough to be able to handle the multisplit behavior directly and multisplits were removed.

12 I spent a lot of time tuning costs for different grammar productions to control how tightly bound different expressions were. The goal was to allow splits at the places where the reader thought code was “loosest”, so stuff like higher precedence expressions would have higher costs.

Tuning these costs was a nightmare. It was like a hanging mobile where tweaking one cost would unbalance all of the others. On more than one occasion, I found myself considering making them floating point instead of integers, a sure sign of madness.

It turns out spans are what you really want in order to express looseness. Nested infix operators then fall out naturally because you have more spans around the deeper nested operands. The parse tree gives it to you for free.

These days, almost every chunk and span has a cost of 1, and it’s the quantity of nested spans and contained chunks that determine where it splits.

13 I had known that clang-format worked this way for a long time, but I could never wrap my head around how to apply it to dartfmt’s richer chunk/rule/span system.

I took a lot of walks along the bike trail next to work trying to think through a way to get graph search working when the two numbers being optimized (overflow characters and cost) are in direct opposition, and we don’t even know what the goal state looks like. It took a long time before it clicked. Even then, it didn’t work at all until I figured out the right heuristics to use to optimize it.

14 For a long time, overflow and cost were treated as a single fitness function. Every overflow character just added a very high value to the cost to make the splitter strongly want to avoid them.

Splitting overflow out as a separate metric turned out to be key to getting the graph search to work because it let us order the solutions by cost independently of overflow characters.

15 I went back and forth on how an unbound rule should implicitly behave. Treating it as implicitly split gives you solutions with fewer overflow characters sooner. Treating it as unsplit gives you lower costs.

16 Oh, God. I tried a million different ways to reduce the branchiness before I hit on only looking at rules in the first long line. I’m still amazed that it works.

I could also talk about how controlling branchiness lets us avoid reaching the same state from multiple different paths. After all, it’s a graph, but everything I’ve described talks about it like it’s a tree. By carefully controlling how we extend partial solutions, we ensure we only take a single path to any given complete solution.

Before I got that working, I had to keep a “visited” set to make sure we didn’t explore the same regions twice, but just maintaining that set was a big performance sink.

17 Discarding overlapping branches is the last macro-optimization I did and its behavior is very subtle. Correctly detecting when two partial solutions overlap took a lot of iteration. Every time I thought I had it, one random weird test would fail where it accidentally collapsed two branches that would eventually diverge.

That bullet list was paid for in blood, sweat, and tears. I honestly don’t think I could have figured them out at all until late in the project when I had a comprehensive test suite.