Wednesday 15 November 2017

Tabs vs Spaces is the Wrong Question

We've all been there: let's agree on a style guide for this source code so we can all work on it without it collapsing into an inconsistent mess, or the commit log turning into an endless back and forth of style changes. The rules run from the most basic (CR+LF or LF, because no one wants that flipping back and forth in a document's commits as people work on it from various systems) to brace styles (K&R vs 1TBS vs Allman etc). If we can find agreement on indent styles, can we also find agreement on spaces (2? 4? 3?) vs tabs?

The problem is, this asks the wrong questions. Why does my view of this source code need to be the same as yours? Why is our version control system working so hard to force us all to agree on the most minute of details (even totally invisible ones like line breaks)? I'd always said tabs were the right answer to indentation, because that way each user can set their own indentation width without changing the underlying text, but this is not actually the answer. The answer is that we shouldn't be sharing source code as (pure, versioned char by char) text in the first place. We should look at what this source code actually looks like to the machine: the AST, the underlying meaning onto which we hang some decoration (comments, which we should preserve at their attachment point in the source code).

Maybe I spent a year or two too many in the metaprogramming/code analysis/compiler dev space recently, but this seems like not only the obvious solution (held back only by inertia) but also something where everyone wins. We standardise a bit, because you'll no longer be messing around with fixed-width spacing to align notes and will have to get better at both self-documenting code and writing meaningful, self-contained comments, and in return we share and compare our source code via ASTs. From there the actual storage medium can still be text documents (or some AST intermediate format), but before anyone views the code it goes through their own pretty printer. Whatever style guide makes it most clear to you what the code does, you apply that and that's what you see and edit (with your own settings keeping your style consistent). When you share it again it gets turned back into the AST view, so only the actual changes are committed (to be shared back to everyone else).
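To make "compare via ASTs" concrete, here is a minimal sketch using Python's standard `ast` module. The function name `same_program` and the sample snippets are my own illustration, not part of any existing tool. One loud caveat: Python's built-in parser throws comments away, so a real version of this idea would need a parser that keeps them attached to their nodes.

```python
import ast


def same_program(source_a: str, source_b: str) -> bool:
    """Return True when both sources parse to the same AST.

    ast.dump() leaves out line and column numbers by default, so
    whitespace, indentation width and tabs vs spaces do not matter.
    """
    return ast.dump(ast.parse(source_a)) == ast.dump(ast.parse(source_b))


# Same function, laid out by two people with different habits.
TABS = "def area(w, h):\n\treturn w * h\n"
SPACES = "def area( w,h ):\n    return w*h\n"

# A real change: the operator is different.
CHANGED = "def area(w, h):\n    return w + h\n"

if __name__ == "__main__":
    print(same_program(TABS, SPACES))   # style-only difference
    print(same_program(TABS, CHANGED))  # behavioural difference
```

A diff built on this kind of comparison would report nothing for the first pair and a single changed operator for the second, which is the only thing a reviewer needs to see.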

Do you find K&R braces so much easier to parse? Then that's what you set your pretty printer to generate for your local copy. There's no reason why we should all be forced to view the same text document representation of source code. We all have different views on the various choices of formatting and which errors each makes easier or harder to spot. We all have different susceptibilities to inserting those defects. We should each find the way of viewing source code that most reduces our chances of making mistakes when writing it.

We can start this today. Rather than spending time in meetings and maintaining or teaching the house style guide, spend the time writing tools to make your sharing layer style agnostic. Just start out by always pretty printing the code into a common style before commit, and pretty printing changed documents into the user's preferred style after a pull. If you have someone on the team who tinkers with LLVM/Clang then you've probably already got the expertise to make those tools (even if you can't find an existing reformatter that does the job for your programming language). If this idea gains traction, this is something that should be extended into the basic functionality of VCSs.
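The "pretty print into a common style before commit" half can be sketched in a few lines, assuming Python source and Python 3.9 or later (where `ast.unparse` exists). The name `to_common_style` is hypothetical, and the "common style" here is simply whatever `ast.unparse` emits. It also drops comments, so treat it as an illustration of the round trip rather than something to wire into a real hook as-is.

```python
import ast


def to_common_style(source: str) -> str:
    """Parse the source and print it back out in one fixed layout.

    Everything that is only formatting (indent width, tabs, spacing
    around operators, line endings) is lost on the way through the AST.
    """
    return ast.unparse(ast.parse(source)) + "\n"


# Two local copies of the same code in two personal styles.
MINE = "def area( w,h ):\n\treturn w*h\n"
YOURS = "def area(w, h):\n  return w * h\r\n"

if __name__ == "__main__":
    # Both collapse to identical text, so the commit shows no style churn.
    print(to_common_style(MINE) == to_common_style(YOURS))
    print(to_common_style(MINE), end="")
```

The pull side is the mirror image: take the committed text and run it through a printer configured with your own preferences before it reaches your editor.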

Once tools start to support this we can look at moving the VCSs to never needing the text version, just storing an AST representation. Your commit system also gets static code analysis almost for free, since it is already parsing the text into an AST - no excuse for allowing commits that your compiler will choke on.
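As a rough sketch of that commit-time check, again assuming Python source: if the commit step has to parse every file anyway, rejecting unparseable ones is a few extra lines. `commit_problems` and the file names are made up for the example, and a parse check is only the most basic form of static analysis.

```python
import ast


def commit_problems(files: dict[str, str]) -> list[str]:
    """Return one message per file that fails to parse.

    `files` maps a path to the source text about to be committed.
    An empty result means every file made it into an AST.
    """
    problems = []
    for path, source in files.items():
        try:
            ast.parse(source, filename=path)
        except SyntaxError as err:
            problems.append(f"{path}:{err.lineno}: {err.msg}")
    return problems


if __name__ == "__main__":
    staged = {
        "ok.py": "x = 1\n",
        "broken.py": "def f(:\n    pass\n",
    }
    for message in commit_problems(staged):
        print("refusing commit:", message)
```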

We can go further and integrate this new freedom more deeply. While I may like mixedCase names/ids, another person may prefer snake_case, and why do we need a common naming convention when actually we're just using different ways of compounding a list of words (the actual unique identifier)? We can store them as lists of words tied to each name and pretty print them into the actual source code to suit your preference. As long as we're not tripping over any language restrictions, that should be totally fine. We don't need to fight over so much in these style guides; the compiler certainly doesn't care.
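Here is a small sketch of the "list of words" idea in Python. All three function names are my own, and the splitting rule is a simplifying assumption: it ignores leading underscores and forgets which words were acronyms, both of which a real tool would have to record alongside the word list.

```python
import re


def split_words(identifier: str) -> list[str]:
    """Break a snake_case or mixedCase identifier into lowercase words."""
    words = []
    for part in identifier.split("_"):
        # Acronym runs (HTTP), capitalised or lowercase words, digit runs.
        words += re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", part)
    return [word.lower() for word in words]


def snake_case(words: list[str]) -> str:
    """Render a non-empty word list as snake_case."""
    return "_".join(words)


def mixed_case(words: list[str]) -> str:
    """Render a non-empty word list as mixedCase."""
    return words[0] + "".join(word.capitalize() for word in words[1:])


if __name__ == "__main__":
    stored = split_words("parseHttpRequest")  # what the repository keeps
    print(stored)
    print(snake_case(stored))  # one reader's view
    print(mixed_case(stored))  # another reader's view
```

Both spellings map to the same stored list, so two people can type the name their own way and still refer to the same identifier.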

Edit: with thanks for links to projects already working to provide the tools to do exactly this: SwiftFormat & Esprima's Source Rewrite demo. I'll add that my own source rewriting tools were always built on PyCParser (which is a lot easier to throw something together with than the transformation stuff in LLVM/Clang, but also only handles C99 unless you write your own grammar). I see Clang now has a mature ClangFormat tool, so formatting C/C++/Obj-C code no longer means poking around in the guts of (potentially moving) Clang APIs. I see that's already getting some thumbs up (with the same desire to one day fix all this via ASTs).

Also very good to hear of development teams who are already doing exactly what I've suggested above (with rewriting to a house style on commit & option to custom pretty print on pull).