Friday 31 August 2018

Rust: Fail Fast and Loudly

So recently I was chatting to some Rustaceans about library code and their dislike of a library that can panic (the Rust macro to unwind the stack or abort depending on your build options). The basic argument put forth was that a library should always pass a Result up to the calling code because it cannot know if the error is recoverable. The chapter of the Rust Programming Language book even lays out this binary: Unrecoverable Errors panic! while Recoverable Errors return a Result. As a researcher in debugging, I reached the point where I basically banned these terms from my lectures because they can potentially lead to this thinking that libraries cannot know they're in an unrecoverable state and so can only defer to what calls them.

Terminology

Throughout CompSci literature, some terms relating to debugging are not used consistently. I'll start with the words I use (so I never have to write this in a blog post again). To illustrate the scale of the terminology issue, enjoy this quote from the 2009 revision to the IEEE Standard Classification for Software Anomalies:
The 1993 version of IEEE 1044 characterized the term “anomaly” as a synonym for error, fault, failure, incident, flaw, problem, gripe, glitch, defect, or bug, essentially deemphasizing any distinction among those words.
A defect (also called a fault, error, coding error, or bug) in source code is a minimal fragment of code whose execution can generate an incorrect behaviour. This is for some input (which includes the environment of execution), against whatever specification exists to declare what is and is not correct behaviour for the program. A defect is still a defect even if it is not exercised or does not cascade into error / failure during a test case. Defects can be repaired by substituting in a replacement block of code into the block reported as defective; this returns the program to executing in a way that does not violate the specifications.

An error (sometimes also called a fault or infection) in program execution or modelled / simulated execution is the consequence of executing a defective block of code and the resulting creation of an erroneous state for the program. This effect may not be visible and not all errors will be exposed by surfacing as a failure. An error is the result of a defect being exercised by an execution which is susceptible to that defect.

A failure in program execution or modelled / simulated execution is the surfacing of an error state by the observation of behaviour in violation of the program specification. It is therefore correct to say that a failure was experienced for a given test case due to a chain of erroneous states that originated with the execution of a defect that caused the error.

Setting a Trap

Having muttered about the language choices made in the Rust book at the top, I'm going to also praise how they actually resolve that chapter. The final section goes into detail about the pros and cons of calling panic from your code. It defines bad state in a way I might write myself in a practical programming guide. It even offers up the type system as a way to ensure your specifications for input aren't violated, with as much of the burden placed on compile time as possible.

It's good writing but it can also potentially be read by those who really wish panics didn't exist as saying you just need to make sure every possible input into your library is valid. My stance is that an occasional small slice of invalid input being possible is actually important for code quality when writing in a language that can fail (it can also make some things a lot easier to write in practice). However, it must be clearly labelled as such, with no question about when you might panic. This is the contract you're writing and every good library should fully document the interface so there is no possibility of an unexpected panic.

To give an example from the Rust standard library (which is totally just a library and we should expect other libraries to conform to the same standards it uses - this is even more true of Rust than in other languages as Rust splits out the really core library code into the Rust core library). When you've got a vector and you need to divide it in two, split_off(at) is what you need.
Splits the collection into two at the given index.
Returns a newly allocated Self. self contains elements [0, at), and the returned Self contains elements [at, len).
Note that the capacity of self does not change.
Panics if at > len.
Here we have a clearly defined operation that does exactly what we want and comes with some important guarantees about how it operates. One of those details is that if we ask to split beyond the end of the array then it will panic.

Why does this panic rather than returning a Result and letting us decide if the error is recoverable or not? I can imagine many places where trying to split an array may not be the only thing a program can do to continue, a backup path could be constructed to continue operating under some circumstances if that failed but this library decision means the calling code cannot decide that. If you ask for a split at an invalid point then you get a panic.

It is because the library set a trap. It asks the calling code to know something about the object it wants to be manipulated. Because there is no reasonable way of asking for the array to be split in two beyond the end of the array, the only conclusion that the library can make about such a request is that it is unreasonable. We are past the point of executing a defect, we are swimming through an erroneous state, and it is time to fail so this can be caught and fixed. That also means no room to let the erroneous state accidentally ask to zero the entire storage medium it has access to and trash a last known-good state that might be used to recover later (or debug the defect). Any calling code that wishes to avoid this should catch the erroneous state and recover (if possible) before calling over the library boundary. The potential to get unreasonable requests allows our blocks of code to keep each other more honest by surfacing errors as failures.

It is a restriction that provides a higher chance of fixing a defect before we ship a product. We must strive to fail fast and sometimes that means using some small gaps between what is possible and what is permitted as traps to catch when errors have occurred. A library can be poorly constructed to panic when not expected (and declared) but the existence of panics should not itself be used as a sign that a library is of poor quality or to be avoided.