Let’s face it, software fails. Failing can mean many things but most often it means that it’s unavailable completely or in part. That’s not the sort of failure I’m discussing here. At ADV, we often develop processes that move data around and typically that data is of very high value (medical records, corporate records pertaining to complex legal affairs, long-term historical information). Sometimes those processes are one-off conversions and sometimes they’re ongoing processes but regardless, we start with the assumption that the software will experience a failure. If we’ve done our job well, that failure won’t be our failure but the result of an external factor such as a database server going offline for maintenance while we’re relying on it, a network connection failing, a power loss, or something similar. My favorite example involved a wireless network link between two buildings (predating wifi) that was scrambling packets and, somehow, the whole stack of networking and operating system software and hardware that our application relied on to be reliable was not reporting the failure to our application. I’m oh-so-glad the client’s networking staff discovered that and didn’t blame our software.
Of course, we aren’t perfect either and our software has errors as well. Such is the nature of developing software, particularly under tight budget constraints. Thankfully, as important as some of our software is, we aren’t developing air traffic control software. I’d love to spend time studying how they guarantee reliability but I’m not sure whether I’d like writing that software!
Before continuing, let me clarify that while stability of software is very important, reliability is often more important. For my purposes from here on, my assumption is that reliability does not include stability. It’s very important that even very stable software fails and recovers reliably.
My first job involved working with financial aid applications and need analysis for private, largely parochial, high schools. The databases at the time were flat files (DBF actually) where each file represented a single table of data. Complicating matters was that an update might require updates to two tables. If one table was updated but the other wasn’t, the data was inconsistent. Today, this is easily avoided with SQL and a transaction but the environment I was working in didn’t allow for a transaction across multiple targets, even though the business requirements did. The basic approach to solving the problem was to create a persistent (on disk) list of the steps that would be required to undo the transaction if it failed while it was in progress. If, for example, the power went out, the application would initially look for this list and perform the recorded steps such that the system could be returned to a consistent status. This might seem pretty arcane but it’s still relevant today, even while some systems provide very strong transactional reliability. More on that in a moment…
What I’ve been surprised by is the number of developers I’ve worked with, or whose work I’ve reviewed, who have just never considered what happens when the unexpected does. I think this is in part because systems do run so well in general, or because the service that’s being provided can simply be retried by the user with no harm. When you’re updating and moving critical data though, you just can’t assume that because it worked once, or even 1 million times, it won’t fail, or that reliable supporting systems will always be reliable. Having just completed a small project requiring reliable failure, I thought I’d document a few of the techniques we use when software must fail reliably:
Use reliable underlying technologies that do a good job of reporting errors.
But, when it’s really important, don’t assume they do. You can typically sacrifice some performance by building in a verification step after updating the data. In my earlier example application with the failing wireless radio link, I added a step to reread the destination file to make sure that the file copy was truly successful. Yes, the application was slower, but speed wasn’t a priority for it.
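As a rough illustration of that kind of verification step, here is a minimal Python sketch. It is not the code from the original application; the function names are made up for the example, and it assumes that comparing SHA-256 hashes of the source and the reread destination is an acceptable check.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_and_verify(source: Path, destination: Path) -> None:
    """Copy a file, then reread the destination and compare it to the source."""
    shutil.copyfile(source, destination)
    if sha256_of(source) != sha256_of(destination):
        # Raise rather than return a flag so the caller cannot ignore the failure.
        raise IOError(f"verification failed: {destination} does not match {source}")
```

One caveat: depending on the operating system and file system, the reread may be served from a local cache rather than from the remote end, so treat this as a starting point and confirm how your environment behaves.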
Check the errors from your underlying support systems.
I’m amazed how many tools go out the door that don’t check error codes or have generalized error handlers. Sometimes you can cheat and just check the final result of a later operation but this is rarely safe in the long run, particularly as the application evolves. It may seem like overkill and bloat to check and handle errors but it is a necessity if the tool is doing anything non-trivial.
Compartmentalize the updating of data to as small a procedure as possible.
This may sound obvious but often the decision logic of what to update or how to update it becomes intertwined with the programming to perform the update. If you separate the decision logic of what to update from the act of updating, your code to perform the update will typically be much more compact. This has all sorts of side benefits: reusability, clarity, etc.
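To make the separation concrete, here is a small Python sketch. The names (Update, plan_updates, apply_updates) and the dictionary standing in for the data store are invented for the example; the point is only that one function decides and the other writes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Update:
    key: str
    new_value: str


def plan_updates(current: dict, incoming: dict) -> list:
    """Decision logic only: work out what must change, touch nothing."""
    return [
        Update(key, value)
        for key, value in sorted(incoming.items())
        if current.get(key) != value
    ]


def apply_updates(store: dict, updates: list) -> None:
    """Update logic only: no decisions, just perform the planned writes."""
    for update in updates:
        store[update.key] = update.new_value
```

Because apply_updates contains no branching, it is the only part that needs the line-by-line failure testing described later, and the plan it receives can be logged or reviewed before anything is written.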
Define your transaction and rollback steps.
A transaction is a series of steps that must succeed or fail as a unit. If your underlying subsystem can’t handle all aspects of your transaction (e.g. update a database table on server A, update a system via a web service, and write a change directly to a file), you will need to do so yourself. Once you’ve defined the steps your transaction will need, ask yourself what will happen if the system fails (remember the power outage problem) on any single line of code until the transaction is complete. Document what steps are required to roll the transaction back to the beginning (an undo) or what information is required to roll the transaction forward to completion, and then create a structure to store these steps.
I personally find it easiest if the rollback steps are very, very specific and don’t require the rollback code to include decision logic (e.g. Action: delete file, Parameter: file path).
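One possible shape for such a structure is sketched below in Python. This is an assumption about format, not a description of any particular system: each undo step is stored as one JSON line holding an action name and a single parameter, and the file name and function names are hypothetical.

```python
import json
import os


def record_undo_step(journal_path: str, action: str, parameter: str) -> None:
    """Append one undo step and force it to disk before the caller proceeds."""
    with open(journal_path, "a", encoding="utf-8") as journal:
        journal.write(json.dumps({"action": action, "parameter": parameter}) + "\n")
        journal.flush()
        os.fsync(journal.fileno())  # ask the OS to push the data to the device


def read_undo_steps(journal_path: str) -> list:
    """Return recorded steps newest first, the order in which to undo them."""
    if not os.path.exists(journal_path):
        return []
    with open(journal_path, encoding="utf-8") as journal:
        steps = [json.loads(line) for line in journal if line.strip()]
    return list(reversed(steps))
```

In this sketch the step is written before the action it undoes is performed, which is one of several reasonable orderings. It also does not handle a final line that was only partly written when the failure hit; json.loads would raise on it, and a real implementation needs a policy for that case.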
Remember that the rollback infrastructure may itself be unstable.
Whether you’re writing your rollback instruction to a database table, a file, or somewhere else persistent, remember that these too can all fail. This creates a complex question of whether you write the rollback instruction before executing the action or after and how to handle a failed write to the rollback list. I don’t have a single approach to this as it’s usually been obvious to me in the given situation but if you study the field, I’m sure there are best practice design patterns.
Read the rollback file on application startup.
When the application starts up, I have it read the rollback file right after it verifies it meets all other basic requirements (e.g. database connections established, access to business system APIs confirmed, etc.). If there are rollback operations, they are executed to return the system to a clean state. If that cannot be accomplished, the system shuts down again with an obvious message indicating that the system is inconsistent and likely requires manual intervention.
I never provide an option to force the tool to run while the data is inconsistent. Too many system administrators will try this and make recovery dramatically more difficult.
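A startup check along these lines might look like the following Python sketch. It assumes a rollback file of JSON lines, each with an "action" and a "parameter"; that format, the UNDO_ACTIONS table, and the recover function are all hypothetical names chosen for the example.

```python
import json
import os
import sys

# Hypothetical table of undo actions; each takes one parameter and makes no decisions.
UNDO_ACTIONS = {
    "delete_file": lambda path: os.remove(path) if os.path.exists(path) else None,
}


def recover(journal_path: str) -> bool:
    """Run pending undo steps newest first. Return True if the state is clean."""
    if not os.path.exists(journal_path):
        return True  # nothing was in progress when the application last stopped
    try:
        with open(journal_path, encoding="utf-8") as journal:
            steps = [json.loads(line) for line in journal if line.strip()]
        for step in reversed(steps):
            UNDO_ACTIONS[step["action"]](step["parameter"])
        os.remove(journal_path)  # only removed once every step has succeeded
    except (OSError, ValueError, KeyError) as error:
        print(f"Data is inconsistent, manual intervention required: {error}",
              file=sys.stderr)
        return False
    return True
```

The caller would exit when recover returns False, for example with sys.exit(1), and deliberately offers no flag to continue anyway. The delete_file action checks that the file exists first, so rerunning recovery after a failure in the middle of recovery does no harm.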
Test!
If you can interactively run the program, kill the application on each line of code within the update logic and test how the system recovers. If you can’t run interactively, build in some debugging code where you can trigger an application exit after each line of code. This could be a command line parameter that tells it where to immediately exit. Repeat.
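Here is a minimal Python sketch of that kind of debugging hook. It uses an environment variable rather than a command line parameter purely to keep the example short; the variable name APP_FAIL_AT, the exit code, and the function names are assumptions for illustration.

```python
import os

# Hypothetical variable name; a command line parameter would work equally well.
FAIL_POINT_VARIABLE = "APP_FAIL_AT"


def fail_point(name: str) -> None:
    """Exit immediately, without cleanup, if this point was selected for the test run."""
    if os.environ.get(FAIL_POINT_VARIABLE) == name:
        os._exit(99)  # skips finally blocks and atexit handlers, like a killed process


def update_both_tables(write_a, write_b) -> None:
    """Example update procedure with a fail point after each step."""
    write_a()
    fail_point("after_table_a")
    write_b()
    fail_point("after_table_b")
```

Running the program once per fail point name, then restarting it and checking that recovery leaves the data consistent, approximates killing the process by hand at each step.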
It’s also helpful to have someone else review and test the code if at all possible.
Finally, randomly test failure. I’ve randomly unplugged network cables, killed processes, stopped underlying systems, deleted and changed data out of sequence while the process was running, and created all manner of data issues. Throw everything you can at making your application fail. This doesn’t remove the need for the deliberate approach of testing but often uncovers errors in your testing.
Once you’ve done this time-consuming task, consider this code locked down. Any change to the update code requires another time-consuming and rigorous round of testing.
Log
Create a debug log for your application. There are many high performance logging tools and the performance impact of leaving logging on is often much less than what you might expect (don’t forget to consider the privacy and security considerations of what you log though). If you can, leave debug logging on even while the application is in production. If there is a failure, this often provides the best clues for recovery.
If you’re using .NET, I’ll put in a recommendation for NLog: http://www.nlog-project.org/. When I started working with .NET, I started using NLog. I’ve been meaning to investigate whether newer .NET versions have a better logger that I should evaluate. In the meantime, NLog is fast, reliable, and extremely flexible right out of the box, and it has friendly licensing.
No one should really be writing their own logger anymore. There are just so many high performance, flexible logging tools out there.
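For readers outside .NET, the same idea can be sketched with Python’s standard logging module. This is not NLog and makes no claim about its configuration; the logger name, file size limit, and backup count below are arbitrary choices for the example.

```python
import logging
from logging.handlers import RotatingFileHandler


def build_logger(log_path: str) -> logging.Logger:
    """Create a debug-level file logger that rotates so it can stay on in production."""
    logger = logging.getLogger("data_mover")  # hypothetical logger name
    logger.setLevel(logging.DEBUG)
    if not logger.handlers:  # avoid attaching duplicate handlers on repeated calls
        handler = RotatingFileHandler(
            log_path, maxBytes=10_000_000, backupCount=5, encoding="utf-8"
        )
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    return logger
```

Rotation keeps the disk use bounded, which is what makes leaving debug logging on in production practical. What gets written to the log still needs the privacy and security review mentioned above.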
Backup
This should go without saying… But in the event that something really bad happens, a backup, a log file, and any other inputs to the system that might be necessary for a redo of the updates can take you a long, long way.
If you’re dealing with life and death (or my bank account, etc.), study this a lot more closely!
Whether you’re designing pacemakers, air traffic control, or banking software, I hope you take what I’ve written with a grain of salt and study the design of reliable software a lot more closely. A quick web search will reveal a wealth of additional academic and practical research that’s well beyond my rambling blog.