In this 50th anniversary year of the first Moon landing, everyone seems to be talking about that achievement and those first steps. That includes the BBC World Service’s 13 Minutes to the Moon podcast, which covers in detail the final moments as the crew – Neil Armstrong and Buzz Aldrin – separate from their command module, manned by Michael Collins, through powered descent and onto the surface. Using interviews, archive footage and those all-important communication loops, the podcast examines the trials of those final moments, including fuel shortages, a missed landing site, computer overloads and communication failures.
At each of those steps, mission control had two options: Go, or No Go.
- Fuel shortages were known and calmly acknowledged; communication was optimised by removing all noise from the loop except the essential fuel status reports. They continued, basing their Go/No Go on a stopwatch that estimated remaining fuel. Response: optimised communication, plus a secondary control when the primary control appeared unreliable.
- The landing site was overshot because the lunar module’s horizontal speed was too high. Neil Armstrong took manual control, overriding the guidance computer, to ensure a safe landing site could be reached. Response: manual oversight by a human being, who is often better able to respond effectively to unfamiliar circumstances (though only after significant training and experience).
- The guidance computer repeatedly raised “1201” and “1202” alarms mid-descent. These were triggered automatically because the computer could not schedule and execute its tasks fast enough. Through rapid analysis, mission control quickly identified their cause and allowed the landing to continue despite experiencing four of them. Response: a pre-planned response* – better to use time before the mission than during it.
- Communication was patchy. On top of the voice loop was the all-important data feed that sent essential telemetry back to Mission Control. No telemetry meant not enough data to safely continue the mission. However, whilst each Flight Controller had the option to call a No Go and abort the landing, each had enough confidence in the data they did have that the landing was allowed to continue. Response: intuitive confidence in known systems, based on expertise and an assessment of risk.
* Technically, these particular alarms weren’t pre-planned for, but others were. In this case, the 1201/1202 alarms were quickly analysed and a response returned.
In each of these failures, the team failed forward. Risks were analysed and accepted; the risk of aborting was deemed higher than that of continuing. Although an abort had been practised and simulated, that scenario could never be robustly tested. Instead, they chose to acknowledge the variables they did know, work with those, and mitigate where required.
Apply that to an IT project. Whilst I’ve never been involved in such a dramatic “Go/No Go” scenario, nerves can run high and failure can be damaging. When things don’t look right, the same call still needs to be made: do we roll back?
No.
In almost every situation I’ve been involved in, the risk of rolling back has been greater than that of “fixing forward” – that is, thinking on your feet and rapidly mitigating and fixing issues while the business operates. Past a certain point, too much has changed, and you risk losing everything that has happened between the deployment and the “Go/No Go” decision.
Fix it forward in the short term
Whilst one should still answer (and almost humour) requests with “yes, we have a back-out plan” to settle management nerves, another question that should be asked of a potentially failed deployment is “have you got a fail-forward plan?”. What are the most likely failure modes of the deployment? How will you spot them? Will you spot them quickly enough? Are resources allocated to monitor post-deployment status and react outside of Business as Usual (BAU)? Is there a strong communication channel between users and potential problem solvers?
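Those questions can be condensed into a small decision aid. The sketch below is purely illustrative – the metric names, thresholds and three-way verdict are all assumptions, not a real monitoring API – but it shows the shape of a fail-forward plan: watch post-deployment health, mitigate degradation in place, and treat escalation (let alone rollback) as the last resort.

```python
# Hypothetical sketch of a post-deployment health verdict.
# All names and thresholds here are illustrative assumptions.
from dataclasses import dataclass


@dataclass
class HealthSample:
    error_rate: float      # fraction of failed requests in the window
    p95_latency_ms: float  # 95th-percentile response time


def post_deploy_verdict(samples, max_error_rate=0.02, max_p95_ms=800):
    """Return 'go', 'fix-forward' or 'no-go' from sampled health data.

    'go'          - everything within tolerance; continue as BAU
    'fix-forward' - degraded but serviceable; mitigate in place
    'no-go'       - sustained breach; escalate (rollback as last resort)
    """
    breaches = [s for s in samples
                if s.error_rate > max_error_rate or s.p95_latency_ms > max_p95_ms]
    if not breaches:
        return "go"
    # A transient blip is mitigated in place; a sustained breach escalates.
    if len(breaches) < len(samples) / 2:
        return "fix-forward"
    return "no-go"


if __name__ == "__main__":
    window = [HealthSample(0.01, 400), HealthSample(0.05, 900),
              HealthSample(0.01, 420)]
    print(post_deploy_verdict(window))
```

The value is less in the code than in agreeing the thresholds and responses before the deployment – the IT equivalent of analysing the 1202 alarm before launch rather than mid-descent.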
After deployment, a retrospective would allow not only the deployment to be analysed, but also the responses to the deployment, successful or not. A team can learn from the before, during and after in this retrospective to increase the likelihood of success in future deployments – including the ability to fail forward, accepting, managing and mitigating risk as you go.
This fail-forward approach would probably strike fear into any project manager, yet it can fit within an agile project very well if the risk is accepted and handled. Whilst it would be ideal to test every possible scenario before a deployment, even within an agile framework you cannot predict everything. A managed, phased roll-out can instead increase exposure gradually, with rapid fixes and response plans ready in the wings if required. Each release carries risk, but a series of small change deployments can be a smaller risk than rolling back one large change.
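One common way to implement such a phased roll-out is deterministic user bucketing, sketched below. The function names and stage percentages are my own illustrative choices, not a reference to any particular feature-flag product; the point is that widening the percentage only ever adds users, so nobody flips back and forth between versions mid-rollout.

```python
# Illustrative phased roll-out gate (all names and stages assumed).
# Users are hashed into stable buckets, so widening the percentage
# only ever adds users to the new version - it never removes them.
import hashlib


def rollout_bucket(user_id: str) -> int:
    """Map a user to a stable bucket in the range 0-99."""
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    return int(digest, 16) % 100


def sees_new_version(user_id: str, percent_rolled_out: int) -> bool:
    """True if this user falls inside the current exposure window."""
    return rollout_bucket(user_id) < percent_rolled_out


if __name__ == "__main__":
    # Widen exposure in stages; halt and fix forward if health degrades.
    for stage in (1, 5, 25, 100):
        exposed = sum(sees_new_version(f"user-{i}", stage) for i in range(1000))
        print(f"{stage:3d}% stage -> {exposed} of 1000 users on the new version")
```

Each stage is a small Go/No Go of its own: if monitoring flags trouble at 5%, you fix forward with 95% of users unaffected, rather than rolling back a change that everyone has already received.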
But, the people who can solve problems fast aren’t necessarily close enough to the front line to be able to respond to failure. Indeed, some standards mandate that developers should not be allowed anywhere near a production – or even a test – environment.
Continuous deployments in the long term
Of course, fixing forward should never become a standard element of a deployment. This post is about recognising its role in still delivering value from a failing deployment by recognising, accepting and managing risk. But this has to be balanced against an increasingly regulated commercial environment, which requires teams to accept restrictions and requirements imposed by the likes of the US’s FFIEC, Sarbanes-Oxley or the EU’s GDPR. Technology, project methodologies, workflows and processes – commonly grouped under the banner of “DevOps” – are increasingly able to provide a greater level of comfort before a roll-out. Containers help manage test and deployment environments and configurations, continuous integration separates developers from test platforms, and test-first development patterns help identify failure before it reaches source control. All of these point to an Agile approach to software development and project delivery.
If an environment can provide not only the usual technical requirements:
- Test-first development
- Continuous integration from release branches, proven by testing
- Automated deployments into environments, e.g. test, pre-production, production
… but also the essential cultural requirements:
- strong communication and trust between developers and administrators
- emergency workflows/pathways that permit exceptional project responses, circumventing change controls in order to expedite responses
- acceptance by management of, and trust in, the first responders – often developers
… then one could shoot for the moon – or rather, for rapidly built, automatically test-proven, integrated code deployed into a production environment.
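The technical half of that checklist can be tied together in a toy pipeline model. Everything here – the function names, the change records, the environment list – is an illustrative sketch rather than any real CI/CD tool’s API; it simply mirrors the three bullet points: test-first, test-proven integration, then automated promotion through environments.

```python
# Toy model of the technical requirements above (all names assumed).

def run_tests(changes):
    """Test-first: every change must carry a passing test suite."""
    return all(change["tests_pass"] for change in changes)


def integrate(changes):
    """Continuous integration: merge only test-proven changes."""
    return [c["id"] for c in changes if c["tests_pass"]]


def deploy(release, environments=("test", "pre-production", "production")):
    """Automated promotion of one release through successive environments."""
    return [f"deployed {release} to {env}" for env in environments]


if __name__ == "__main__":
    changes = [{"id": "feat-1", "tests_pass": True},
               {"id": "feat-2", "tests_pass": True}]
    if run_tests(changes):
        release = "+".join(integrate(changes))
        for line in deploy(release):
            print(line)
```

The cultural requirements cannot be coded, of course – which is rather the point: the pipeline only earns its keep if the people around it trust the first responders when a stage fails.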