What Apollo 11 teaches us about software project deployments

In this 50th anniversary year of the first moon landing, everyone seems to be talking about that achievement and those first steps – including the BBC World Service's 13 Minutes to the Moon podcast, which goes into some detail about the final moments as Neil Armstrong and Buzz Aldrin separated from the command module, manned by Michael Collins, and flew the lunar module through powered descent and onto the surface. Using interviews, archive recordings and those all-important communication loops, the podcast explores the trials of those final moments, including fuel shortages, a missed landing site, computer overloads and communication failures.

At each of those steps, mission control had two options: Go, or No Go.

  • The fuel shortage was known and calmly acknowledged, and communication was optimised by stripping everything but the essential fuel status reports out of the loop. They continued, basing their Go/No Go calls on a stopwatch used to estimate the remaining fuel. Response: optimised communication, plus a secondary control when the primary control appeared unreliable.
  • The landing site was overshot because the lunar module was travelling too fast horizontally. Neil Armstrong took manual control so that, guidance computer or not, a safe landing site could be reached. Response: manual oversight by a human being, who is often better able to respond effectively to unfamiliar circumstances (albeit after significant training and experience).
  • The guidance computer repeatedly raised “1201” and “1202” alarms mid-descent. These were triggered automatically because the computer could not schedule and execute all of its tasks quickly enough. Through rapid analysis, mission control quickly identified their cause and allowed the landing to continue despite experiencing four of these alarms. Response: a pre-planned response* – better to use time before the mission than during it.
  • Communication was patchy. On top of the voice loop was the all-important data feed sending essential telemetry back to Mission Control; no telemetry meant not enough data to safely continue the mission. However, whilst each Flight Controller had the option to call a No Go and abort the landing, each had enough confidence in what they did have that the landing was allowed to continue. Response: intuitive confidence in known systems, based on expertise and an assessment of the risk.

* Technically, these particular alarms weren’t planned for, but others were. In this case, the 1201/1202 alarms were quickly analysed and a response returned.

Each failure here was failed forward. The risks were analysed and accepted; the risk of aborting was deemed higher than that of continuing. Although they had practised and simulated an abort, that scenario could never be robustly tested. Instead, they chose to acknowledge the variables they did know, work with those, and mitigate where required.

Apply that to an IT project. Whilst I’ve never been involved in quite such a dramatic “Go/No Go” scenario, nerves can run high and failure can be damaging. When things don’t look right, the same call still needs to be made: do we roll back?

No.

In almost every situation I’ve been involved in, the risk of rolling back has been greater than that of “fixing forward” – thinking on your feet and rapidly mitigating and fixing issues while the business continues to operate. Past a certain point, too much has changed, and you run the very real risk of losing whatever has happened between the deployment and the “Go/No Go” decision.

Fix it forward in the short term

Whilst one should still answer (even humour) requests with “yes, we have a back-out plan” to settle management nerves, another question that should be asked of a potentially failing deployment is “have you got a fail forward plan?”. What are the most likely ways the deployment could fail? How will you spot them? Will you spot them quickly enough? Is there resource allocated to monitor post-deployment status and react outside of Business as Usual (BAU)? Is there a strong communication channel between users and the people who can solve their problems?
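To make that monitoring question concrete, here is a minimal sketch in Python – with entirely hypothetical endpoint names – of the kind of post-deployment smoke check a fail forward plan might lean on: something a person actually watches in the minutes after go-live, rather than an automatic rollback trigger.

```python
import sys
import time
import urllib.request

# Hypothetical health-check endpoints for the newly deployed system --
# substitute whatever your own deployment actually exposes.
CHECKS = {
    "login page": "https://example.internal/login",
    "orders API": "https://example.internal/api/orders/health",
    "reporting": "https://example.internal/reports/health",
}


def check(name, url, timeout=5.0):
    """Return True if the endpoint answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except Exception as exc:
        print(f"  {name}: FAILED ({exc})")
        return False


def smoke_test():
    print(f"Post-deployment smoke test at {time.strftime('%H:%M:%S')}")
    # Run every check even if an early one fails, so the watchers see
    # the full picture rather than just the first symptom.
    results = [check(name, url) for name, url in CHECKS.items()]
    return all(results)


if __name__ == "__main__":
    # A non-zero exit code is the signal to start fixing forward,
    # not an instruction to roll back.
    sys.exit(0 if smoke_test() else 1)
```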

After deployment, a retrospective allows not only the deployment itself to be analysed, but also the responses to it, successful or not. A team can use that retrospective to learn from the before, during and after, increasing the likelihood of success in future deployments – including the ability to fail forward, accepting, managing and mitigating risk as you go.

This fail forward approach would probably strike fear into any project manager, yet it can fit within an agile project very well if the risk is accepted and handled. Whilst it would be ideal to test every possible scenario before a deployment, even within an agile framework you cannot predict everything. A managed, phased roll-out can instead increase exposure gradually, with rapid fixes and response plans waiting in the wings if required. Each release carries risk, but a series of smaller deployments can each be a smaller risk than rolling back one large change.
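As a rough sketch of what a phased roll-out can look like in code (Python here, with a made-up feature name and user ID), a deterministic percentage bucket lets you expose a change to 5% of users, then 25%, and so on, without anyone flip-flopping between the old and new behaviour:

```python
import hashlib


def in_rollout(user_id: str, feature: str, percentage: int) -> bool:
    """Deterministically decide whether a user is in a phased roll-out.

    The same user always gets the same answer for the same feature, so the
    exposed population only ever grows as `percentage` is raised.
    """
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % 100  # a stable number in the range 0-99
    return bucket < percentage


# Week one: 5% of users see the new code path; later releases raise the figure.
if in_rollout(user_id="user-12345", feature="new-checkout", percentage=5):
    print("serve the new code path")
else:
    print("serve the existing code path")
```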

But the people who can solve problems fast aren’t necessarily close enough to the front line to respond to failure. Indeed, some standards mandate that developers should not be allowed anywhere near a production – or even a test – environment.

Continuous deployments in the long term

Of course, fixing forward should never become a standard element of a deployment task. This post is about recognising its role in still delivering value from a failing deployment by recognising, accepting and managing risk. That has to be balanced against an increasingly regulated commercial environment, which requires teams to accept restrictions and requirements imposed by the likes of the US’s FFIEC, Sarbanes-Oxley or the EU’s GDPR. Technology, project methodologies, workflows and processes – commonly grouped under the banner of “DevOps” – increasingly offer a greater level of comfort before a roll-out. Containers help manage test and deployment environments and their configuration, continuous integration separates developers from test platforms, and test-first development patterns help identify failure before it gets into source control. All of these point towards an agile approach to software development and project delivery.
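As a small, hypothetical illustration of the test-first pattern (the function and its behaviour are invented for the example), the tests describe the expected behaviour before the implementation exists, and the CI pipeline runs them on every push so a regression is caught before it reaches a release branch:

```python
# A self-contained sketch: in a real project apply_discount() would live in
# its own module, and these tests would be written (and failing) first.
import pytest


def apply_discount(price: float, percent: float) -> float:
    """The minimal implementation written to make the tests below pass."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return price * (1 - percent / 100)


def test_ten_percent_discount():
    assert apply_discount(price=100.0, percent=10) == 90.0


def test_discount_cannot_exceed_full_price():
    # The behaviour was agreed in the test before any code was written.
    with pytest.raises(ValueError):
        apply_discount(price=100.0, percent=150)
```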

If an environment can provide not only the usual technical requirements:

  • Test-first development
  • Continuous integration from release branches, proven by testing
  • Automated deployments into environments, e.g. test, pre-production, production

… but also the essential cultural requirements:

  • strong communication and trust between developers and administrators
  • emergency workflows/pathways that permit exceptional project responses, circumventing change controls in order to expedite a fix
  • acceptance by management of, and trust in, the first responders – often developers

… then one could shoot for the moon – or at least for rapidly built, automatically test-proven, integrated code deployed into a production environment.

Plex and DVR

I’ve been an enthusiastic user and supporter of Windows Media Centre over the years, spending a lot of money on an optimal set-up able to command WAF (Wife Acceptance Factor). Of course, it being a standout solution that “just works”, Microsoft decided to kill it – much like other awesome tech such as Kinect, Windows Phone/Mobile, Silverlight, etc. So we needed an alternative that could stream the music and films we have on our home network and schedule recordings of FreeSat content – all of that across the house. Until recently, that was a big ask, requiring technical know-how and patience which I simply do not have.

I’ve been using Plex as a media server for a while. I’ve not been altogether impressed, but it has seemed to be a consumer-friendly (if sometimes temperamental with connections) solution, rich and intuitive enough to possibly achieve WAF. When I heard that they had started supporting DVR (the missing piece) for OTA (over-the-air) television, I thought I’d give it another go on the same machine previously used to run Windows Media Centre. Officially, Plex doesn’t appear to support FreeSat, but on the off chance I thought I’d try adding a LIVE TV / DVR configuration. Imagine my surprise when my device was recognised! (Note that you need a Plex Pass to enable LIVE TV / DVR support.)

The key was that my FreeSat card (a TBS 6981) appeared as a Hauppauge WinTV-quad, which Plex is compatible with. All I had to do was set it up – by selecting options from three entirely anonymous drop-down lists, a frustrating and disappointing user experience.

After trying a few obvious combinations, it became clear that with each channel scan taking a minute or so before failing, I’d be there for the duration. Having confirmed that I was actually getting a signal through my FreeSat cable by patching it into an old Humax receiver, I downloaded DVBViewer and used that as a means of quickly identifying the right settings – or rather, of figuring out what the drop-down lists actually meant.

Using some very basic FreeSat knowledge, I knew that we use an Astra satellite at 28.2 degrees east. That was easy. But what about the other fields? One related to the LNB type, and one to what I can only assume is the selection of the individual LNB within the quad LNB I have.
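For what it’s worth, the LNB type largely determines how a transponder frequency gets converted down to the intermediate frequency the tuner card actually sees. A rough Python sketch of that arithmetic for a standard universal Ku-band LNB (the transponder frequency below is just a placeholder, not a real FreeSat value):

```python
# Universal Ku-band LNB arithmetic: the LNB mixes the received transponder
# frequency down to an intermediate frequency (IF) that the tuner card tunes.
LOW_BAND_LO_MHZ = 9750    # local oscillator used below roughly 11700 MHz
HIGH_BAND_LO_MHZ = 10600  # local oscillator used above 11700 MHz (22 kHz tone on)


def intermediate_frequency_mhz(transponder_mhz: float) -> float:
    lo = LOW_BAND_LO_MHZ if transponder_mhz < 11700 else HIGH_BAND_LO_MHZ
    return transponder_mhz - lo


# Placeholder transponder frequency, not a real FreeSat transponder:
print(intermediate_frequency_mhz(11500.0))  # -> 1750.0 MHz, in the low band
```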

Mapping these settings into Plex resulted in a grand total of four channels – none of which I could actually receive. Odd. I tried it again and it worked fine. Go figure.

Next came mapping guide data to the channels. This is an incredibly onerous, immensely dull task, and it is particularly frustrating when the user interface isn’t optimised for it, which Plex’s isn’t. You can’t sort or filter, it’s difficult to see what’s what within a small window, and the similar design of the buttons makes it easy to accidentally re-scan the channels – as I did – and lose everything – as I did.

Everything seemed to work after stepping through the process. Firing up a Plex client on my mobile phone and in a web browser showed “LIVE TV”, and I could tune into channels.

There are some limitations, though.

  • You can’t time-slip, i.e. watch a recording whilst it is still in progress. I used to use this all the time to watch the news slightly late.
  • Not every client has a guide, which makes it excruciating to find programmes.
  • Not all apps have LIVE TV and DVR functionality. My Panasonic TV’s app doesn’t, for example. (Yeah, I know – my FreeSat is patched into my PC because the set itself is only FreeView.)

It’s definitely got promise, however. Plex is pretty polished and things do mostly “just work”, which ticks my box.