This article follows on from “The remarkable story of Future Combat Systems”. FCS was the first project to apply the new science of network performance to a complex distributed system. In this second part, we hear from Fred Hammond and Dr Neil Davies about the five key lessons they drew from the project:
- Network performance science is possible
- Engineering constraints are not well understood
- Networking has a weak professional ethos
- We live in an omnipresent cult of bandwidth
- We must engineer for an imperfect world
Neil Davies: “Boeing were using the ‘state of the art’, and it is that state we are critiquing. Boeing deserve praise for having the foresight to realise that the questions needed to be asked.”
Fred Hammond: “The FCS project was a real eye-opener for us. Until then, we didn’t realise quite how intellectually immature the networking industry was (and still is). Every time we were told something was ‘commercially confidential’, we found that it really meant ‘we don’t know’ (to put it politely).”
Here are five ways that the immaturity showed itself.
#1: Network performance science is possible
For us, FCS was a landmark event for our industry: the beginning of a rigorous science of network performance.
What shocked us was that the relationship between network supply and demand was not understood (and still isn’t!). There was a basic knowledge gap in how to express and align the two.
Instead, the FCS network engineers could only reason on a hop-by-hop basis. They had no framework to model how component performance composed into the system’s overall performance. This was a key source of frustration during the project’s development. There was a need for “real” science and engineering that could treat performance as a “first class” entity in the design process.
The idea of “quality attenuation” emerged as a language in which to capture these end-to-end demand requirements. This then led to Quantitative Timeliness Agreements (QTAs) that formally relate supply to demand in a rigorous manner. The “contract” is in a common mathematical language of the ΔQ calculus. This enables scientific reasoning about the relationship between supply and demand.
These new conceptual tools enabled us to (de)compose the requirements, and to model graceful degradation under load. We take this capability for granted in other engineering disciplines, but for packet networking this was (to the best of our knowledge) a world first.
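To make the idea concrete, here is a minimal sketch (not the FCS tooling, and every figure is invented): each hop’s quality attenuation can be treated as an improper delay distribution, where the missing probability mass is loss; the end-to-end ΔQ of hops in series is the convolution of the per-hop distributions; and a QTA is a bound that the composed distribution must satisfy.

```python
# A minimal sketch (not the FCS tooling; all figures invented) of composing "quality
# attenuation": each hop's ΔQ is an improper delay distribution whose missing
# probability mass is loss, and a QTA is a bound the end-to-end composition must meet.
from itertools import product

# Hypothetical per-hop ΔQ: {delay_ms: probability}; mass summing to <1 means packet loss.
hop_radio = {5: 0.60, 20: 0.30, 60: 0.08}   # 2% loss on the radio hop
hop_relay = {2: 0.90, 10: 0.09}             # 1% loss on the relay hop

def compose(dq_a, dq_b):
    """End-to-end ΔQ of two hops in series (assumed independent): convolve the delays."""
    out = {}
    for (da, pa), (db, pb) in product(dq_a.items(), dq_b.items()):
        out[da + db] = out.get(da + db, 0.0) + pa * pb
    return out

def meets_qta(dq, deadline_ms, min_fraction):
    """QTA check: does at least min_fraction of offered traffic arrive within deadline_ms?"""
    return sum(p for d, p in dq.items() if d <= deadline_ms) >= min_fraction

end_to_end = compose(hop_radio, hop_relay)
print(meets_qta(end_to_end, deadline_ms=50, min_fraction=0.90))   # False: demand exceeds supply
```

The same check can be applied at each level of the decomposition, which is what allows performance hazards to surface at design time rather than at integration.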
#2: Engineering constraints are not well understood
The FCS systems engineers had an aerospace background. They knew that engineering requires you to understand constraints and design trade-offs. Meanwhile, the network people weren’t able to tell them what the performance constraints were. Where they did articulate them, they did so in terms of “bandwidth”, which failed to reflect timeliness constraints.
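A back-of-envelope illustration (the figures are assumed, not drawn from FCS) of why “bandwidth” alone misses timeliness: once serialisation stops being the dominant term, extra capacity barely changes when a packet arrives, while path delay dominates.

```python
# Assumed figures, purely illustrative: raising link speed 50x barely improves the
# one-way delay of a single 1500-byte packet once path delay dominates.
packet_bits = 1500 * 8
for link_mbps, path_delay_ms in [(2, 20), (100, 20), (100, 300)]:
    serialisation_ms = packet_bits / (link_mbps * 1e6) * 1e3
    print(f"{link_mbps:3d} Mb/s, {path_delay_ms:3d} ms path -> one-way ≈ "
          f"{serialisation_ms + path_delay_ms:6.2f} ms")
```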
The FCS project had a framework for dealing with aspects of the system such as use-cases and functionality decomposition, and it was using the state of the art in system design for these. Yet even on a massive project like FCS, there was no conceptual framework for dealing with the performance risks in a systemic way at the design stage. This created a lot of project risk.
For instance, the original design criteria meant that we were not allowed to take security costs into consideration. So all of that work later had to be redesigned, since these costs had not been part of the feasibility assessment. The assumption was that encryption was a simple bandwidth overhead, which ignored the many important timing changes it introduces.
Engineering to constraints is as much about how things fail when stressed as how they work when unstressed. The models were considering “bandwidth” and “priority”, but not timeliness and consequences of non-delivery. The development engineering (spread across multiple companies) tried to create a “black and white” world. Either “everything gets through” or they had to “kill the failed application”. There was no middle ground, and the real world is all middle ground.
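As an illustration of that middle ground, here is a toy queueing calculation (a textbook M/M/1 model with assumed rates, nothing to do with the actual FCS models): as load rises, the share of packets meeting a deadline falls gradually rather than flipping from “everything gets through” to “kill the application”.

```python
# Toy M/M/1 illustration (assumed rates, not FCS data): delivery within a deadline
# degrades gradually with load - the real world is all middle ground.
import math

service_rate = 1000.0   # packets per second (assumed)
deadline_s   = 0.010    # 10 ms delivery deadline (assumed)

for load in (0.5, 0.7, 0.9, 0.95, 0.99):
    # For M/M/1, the time in system is exponential with rate service_rate * (1 - load)
    on_time = 1.0 - math.exp(-service_rate * (1.0 - load) * deadline_s)
    print(f"load {load:4.2f}: {on_time:6.1%} of packets delivered within 10 ms")
```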
The state of the art then (as now) could not capture the residual hazards. The development supply chain did not capture (and hence could not surface) a plethora of systemic issues, including those around contention for shared resources (such as the network). The performance hazards were latent and would only have become manifest (hopefully) during system integration.
This was an important “ah-ha!”. The very system decomposition approach (and hence the overall contractual framework) did not capture the essential aspects that would permit the components to have a hope of integrating into a usable system. Again, we must emphasise that the system engineering was state-of-the-art – well beyond much that we see today (10 to 15 years on) in large-scale system developments.
It is this inability of system decomposition (design and build) approaches to capture the essential properties, including performance, that lies at the heart of many failed system and product developments. The hazards (which were there from the start and were being “baked in” during design and development) only mature at system integration or as deployment scales. It is costing global society billions if not trillions a year.
#3: Networking has a weak professional ethos
We were amazed that the data network people would engage in safety-of-life activities with little regard for systems engineering rigour. Note that this isn’t a telecoms-wide issue: the circuit-switched world had that rigour. No, it was a blind faith in Internet Protocol and the so-called “end-to-end principle”. This is like asking the fairies at the bottom of your garden to protect your safety-critical infrastructure. It is irresponsible, yet it is endemic.
These IP networking people were regarded as being “top of their game”. They knew their (limited) world very well, and some aspects of its failure modes. But they didn’t understand what to us were basic concepts, like “performance contours” and the trading space between packet loss and delay.
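By “performance contour” we mean the boundary of the loss-and-delay combinations an application can tolerate. A sketch with an invented linear trade-off (a real contour would be measured, not assumed; the budget and loss penalty here are placeholders) looks like this:

```python
# Illustrative only: an invented linear loss/delay trade-off standing in for a measured
# performance contour. The delay budget and the per-percent loss penalty are assumptions.
def inside_contour(loss_pct, delay_ms, delay_budget_ms=150.0, ms_per_pct_loss=30.0):
    """Accept an operating point if delay plus a loss penalty stays within budget."""
    return delay_ms + ms_per_pct_loss * loss_pct <= delay_budget_ms

print(inside_contour(loss_pct=1.0, delay_ms=100.0))   # True: a little loss is tradable
print(inside_contour(loss_pct=3.0, delay_ms=100.0))   # False: outside the contour
```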
Their cowboy swagger said to us “we’re the guys who know this networking stuff and can walk on water”. But they never had to prove they could walk on water. Everything depended on them performing miracles, and these turned out to be real miracles: the kind that require breaking the laws of physics!
So to them, failure was unthinkable. When they had done everything possible it was no longer their problem. Instead, it became an application problem. This assumed that an application mitigation existed for every performance issue, so they were no longer responsible for delivering the outcome.
They lived in a world of generating “success”, where dealing with “failure” was someone else’s problem. In a very real sense, they were baking hazards into the system.
#4: We live in an omnipresent cult of bandwidth
Where it was pointed out that there was no mitigation, the cry was “You’ve asked for too much, give me more resources!”. The only answer to operating in overload was to demand ever more bandwidth, which on a battlefield isn’t going to happen. The ability to drive the network beyond 100% load, and to schedule resources so that some outcomes are reliably delivered in preference to others, was simply inconceivable.
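To show that operating beyond 100% offered load is a scheduling question rather than a capacity one, here is a strict-priority allocation sketch (the traffic classes and figures are made up, and real schedulers are far richer): the critical class is delivered in full, and the shortfall lands on the lowest class.

```python
# Sketch only (invented classes and figures): at 130% offered load nothing can deliver
# everything, but scheduling decides *what* degrades. Strict priority protects the
# critical traffic; the shortfall lands on the lowest class.
def allocate(capacity, demands):
    """demands: (name, offered) pairs in priority order; returns delivered share per class."""
    remaining = capacity
    results = []
    for name, offered in demands:
        served = min(offered, remaining)
        remaining -= served
        results.append((name, served, served / offered if offered else 1.0))
    return results

for name, served, frac in allocate(10.0, [("command", 2.0), ("sensor", 6.0), ("bulk", 5.0)]):
    print(f"{name:8s} delivered {served:4.1f} of demand ({frac:5.1%})")
```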
Too often it was believed you could, through sheer computational power, resolve the impossible. The “thought universe” that these network element builders inhabited didn’t even acknowledge the constraints of the universe in which they worked. That was really scary.
What was even scarier was that the IP engineering crowd were used to a world where their “mistakes” could be covered up. Technological progress in data transmission had obscured the failure to make progress in basic engineering. They were, once again, relying on more bandwidth coming to their rescue for FCS (which in this case had a miracle as its prerequisite).
#5: We must engineer for an imperfect world
One thing that caused a lot of angst was that both the systems and network engineers were trying to make things perfect in an imperfect world. The Department of Defense knew that the world was imperfect, and hence things would fail. What they wanted was an indication of how much risk there was.
For example, the FCS engineers saw VoIP as so critical that no packets could be lost. But they didn’t realise that even if you lost a packet, the message could still get through. They lacked a basic understanding of application behaviour and the resulting transport requirement.
They also didn’t allow themselves to engage in simple mitigation strategies. For example, you could put noise in to make sure that the listener knew there was an audio dropout. Indeed, since it was push-to-talk voice, it was easy to do so.
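A minimal sketch of that mitigation (the frame size and sample values are assumptions, not the FCS design): when a push-to-talk frame is lost, substitute a short burst of low-level noise so the listener can hear that something was dropped rather than receiving unexplained silence.

```python
# Toy concealment sketch (assumed 20 ms frames of 8-bit audio; not the FCS implementation):
# replace lost frames with quiet noise so the dropout is audible rather than hidden.
import random

def conceal(frames, frame_len=160):
    """frames: list of audio frames (bytes), with None marking a lost packet."""
    filled = []
    for f in frames:
        if f is None:
            # low-amplitude noise around the 8-bit midpoint stands in for the lost frame
            filled.append(bytes(random.randint(124, 132) for _ in range(frame_len)))
        else:
            filled.append(f)
    return filled

received = [b"\x80" * 160, None, b"\x80" * 160]    # middle frame lost in transit
print(len(conceal(received)), "frames ready for playback")
```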
The challenge we all face is to describe “tolerable imperfection” at every conceptual level, and to reason about how these levels relate.
In the next and final part of the series, we will examine the consequences of these lessons on the telecoms industry of today.
To keep up to date with the latest fresh thinking on telecommunication, please sign up for the Geddes newsletter