This is the third and final part of a series of articles, the first two being The remarkable story of Future Combat Systems and Five key lessons from Future Combat Systems. In this article, Dr Neil Davies and Fred Hammond re-examine those five lessons from a different perspective: that of people doing performance engineering in the telecoms industry today.
#1: Network performance science is possible
We constantly read articles about how the telecoms industry is all about more bandwidth, and hardly ever about applications or customer outcomes. We have been chasing peak burst speed, and this increasingly comes at the expense of delivering actual user value. Bandwidth has consistently been used as the preferred alternative to performance engineering.
On FCS we created a new school of thought about performance engineering. This is a demand-led model, in contrast to the supply-led one of “bandwidth”. This new approach allows everyone to intensify the use of the networks they currently have. It allowed us to achieve demonstrably more predictable customer experience outcomes.
Regrettably, the industry has not yet engaged with this paradigm change. We believe that this is a major missed opportunity.
#2: Engineering constraints are not well understood
FCS was an environment of highly successful professionals who had literally done rocket science. Yet they couldn’t get out of their network providers the appropriate technical base on which to keep building their “rockets”.
As a result, they were building infeasible systems without knowing it. This is a retrograde step in human technological capability.
In telecoms today, we see a stream of performance failures or infeasible systems: RAN sharing, small cells, pseudowires, unified comms, IP voice transition, IPX, … it’s a long list that goes on and on.
These engineering failures are due to a lack of understanding of performance constraints. They have a cost to the user businesses, as well as to the telco bottom line.
#3: Networking has a weak professional ethos
One of our team recently went to the local grocery store, where the network was down. This meant there were no credit card payments. People were sent to the in-store ATM for cash. The ATM was connected to the same store network, and so was also down.
This small example points to a bigger problem: we need to up our game to proper engineering. Such a common failure mode should automatically raise a red flag. That requires us to reason in advance about feasibility. Too often we just unthinkingly build systems, and then let customers find out their performance envelope.
That was unacceptable in a military environment because lives were on the line, so people were forced to be accountable. Telecoms infrastructure is fast moving to the same level of criticality to society at large. However, telcos are not stepping up to construct the hazard-aware safety case. Nor are they being held sufficiently accountable for their failures.
That criticality is not just about connectivity, but also about performance and reliability. We see networks which lack redundant paths, and the impact of failure then becomes huge. At the moment nobody is regulating the overall big picture to make sure networks have at least the reliability we get from the national power grid.
#4: We live in an omnipresent cult of bandwidth
A lot of network performance problems result from one single issue: the wrong resource model is used to describe supply and demand. Packet networks create variability of timeliness of arrival, and “bandwidth” fails to capture those effects.
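As a rough illustration of that point, the sketch below pushes two hypothetical traffic patterns through a single first-in, first-out link. Both patterns offer exactly the same average load, so a “bandwidth” figure cannot tell them apart, yet the waiting time each packet experiences differs. The link model, the service time and the arrival patterns are all assumptions made for illustration; they are not taken from FCS or from any real network.

```python
def queueing_delays(arrival_times, service_time):
    """Return the waiting time of each packet at a single FIFO link.

    Each packet starts transmission when it has arrived AND the link
    has finished sending the packet before it.
    """
    delays = []
    link_free_at = 0.0
    for arrival in arrival_times:
        start = max(arrival, link_free_at)
        delays.append(start - arrival)
        link_free_at = start + service_time
    return delays


SERVICE_TIME = 1.0  # hypothetical: 1 ms to transmit one packet

# Both patterns send 10 packets in a 20 ms window: identical average load.
smooth = [2.0 * i for i in range(10)]   # one packet every 2 ms
bursty = [0.0] * 5 + [10.0] * 5         # two bursts of five packets

smooth_delays = queueing_delays(smooth, SERVICE_TIME)
bursty_delays = queueing_delays(bursty, SERVICE_TIME)

print("smooth: worst wait", max(smooth_delays), "ms")
print("bursty: worst wait", max(bursty_delays), "ms")
```

In this toy case the evenly spaced packets never wait, while packets at the back of each burst wait several packet times, even though the average load on the link is the same.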
An example of this today is in software defined networks (SDN). These typically use a central orchestration system to dynamically allocate “bandwidth” to different forms of demand. This resource model fails to capture the true effects of those allocations on customer experience, and limits the load that can be safely applied to SDN-enabled networks.
There are many other examples of how “bandwidth” lets us down. For instance, we commonly see it used as a metric in SLAs between telcos. The result has been a breakdown in the supply chain as “bandwidth” doesn’t compose as an engineering metric.
#5: We must engineer for an imperfect world
FCS took on board that there was a thing called “success”, and that there were hazards of it not happening. We can’t deliver perfect outcomes to everyone all the time. So we need to engage with failure.
When networks are stressed or go down, we don’t want society to come to a screeching halt. There is not an endless ability of applications and users to cheaply mitigate network failure. So we need to instead take some responsibility for the order in which things go wrong, so the “right” things fail.
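One way to picture “choosing the order of failure” is the toy sketch below. It assumes a deliberately crude model: each flow asks for some number of abstract load units, each has an importance rank, and a stressed network can carry only a fixed total. The flow names, the ranking and the greedy admission rule are all hypothetical, chosen only to echo the grocery store example; this is not a description of any mechanism used on FCS.

```python
from dataclasses import dataclass


@dataclass
class Flow:
    name: str
    demand: float     # abstract load units requested (hypothetical)
    importance: int   # lower number = more important to keep working


def admit(flows, capacity):
    """Admit flows in order of importance; whatever does not fit is shed."""
    admitted, shed = [], []
    remaining = capacity
    for flow in sorted(flows, key=lambda f: f.importance):
        if flow.demand <= remaining:
            admitted.append(flow.name)
            remaining -= flow.demand
        else:
            shed.append(flow.name)
    return admitted, shed


flows = [
    Flow("video streaming", 6.0, 3),
    Flow("card payments", 1.0, 1),
    Flow("emergency calls", 1.0, 0),
    Flow("software updates", 4.0, 4),
]

# A stressed network that can only carry 3 load units.
admitted, shed = admit(flows, capacity=3.0)
print("kept:", admitted)
print("shed:", shed)
```

The point of the sketch is only that the failure order is decided in advance from an explicit statement of demand, rather than being discovered by customers when the network is stressed.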
Until now we have been blasé and lacked the customer perspective. Suppliers tell customers “you get what you get”: it’s all rather Soviet. We’ve neither captured the necessary understanding of demand, nor have we yet built adequate mechanisms to create the resulting “right kind of failure”.