A reader wrote to me asking to explain why I assert that “networks are supercomputers”, and how we use the idea of “quality attenuation” to model the performance of such systems. He also asked me to relate these ideas to well-known results in the distributed computing literature about keeping information synchronised across distributed systems.
Everything we do online – email, web browsing, video streaming – is just a form of distributed computation. For example, when your browser is exchanging data with a web server in some giant data centre, two geographically separated CPUs are engaged in a collaborative process to enable a web page to be rendered and presented to the user.

The key problems of how to build distributed systems have been worked on and known about since the 1980s in supercomputing. These number-crunching machines enable algorithmic outcomes, like modelling the weather or car engines. Successful and timely algorithmic results are the product of making choices over sets of potential computations and communications. Computations can be done in series on one CPU, with the results then communicated elsewhere; or in parallel, with the results aggregated and integrated together.

These (future) possibilities for delivering results can be characterised by what is called a “serial/parallel flow graph”. This describes the set of results that could be computed by any future point in time. It can be considered the boundary of what is possible: an “information wavefront”. To steal a term from economics, it is a “production possibility frontier”. Anything beyond this frontier is infeasible; anything within it is feasible (if the right resource allocation choices are made).
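To make the flow-graph idea concrete, here is a minimal sketch (my own illustration, with invented task names and durations) that treats an algorithm as a graph of tasks and computes the earliest moment the final result could possibly be ready – the wavefront for that result, assuming unlimited parallelism:

```python
# A minimal sketch of a serial/parallel flow graph: tasks with durations and
# dependencies. The earliest possible finish time of the final task marks the
# "information wavefront" -- no scheduling choice can deliver it any sooner.
# All task names and durations below are hypothetical.
from functools import lru_cache

# Hypothetical tasks: name -> (duration in seconds, prerequisite tasks)
flow_graph = {
    "fetch_a":   (2.0, []),
    "fetch_b":   (3.0, []),
    "compute_a": (4.0, ["fetch_a"]),                 # serial: needs fetch_a first
    "compute_b": (1.0, ["fetch_b"]),                 # can run in parallel with compute_a
    "combine":   (2.0, ["compute_a", "compute_b"]),  # join point
}

@lru_cache(maxsize=None)
def earliest_finish(task: str) -> float:
    """Earliest time this task can complete, given unlimited parallelism."""
    duration, deps = flow_graph[task]
    return duration + max((earliest_finish(d) for d in deps), default=0.0)

print(earliest_finish("combine"))  # 8.0 -- the wavefront for this toy graph
```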
We always prefer results to be presented sooner rather than later. That means we sometimes want to begin processes earlier than we might otherwise have done. This is called “latency hiding”. For instance, on a freezing cold day you might turn on your car to de-ice several minutes before you plan to drive off, and fill in the time whilst it warms up by finishing your coffee. In a computer, you might pre-calculate a result that may never be used, but which speeds the computation past a bottleneck if it is needed.
There is always a certain maximum amount of concurrency to any algorithm’s execution, and that defines the potential for latency hiding. In other words, we can divert a certain amount of current computational and communication resource into preparing for possible future demands for results. However, this comes at a price: exploring branches of a flow graph that is possibly unbounded in size. This can impact the delivery of ongoing algorithmic results at or near the wavefront.
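As an illustration of latency hiding (a hypothetical sketch of my own, not from the article), the following starts an expensive piece of work speculatively on a background thread while the main computation proceeds, so that if its result turns out to be needed it arrives sooner; if not, the resource spent on it was wasted:

```python
# A hedged illustration of latency hiding: speculatively start work that may
# never be needed, so that if it *is* needed the result arrives sooner.
# All names and timings here are invented.
from concurrent.futures import ThreadPoolExecutor
import time

def expensive_lookup(key: str) -> str:
    time.sleep(2.0)                    # stands in for slow computation or I/O
    return f"value-for-{key}"

def main_work() -> bool:
    time.sleep(1.0)                    # the ongoing computation at the wavefront
    return True                        # decides whether the speculative result is needed

with ThreadPoolExecutor(max_workers=1) as pool:
    speculative = pool.submit(expensive_lookup, "maybe-needed")  # start early
    if main_work():
        # Only ~1s more to wait instead of the full 2s: the overlap hid latency.
        print(speculative.result())
    else:
        speculative.cancel()           # the work done speculatively is simply wasted
```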
In the 1970s and 1980s, a massive effort went into maximising algorithmic throughput and numerical processing by extracting ever more concurrency. In the parallel computing world, this was limited by the number of parallel processors. These machines had simple and stable interconnects, so the problem was relatively tractable to model.
Then data networking came along, and distributed computations were performed by computers connected together across distant locations. In doing so, we replaced those high-performance dedicated CPU interconnects with statistically-multiplexed packet networks. This changed two things: it introduced an unbounded number of processors, and new variability in the performance of the connectivity between them.
This in turn introduced new complexity in deciding how best to configure the computation resources and explore that serial/parallel flow graph of possibilities. You don’t want your computation to stall from a lack of inputs. Conversely, anything delivered earlier than just-in-time represents a greed for communication resources that just gets in the way of someone or something else.
So how can you deal with the nature of this distributed computing game as it gets more complex through the application of packet data networking?
We are interested in the constraints of all such systems, since we only want to attempt feasible algorithmic outcomes. We want tomorrow’s weather forecast computation to arrive well before tomorrow begins, and to know in advance that it will. That means trade-offs exist: we can’t ask for infinite detail whilst also demanding immediate delivery.
In making these trade-offs, we have to make very specific choices. Where should I compute what, and how much resource should I dedicate to latency-hiding for future demand? This demands some form of “internal” calculation within the (distributed) computer.
That means all networks are also supercomputers, since they do internal computation as well as translocation of information between external computations. Routing is a simple example of this. Radio-access networks have clever schedulers that calculate how best to allocate resources based on sensing of the radio environment. Software-defined networking (SDN) and network function virtualisation (NFV) are just the latest expression of this eternal truth: networks have internal computation needs of their own.
There are established results in the literature that discuss these constraints, such as the CAP theorem. This is a (weakly-expressed) way of saying there is a wavefront boundary between what is feasible and what is infeasible. What it essentially says is “causality exists”, and that you can’t overcome the arrow of time: the expanding cones of spreading information determine a wavefront. What we care about is its nature, and how best to exploit that flow graph.
The concept of “quality attenuation” (or ΔQ) gives us the means to put a (statistical) “scale” on the serial/parallel flow graph. This then lets us calculate “inside the supercomputer” what the optimal trade-offs are. With the right mathematics, we can take full control over the trading space, and optimise to any (feasible) set of requirements.
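As a toy illustration of what such a statistical “scale” might look like (my own sketch, not drawn from the referenced material; the distributions and numbers are invented), each stage in a serial flow graph contributes a delay distribution, and composing them by summation (convolution) lets us check whether a timing requirement is feasible:

```python
# A toy sketch of the quality-attenuation idea: give each stage a delay
# distribution and compose them to see whether an end-to-end timing
# requirement is feasible. Distributions and numbers are invented.
import numpy as np

rng = np.random.default_rng(0)
samples = 100_000

# Hypothetical per-stage delays (milliseconds), sampled independently.
network_hop_1  = rng.gamma(shape=2.0, scale=5.0, size=samples)
server_compute = rng.gamma(shape=3.0, scale=8.0, size=samples)
network_hop_2  = rng.gamma(shape=2.0, scale=5.0, size=samples)

# Serial composition: total delay is the sum of the stages
# (i.e. the convolution of their distributions).
end_to_end = network_hop_1 + server_compute + network_hop_2

requirement_ms = 100.0
p_on_time = np.mean(end_to_end <= requirement_ms)
print(f"P(delivered within {requirement_ms} ms) ~ {p_on_time:.3f}")
```

A fuller analysis would also account for loss and for parallel branches (where the slowest branch dominates), but the principle of composing distributions to test feasibility against a requirement is the same.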
This framing of networking sees distributed computing as a holistic activity: external computations, translocation of inputs and outputs between them, and internal network computations over how best to allocate these resources. With this framing in mind, we can begin to see how and why mainstream networking has conceptually lost its way.
When we transmit data over a network, we are trying to synchronise state between two (or more) computation processes. We can consider protocols like TCP in terms of space/time trades to create consistency and synchronisation of this state. Since TCP carries its protocol signalling over the data path itself, requiring a complete round-trip, it operates on the wavefront. That means there’s no computational “slack” in the system for latency hiding.
The TCP algorithm is a simple data producer/consumer system, with a variable buffer in between to hide the delay and ensure in-sequence delivery. (Loss is dealt with through retransmission.) It is trying to keep the buffer non-empty, but mistakenly focuses on keeping the producer’s buffer non-empty, rather than the consumer’s. TCP thus makes poor trades. (Indeed, no single consistency model works over the whole range of the wavefront.) This makes for inefficient and ineffective data networks.
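To make the producer/consumer framing concrete, here is a minimal, hypothetical sketch (not TCP itself; all names and timings are invented) of a producer and a consumer coupled by a bounded buffer. The figure of merit for a timely outcome is how long the consumer sits idle waiting for data, not how much data the producer has queued to send:

```python
# A minimal, hypothetical producer/consumer sketch (not TCP itself).
# The quantity that matters for timely outcomes is consumer starvation:
# the time the consumer spends idle waiting for data to arrive.
import queue
import threading
import time

buffer = queue.Queue(maxsize=8)   # the bounded "window" between sender and receiver
consumer_idle = 0.0

def producer() -> None:
    for item in range(50):
        time.sleep(0.01)          # stands in for network transfer time
        buffer.put(item)          # blocks if the buffer (window) is full
    buffer.put(None)              # sentinel: end of stream

def consumer() -> None:
    global consumer_idle
    while True:
        start = time.monotonic()
        item = buffer.get()       # blocks if the buffer is empty -> starvation
        consumer_idle += time.monotonic() - start
        if item is None:
            return
        time.sleep(0.005)         # stands in for processing the data

threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"consumer idle time: {consumer_idle:.3f}s")
```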
The reason it makes this basic and fundamental error is that its designers thought they were in the business of delivering packets, not enabling successful and timely distributed computations. That’s fine – as long as you don’t mind getting tomorrow’s weather forecast some time next week. The next revolution in networking is to put “timely distributed computational outcomes” as the objective, and apply the best ideas from supercomputing to make them happen.
For those who wish to learn a little more, I can recommend the following presentation: The Properties and Mathematics of Data Transport Quality [PDF].
To keep up to date with the latest fresh thinking on telecommunication, please sign up for the Geddes newsletter.