Many technical and policy debates about the Internet centre around its ‘end-to-end’ nature. This idea was first explicitly documented in an academic paper that many consider to be its ‘constitution’. This paper contains some truths, but is fundamentally flawed in its reasoning. This is a serious matter, because an expensive effort is being made to patch and fix those flaws.
The paper is titled ‘End-to-End Arguments in System Design’ and was first published in 1981. Various versions of its core idea have been published over time, both before and after, but the essence has remained the same. It is the intellectual foundation on which a world of ‘all-IP’ networking is being built. It describes an idea of ‘layers’ in networking (in a way that is already problematic). It then states that application ‘functions’ (at least, an ill-defined set related to networking) are by default best done at the layers where the applications run. These application layers are at the ‘ends’ of the network, hence the ‘end-to-end argument’.
This paper has been used for decades to justify the idea of a ‘dumb pipe’ network, and the resulting design of TCP/IP. Yet there is a general industry failure to appreciate the philosophical and practical limits of the end-to-end argument. This is causing the telecoms industry to head down a labyrinthine technological dead end, at high cost to society and the economy at large.
The historical context
The paper is a product of its time, and to comprehend its arguments we first need to understand its context. In the 1970s packet networking was still an emerging technology. Prior to such networks, every connection was a circuit. Initially these circuits were physically concatenated wires, with switches to join them. Later they became logical circuits, using time-division multiplexing (TDM).
In this circuit world, delay was effectively fixed by geography and the generation of transmission technology in use. Data loss in transit was deemed to be a network failure, and to be avoided. Hence networks were highly predictable in their delivery properties. However, they were also prohibitively expensive for computer data use, since they had to be kept largely empty when used for bursty flows.
The network pioneers of the time wanted to create a system that was more suited to general-purpose computer networking than circuits. They also wanted to fully exploit the resource sharing and cost savings of packet-based statistical multiplexing. These issues are well recorded in the history books. Whilst the Internet is famously descended from the US Department of Defense’s ARPANET, its closest technical ancestor is a French network from the early 1970s, CYCLADES. Its inventor, Louis Pouzin, noted in a 1998 interview:
“The ARPA network implemented packages in virtual circuit, our network of packets was ‘pure’, with independent datagrams, as in IP [Internet Protocol] today. IP took from the Cyclades the ‘pure’ packet and logical addressing and not the physical addressing which was used by ARPANET.”
So the ARPANET used an explicitly circuit-like networking model, and what Pouzin did was to recreate (some of) the properties of circuits using an end-to-end approach. At this point the network lost all concept of the flows it was carrying, and hence any vestiges of ‘circuitness’.
At that time, the chief concern of packet network architects was for their networks to be seen as a viable alternative to circuits. Since the nature of packet multiplexing was to induce variable delay, turning into loss under load, they were very sensitive to accusations of being ‘unreliable’. All the early packet data pioneers were still unconsciously circuit-like thinkers, competing against circuit technology, with a focus on maintaining ‘reliability’.
What it gets right
The paper describes the prevalent thinking of the time about how to generate circuit-like outcomes from unreliable data transports. It would not have become so famous if it were entirely nonsense, and it gets the following basic elements right:
- There are design issues over where to place functions in a distributed computing system.
- There are trade-offs in making such choices.
- If you want absolute certainty over some circuit-like outcomes (like confidentiality through encryption, or lossless flows), then you need to perform those functions end-to-end, rather than piecemeal link-by-link in a network.
Note that combination of absoluteness and incompleteness. This is the chink which opens into an intellectual chasm, into which billions of wasted dollars have gushed. Such waste will continue, until either the end-to-end edifice collapses, or the fractured foundations are fixed.
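To make the third bullet above concrete, here is a minimal sketch (in Python, with an invented payload) of what performing an integrity check end-to-end means: the digest is computed and verified only by the communicating applications, so the guarantee does not depend on anything the intermediate links do or fail to do.

```python
import hashlib

def sender_prepare(payload):
    # The sending application computes a digest over the whole message
    # before it is split into packets and handed to the network.
    return payload, hashlib.sha256(payload).hexdigest()

def receiver_verify(payload, digest):
    # The receiving application recomputes the digest after reassembly.
    # No intermediate node participates in this check, and no sequence of
    # per-link checks can substitute for it.
    return hashlib.sha256(payload).hexdigest() == digest

message, checksum = sender_prepare(b"file contents to transfer")
assert receiver_verify(message, checksum)
```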
A conceptual failure
There is a single, massive deficit in the paper: it fails to fully understand the true nature of packet networking.
Networks are machines that simulate proximity between objects that are in fact geographically dispersed. Following this pattern, the essence of packet networking is the translocation of data between distributed computational processes. This is everywhere and always a statistical ‘game of chance’. The computational application outcomes each user seeks depend on the timely arrival of data, which requires lots of good coincidences about packet interactions, and few bad ones.
Yet nowhere does the paper mention this essential stochastic nature of the networking beast! Hence the basic building blocks of the paper’s argument are deeply flawed.
The ‘functions’ it reasons about lump together incompatible issues; their only common feature is that each could potentially be implemented by some mechanism associated with the network. Indeed, the paper reasons about ‘functions’ without creating any kind of ontology to define them.
As such, there is no separation of concerns of computation (typically not the network’s job) from data translocation between computational processes (the core of what the network does). It confuses translocation issues (a statistical process) with other non-functional issues such as confidentiality of data in transit (not a statistical process).
These ‘functions’ are then seen as being absolute in nature: either they can be performed ‘completely and correctly’ at a layer, or not. The idea of creating statistical bounds on loss and delay is wholly absent. Since this is the necessary and sufficient condition for successful translocation of information, it is a fatal flaw: the very essence of statistically-multiplexed packet data networks is incorrectly modelled.
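For contrast, a statistical bound is a perfectly usable engineering object. A minimal sketch, using made-up delay samples rather than measurements from any real network:

```python
import random

# Hypothetical per-packet delays (milliseconds), drawn from a skewed
# distribution standing in for measurements of a loaded packet network.
random.seed(1)
delays_ms = [random.expovariate(1 / 20) + 5 for _ in range(100_000)]

def delay_bound(samples, quantile):
    # A statistical bound: the delay that a given fraction of packets beat.
    ordered = sorted(samples)
    return ordered[int(quantile * len(ordered)) - 1]

# "99.9% of packets arrive within X ms" is a statement an application can
# be engineered against; "packets arrive reliably" is not.
print(f"median delay:  {delay_bound(delays_ms, 0.5):.1f} ms")
print(f"99.9th pctile: {delay_bound(delays_ms, 0.999):.1f} ms")
```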
So if you’re relying on this paper to support your arguments about the future of the Internet and broadband, you’re standing on some very weakly-supported ground.
A rhetorical failure
We can now see why these mistakes were made.
The authors’ unfortunate starting assumption – a ‘dumb’ unreliable transport that just forwards packets blindly – forced them to answer the wrong question. Since statistical multiplexing was seen as being ‘unreliable’, everything became about how to use computation as a means of compensating for the perceived deficiencies in that reliability.
This repeated concern with ‘reliability’ emerges from the use of examples based on long-lived file transfers. It implicitly says that ‘information fidelity’ (i.e. an absence of data corruption) trumps timeliness – which is only true for those bulk data examples.
Hence the end-to-end argument is only effective for a world where your applications are BitTorrent and FTP. It is not a basis on which to build complex, modern, interactive multimedia applications with huge ranges of cost and quality needs. Using computation to compensate for failures in translocation is a losing game, because computation cannot reverse time.
The goal of statistical multiplexing is not to recreate circuits, or even to approximate the ‘reliability’ of circuits. What we really want is to deliver ‘good enough’ loss and delay for the application to work, across the full breadth of applications we want to run, whilst achieving high resource utilisation. To believe otherwise implies regression to dedicated circuits, and a reversion of computer networking to the state it was in during the mid-1960s.
A categorical failure
The underlying cause of this wrong focus on moving packets is a category error about the very nature and purpose of networking. The paper’s authors, reflecting unquestioned industry beliefs, treated packets as if they were physical things – like punched cards or tapes. Consequently, the network is seen to create value by ‘moving’ them. A ‘dumb’ transport (i.e. brawn, not brain) is deemed to create value by doing the ‘work’ of moving packets according to its best ‘effort’. Leaving a link ‘idle’ means not doing ‘work’; dropping data is virtually a sin – a sign of a bad worker.
The alternative (and better) approach is to see that value is only ever created by the timely arrival of data to enable successful computation. Shovelling data around for its own sake is not a source of value. What happens to any individual packet is essentially immaterial.
The packets are merely arbitrary divisions of data flows, and should not be treated like physical objects. They don’t ever ‘move’, but are repeatedly duplicated and erased. They aren’t individually precious, and can be recreated at no cost if erased without onward duplication. Transmitting a packet at the wrong time can create negative value, by jamming up the system for other packets. It’s fine to leave a link idle if moving a packet won’t contribute to a successful computation. Packet loss is not a form of failure, but rather a necessary (and oftentimes desirable) consequence of applying a load to a network. Some demand can be time-shifted, and loss is a good means of doing it.
Because of this category error, the paper advances from flawed basic assumptions, and so can never reach a sound conclusion.
A mathematical failure
The paper attempts to reason about how to architect distributed computation, without any real insight into the (mathematical) nature of the ‘game of chance’ and the resource being allocated. For example, the basic properties of statistical multiplexing – such as the conservation of loss and delay – are not examined. Indeed, there is no concept of the network being a finite resource, or consideration of what happens when that resource is saturated.
Furthermore, the existence of a trading space for allocating resources does not appear in the paper. There is no discussion of what happens when you take resources away from one application data flow, and give them to another. It fails to properly consider the nature and characteristics of application demand. Hence how to match supply and demand is completely absent from its thinking process.
Indeed, one of the simplest and most profound observations about networks – and one you won’t see in the textbooks – is that they are systems with two degrees of freedom. You have load, loss and delay: fix any two, and the third is set for you. As you might imagine, how to trade loss and delay within this structure isn’t mentioned, as the end-to-end mindset was all about avoiding loss entirely. In truth, loss needs to be seen as a first-class networking citizen, not a failure mode. Loss is neither inherently good nor bad; it just needs to be allocated appropriately.
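A toy simulation makes the coupling visible. The queue model, buffer size and traffic figures below are invented for illustration; the point is only that once load and the buffer (which caps queueing delay) are fixed, the loss rate is whatever it has to be:

```python
import random
from collections import deque
from statistics import mean

def toy_queue(arrival_p, service_p, buffer_pkts, n_slots=200_000, seed=0):
    """Toy discrete-time queue: in each time slot a packet arrives with
    probability arrival_p, and the head-of-line packet completes service
    with probability service_p. The finite buffer caps queueing delay, so
    loss must absorb whatever load the buffer cannot."""
    rng = random.Random(seed)
    queue, lost, arrived, delays = deque(), 0, 0, []
    for now in range(n_slots):
        if rng.random() < arrival_p:            # an arrival this slot
            arrived += 1
            if len(queue) < buffer_pkts:
                queue.append(now)
            else:
                lost += 1                       # buffer full: packet dropped
        if queue and rng.random() < service_p:  # a departure this slot
            delays.append(now - queue.popleft())
    return lost / arrived, mean(delays), max(delays)

for load in (0.5, 0.8, 0.95):
    loss, avg_d, worst_d = toy_queue(arrival_p=load * 0.5, service_p=0.5,
                                     buffer_pkts=20)
    print(f"load={load:.2f}  loss={loss:.4%}  "
          f"mean delay={avg_d:.1f} slots  worst={worst_d} slots")
```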
Any theory of resource allocation in networks that is blind to the basic structures the network is made from is ultimately doomed. It’s like doing chemistry without any understanding of protons, neutrons and electrons. The result is alchemy – an extrapolation of observational coincidences into an erroneous theory of no predictive value.
A definitional failure
You can see this incorrect thinking process play out in the paper’s canonical example: the ‘reliable’ delivery of a file. It says that ‘complete and correct’ error-free delivery can only be assured through computation at the ends (such as checksum processes).
This is true, but unhelpful.
The ‘reliability’ it refers to tragically conflates two distinct issues: error recovery and delay. Error recovery can be fixed with further computation. Delay cannot, unless your computer is also a time machine. Thus it misses the essential nature of the function of every network node, which is to make (good) choices over the allocation of loss and delay.
Additionally, the ‘absoluteness’ of ‘complete and correct’ reliability even gets this example wrong. Data corruption is effectively packet loss, and is thus a stochastic process. (Perfection is impossible, since the checking process itself is error-prone in any physical machine that a cosmic ray can pass through.) Adding error correction at the end points may be an unnecessary application complication. The ‘right’ approach depends on the system’s use and context.
Finally, it ignores the additional cost of doing this at the ends. This cost is not just computational. It means that any recovery from an error will now take longer than if it had been done at a single (error-prone) link. This trade-off may (or may not) be worthwhile. The paper wrongly assumes that error correction in the network adds unnecessary delay, whilst ignoring that error correction at the ends may add delay that causes application failure.
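A back-of-the-envelope sketch shows the shape of this trade-off. The hop counts, loss rates and link delays below are invented, and the model ignores timers, pipelining and queueing; it simply prices each recovery attempt at a round trip of whichever scope performs it:

```python
def expected_transfer_time(n_hops, per_link_loss, link_rtt_ms):
    """Crude expected time to push one packet across n_hops lossy links,
    comparing where the recovery function lives."""
    # End-to-end recovery: an attempt only succeeds if every link succeeds,
    # and every retry re-crosses the whole path.
    p_path_ok = (1 - per_link_loss) ** n_hops
    end_to_end = (n_hops * link_rtt_ms) / p_path_ok
    # Hop-by-hop recovery: each link retries locally until it succeeds.
    hop_by_hop = n_hops * link_rtt_ms / (1 - per_link_loss)
    return end_to_end, hop_by_hop

for loss in (0.01, 0.05, 0.20):
    e2e, hbh = expected_transfer_time(n_hops=10, per_link_loss=loss,
                                      link_rtt_ms=10)
    print(f"per-link loss {loss:.0%}:  end-to-end ≈ {e2e:.0f} ms,  "
          f"hop-by-hop ≈ {hbh:.0f} ms")
```

At low loss rates the end-to-end penalty is negligible; as per-link loss grows it balloons, which is exactly why the ‘right’ placement of the function depends on context.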
Their approach turns all failures into end-to-end ones: this maximises expense and latency, and minimises scalability. That is why DARPA is now seeking alternatives to the end-to-end approach in its call for a new approach to mobile ad-hoc networks. They know that end-to-end is a technological dead end, and specifically request that respondents do not follow its precepts.
An engineering failure
As we can now see, the basis of engineering is to make trade-offs in order to satisfy some design goal. The paper also mischaracterises the nature of the trades on offer when choosing between end-to-end and link-by-link approaches.
There is nothing to indicate the end-to-end approach is a universally appropriate default, or even the most general. Indeed, there are good reasons to suggest it is not: witness the horrible complexity of home networking compared to the simplicity of plugging a phone into a wall.
The right engineering trade-off has a context-sensitive solution, and today’s systems designers often fail to understand that. For instance, WiFi performs error correction locally on the link, whilst TCP overlays its own end-to-end recovery. The timing variability that the local link introduces creates undesirable interactions with TCP’s end-to-end control.
A more holistic systems approach is needed in future, where the components all work in collaboration to a collective goal. The trades may also be dynamic at run-time, rather than statically fixed in advance during system design. In this example, some flows may need link-level error correction, while others may not. That requires a different kind of network architecture.
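One hedged way to see the WiFi interaction mentioned above is to feed RTT samples into TCP’s standard retransmission-timeout estimator (RFC 6298). The RTT figures are invented and this is not a model of any particular radio or TCP stack; it only shows how link-layer retry jitter inflates the end-to-end timeout, and hence how long genuine losses go undetected:

```python
def rto_track(rtt_samples_ms, alpha=1/8, beta=1/4, k=4):
    """Run TCP's retransmission-timeout estimator (RFC 6298) over a stream
    of RTT samples and return the final RTO. (The RFC's one-second floor
    is ignored here for clarity.)"""
    srtt, rttvar = rtt_samples_ms[0], rtt_samples_ms[0] / 2
    for r in rtt_samples_ms[1:]:
        rttvar = (1 - beta) * rttvar + beta * abs(srtt - r)
        srtt = (1 - alpha) * srtt + alpha * r
    return srtt + k * rttvar

# Made-up RTT streams: a stable path, and the same path where one sample
# in ten is inflated by link-layer retransmission on the radio hop.
stable = [20.0] * 200
jittery = [60.0 if i % 10 == 0 else 20.0 for i in range(200)]

print(f"RTO on stable path:    {rto_track(stable):.0f} ms")
print(f"RTO with link retries: {rto_track(jittery):.0f} ms")
```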
Since there is no one-size-fits-all fixed design trade-off, the only viable way forward is to properly characterise the engineering trades that are on offer, and the mathematical and physical constraints they have to work within. Then our design and operational systems can help us make the right (dynamic) trades, supported by networks that contain appropriate mechanisms to enable them. Software Defined Networking (SDN) and Network Function Virtualisation (NFV) are initial steps in that direction.
A philosophical failure
The paper is also flawed at a deeper level. It includes a get-out clause in its core thesis, to acknowledge that the end-to-end approach is not always appropriate, because design trade-offs exist:
“Sometimes an incomplete version of the function provided by the communication system may be useful as a performance enhancement.”
This is a fateful sentence in the history of science and technology, if ever there was one. Note how the paper fails to discuss the criterion for this divide, or the trade-offs involved: “sometimes… may be”.
That means the paper fails to describe any invariant property of networking. An invariant is something that is unconditionally true (even if only within the context of a model). When Einstein stated “E = mc²”, it didn’t have an asterisk saying “only tested during weekdays”. That means you can use the model to make decisions about the world.
Yet the ‘end-to-end argument’ is often referred to as a ‘principle’, when it is no such thing. It cannot be relied on as a form of truth, since the domain to which the model applies is not defined, and no invariant is stated.
Hence even on its own terms, it is an intellectual failure. It is a product of a craft approach to networking, where you tinker until things kind-of seem to work. It is not a product of science – especially since its basic hypothesis is so easy to disprove by counter-examples.
It pre-assumes an IP-like network, and then tries to reason about the nature of network architecture from that point. It is, in effect, a circular argument that post-rationalises the bad decisions on which Internet Protocol is built.
A logical failure
There is an even deeper level of reasoning error in the paper. Remember how those problematic ‘layers’ appeared, without definition? What the paper assumes, without any reflection, is that ‘ends’ exist.
What if there are no ‘ends’ in networks?
The nature of networking is to perform inter-process communication – i.e. the translocation of data between computational processes. This happens at a variety of scopes – from within a single multi-core CPU, to across the globe. Each scope maps to a layer, and they all have the same basic functions: divide up a data flow, send it (using the function of a narrower scope), and re-assemble it. This suggests a recursive architectural structure, without ‘ends’. (To read more on this, see this presentation by John Day.)
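A minimal sketch of that recursion (not Day’s formal model; the scope names and sizes are invented) shows the same divide-send-reassemble function repeating at every layer, with no privileged ‘end’:

```python
class Scope:
    """A toy recursive layer: every scope divides a flow, delivers the
    pieces via a narrower scope (or a physical hop at the bottom), and
    reassembles them. The widest scope is just another instance of the
    same function, not a special 'end'."""
    def __init__(self, name, mtu, inner=None):
        self.name, self.mtu, self.inner = name, mtu, inner

    def deliver(self, data):
        fragments = [data[i:i + self.mtu] for i in range(0, len(data), self.mtu)]
        carried = [
            self.inner.deliver(f) if self.inner else f   # recurse, or 'transmit'
            for f in fragments
        ]
        return b"".join(carried)                          # reassemble

# Scopes nested from a global flow down to a single link (made-up sizes).
link  = Scope("link",     mtu=64)
site  = Scope("site",     mtu=256,  inner=link)
world = Scope("internet", mtu=1024, inner=site)

payload = bytes(range(256)) * 20
assert world.deliver(payload) == payload
```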
In contrast, the end-to-end approach assumes a ‘flat Earth’ network, in which there are definitive ‘ends’. That is a source of fundamental dissonance between Internet Protocol and the apparent natural structure of networks.
As you can now see, at pretty much every step the paper falls into deep and subtle traps. As a historical guide to what people were thinking forty years ago, it’s fine. As a map to the future, it’s worse than useless.
A practical failure
The end-to-end arguments simply don’t hold up in reality. So much so that today’s networks don’t remotely resemble what is described by the paper, because what it says is untenable in practice. Yet the paper is venerated by many Internetorati as if it were handed down on sacred tablets.
The end-to-end nature of the Internet has encouraged its viral adoption, but that doesn’t mean it is good engineering. The Romans built their plumbing from the eponymous material (i.e. lead), and we know what that expedient choice did to them. We’re still striving to get rid of the stuff, two thousand years later! The ways in which these systems fail, and the costs and hazards they impose on users and society, really do matter.
A common false belief about the end-to-end argument is that it is either necessary or sufficient to ensure the generality and generativity of networks. This is untrue, no matter how much many people want to believe it. It results in expensive and unreliable systems that fail to support many potentially valuable applications, and excludes many cost- and quality-sensitive users and uses. This is rather ironic, given the paper’s initial focus on reliability and on gaining a cost advantage through statistical multiplexing.
So what and what next?
Moving forward, we need to work with the reality of packet networks, not the fairy tale of networking that has been spun for decades. Packet networks are inherently statistical objects, based on space-time multiplexing. This is not going to change. The core driver of telecoms and IT is a move to virtualised and distributed computing systems. The story of the last half century has been to better exploit the statistical gain from sharing the underlying physical computation and transmission resources. This is not going to change, either.
In future, packet networks will have to support an ever-richer and more diverse demand from increasingly complex distributed applications. Many will be critical to the functioning of society: smart grids, home healthcare, teleworking, automotive services and more. To continue getting the benefits of packet networking, and for these applications to be feasible, we must align with the unchanging constraints of those statistical processes.
Facing reality is not optional: all alternative paths will end in failure.
That means the Internet’s intellectual foundations are fractured – from end-to-end. It isn’t the general-purpose data transport that society needs. Substantial changes are going to have to happen if it is to continue to grow in value and use. We need to create a new and more general distributed computing architecture, one bolted to the bedrock of mathematics and philosophy. In the meantime, a lot of money is going to be spent on underpinning the foundations of what we’ve currently got to stop it falling over.
Given these fundamental issues, two questions naturally arise: ‘what are the foundations we should adopt for the future?’ and ‘given this is where we are at, where do we go next?’.