Your capacity planning rules are backwards (so let's fix them)

Every single telco has got the cause and effect between its network metrics and user experience backwards. Fixing this can easily save a lot of money.

The splendid people at South Western Railway have recently introduced some upgraded trains on my local commuter line into London. For the first few days last year they even had a delicious industrial solvent “new train smell”, before the inevitable “public pong” took over.

Each carriage now has display boards to tell you the service status of underground lines, the name of the next station stop, and where on the train you have the best chance of finding a seat. Keeping track of the seat capacity use of a train is no simple matter, and apparently it is done using a collection of sensors and heuristics to give an approximation to reality. The self-loading passenger cargo is also self-organising, once given this important data.

Telecoms networks have a rather different problem of capacity management. Packets aren’t people, and don’t intelligently organise themselves. We are responsible for designing systems to timetable the transmission, sense when the network is overcrowded, and guide packets to the tracks where the delays and cancellations are at least tolerable.

There’s one itsy bitsy problem, however, with how we’ve gone about it. All of the capacity planning rules of all telcos are backwards. Yup, every darn single one. It’s a side-effect of an immature industry still figuring out the basics of quality management and performance engineering.

So why is this happening, and not better known? Some years ago I wrote about how over-provisioning bandwidth doesn’t solve QoE problems and included the following under-appreciated chart from Kent Public Service Network.

The chart compares traditional link capacity utilisation data (green lines) with the user application performance failure risk (red crosses). These “QoE failure risk” metrics are derived from ∆Q-based measures. These capture instantaneous network performance, not periodic averages.

As you can easily see, there is a strong correlation between the network being “hot” and the user experience being at risk. The traditional capacity planning causation rules of telcos say “if the network is on average too hot, then the experience is at unacceptable risk of being poor, so add more capacity”. It’s “obvious”, right? But…

This KPSN chart offers hard evidence that this is not necessarily the case: there are many times when the network is “hot” and the experience is not at risk; conversely, there are times when it is at risk, and the network is not “hot” at all, but stone “cold”. So correlation is not causation in the direction that is commonly supposed, meaning that a lot of money is being spent upgrading networks unnecessarily.

What is actually true is the reverse: when the user experience is at risk, then there must be queues forming on a frequent basis. This in turn means it is more likely that the network is on average “hot”. After all, a queue forming is just the instant version of “heat”; every link is always being run at either 0% or 100% use.
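To see why “heat” is only a time-average of an on/off signal, here is a minimal Python sketch. The busy/idle trace and the `windowed_utilisation` helper are made up for illustration; they are not taken from the KPSN data or from any router's counters.

```python
def windowed_utilisation(busy, window):
    """Average a 0/1 busy trace over consecutive windows of `window` ticks."""
    return [sum(busy[i:i + window]) / window
            for i in range(0, len(busy), window)]

# Hypothetical trace: the link transmits for 10 ticks, then sits idle for 90.
busy = [1] * 10 + [0] * 90

instant = windowed_utilisation(busy, 1)    # one value per tick
coarse = windowed_utilisation(busy, 100)   # one value for the whole period

print(sorted(set(instant)))  # [0.0, 1.0]: every instant is idle or flat out
print(coarse)                # [0.1]: the average reports a "cold" 10% link
```

The same trace looks like a link that is flat out for a stretch, or like one that is barely used, depending only on the averaging window.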

It turns out that what matters more is the pattern of arrival of the “passengers”, and not merely how loaded each “line” is over comparatively long periods. This should come as no surprise to any rush hour commuter, but appears to be a novel insight in network resource planning.

On the Waterloo to Reading line it is, on average, easy to find a seat, but at peak hour instants at Clapham Junction the overcrowding is so bad that you can’t even board. On the other hand, the Waterloo to Weybridge via Hounslow line is rarely completely mobbed that way, but it is often very busy, and a seat is hard to find between Waterloo and Putney.

All customer journeys are unique, being the cumulative passing of instants over a period; no customer exactly experiences your average management statistics. Yet telcos are universally managing capacity by the average, not the (statistically summarised) instants. As the data shows, there is a difference, and it does matter.
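A toy simulation makes the gap between the average and the instants concrete. This is a sketch under stated assumptions, not the ∆Q method behind the chart: the link sends one packet per tick, both arrival patterns are invented, and the `risk` function (the share of packets waiting longer than a delay budget) is a stand-in for a real QoE failure risk measure.

```python
from collections import deque

def simulate(arrivals):
    """Push per-tick arrival counts through a FIFO link that sends one packet per tick.

    Returns (delays, utilisation): each packet's wait in ticks, and the
    busy ticks divided by the length of the arrival trace.
    """
    queue = deque()   # arrival tick of each waiting packet
    delays = []
    busy_ticks = 0
    t = 0
    while t < len(arrivals) or queue:
        if t < len(arrivals):
            queue.extend([t] * arrivals[t])
        if queue:
            delays.append(t - queue.popleft())
            busy_ticks += 1
        t += 1
    return delays, busy_ticks / len(arrivals)

def risk(delays, budget):
    """Hypothetical risk measure: fraction of packets that waited more than `budget` ticks."""
    return sum(d > budget for d in delays) / len(delays)

# Two invented patterns, each with 50 packets over 100 ticks.
smooth = [1, 0] * 50               # one packet every other tick
bursty = ([10] + [0] * 19) * 5     # ten packets at once, every 20 ticks

smooth_delays, smooth_util = simulate(smooth)
bursty_delays, bursty_util = simulate(bursty)

print(smooth_util, bursty_util)                        # 0.5 0.5
print(max(smooth_delays), max(bursty_delays))          # 0 9
print(risk(smooth_delays, 5), risk(bursty_delays, 5))  # 0.0 0.4
```

Both “lines” report the same 50% average load, yet only the bursty one leaves packets queueing past the budget. The average cannot tell them apart; a summary of the per-packet delays can.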

It is a bit like the relationship between weather and climate. It’s often hot in Alaska in July, but it’s not a hot climate. But if it is commonly hot weather each day, then you’ll have a hot climate, like Singapore’s. The user experience is a weather phenomenon, whereas current capacity planning rules are using climate statistics. These don’t tell you what the “packet passengers” are experiencing on their journeys. On average in Alaska you don’t need a fur coat, but that’s not helpful journey management information.

In telecoms, capturing the averages is easy and virtually free: it’s a piece of data every router offers. In contrast, “sensing” the individual packet “journey” turns out to be hard work: so many whizz past so quickly, there are so many “junctions” to monitor, and the “packet trains” are heading in so many directions. The signallers at South Western Railway would soon be calling a strike if their passenger workload increased to packet network levels!

If you are a telco CFO reading this, then you may be wondering just how much capex you are wasting on unnecessary capacity upgrades, or conversely, how much revenue you are losing to churn because necessary upgrades are not being made. The good news is that “sensing” the packets and quantifying the “bad journeys” is a solved technical problem. Furthermore, it has been applied at other network operators to save oodles of money by optimising capacity planning rules, so it is proven in use.

Your board seat ambition may just depend on acquiring some upgraded sensing equipment for the experience you are delivering. If so, hit “reply” to get in touch, and we can schedule a phone call to discuss how to work together. Or maybe we can even meet in London — I’ll take the 10:08 train coming from Windsor, as there are always plenty of seats on that one…