2008/01/30

The growth of the Internet



Every conduit can take a 144-fiber cable

Thanks to the virtually infinite bandwidth of fiber—or rather, the empty tubes that can hold multi-fiber cables—the Internet has continued to get faster for almost four decades, with congestion only proving a temporary condition. But now, it's again uncertain if the routers that sit at the ends of these fibers will be able to keep up with the growth of the Internet.

Most of us take it on faith that if a computer creates an IP packet and sends it out with the right IP address in the "destination" field, the packet will get delivered to the computer holding that address. Anyone who's ever used the traceroute program ("tracert" on Windows) can attest to the fact that there are usually a good number of routers between any source and destination connected by the Internet.

So how do these routers know which path any given packet should take? Unsurprisingly, routers keep a list of which address block needs to be sent where; it's called the routing table. As with all things Internet-related, these routing tables have been growing steadily, and even multiply in accordance with Moore's Law in the coming years. Unfortunately, in engineering, it's often the case that something that works well on a small scale doesn't work on a (much) larger scale. So "scalability" is always a concern—doubly so for anything related to the Internet with its unprecedented growth rates.

Building routing tables with BGP


Conduits waiting to be put in the ground

First, a little background. When a router starts up at first, it only has local destinations in its routing table. With the help of routing protocols, though, routers are able to build a complete table so that they know where to send packets to all possible destinations.

There are several routing protocols that organizations can use locally in their own networks, but only one that spans networks that belong to different organizations. This is BGP, the Border Gateway Protocol. Back in the day, "gateway" was a term used for what we now call a router, which makes BGP "the protocol that runs between routers on both sides of the border between neighboring networks."

Currently, nearly a quarter million address blocks are injected into BGP. This means that the large routers at the core of the Internet must search through at least a quarter million address blocks for every single packet that they forward. This can easily reach one or two million packets per second for each 10 Gigabit Ethernet interface in such a router, given the Internet's average packet size of around 500 bytes.

Overcoming DRAM slowness

Internet routers maintain two (sometimes more) distinct tables: the RIB and the FIB. The Routing Information Base contains the information exchanged through BGP. In most cases, there are several ways to get to a certain destination, and the RIB tends to hold several of these pathways for each address block, along with the information that is necessary to suppress routing loops. This table is kept in DRAM and maintained by a regular CPU running software.

The Forwarding Information Base is a table that's highly optimized for looking up the next hop for a destination during the forwarding process. A routing table lookup needs at least a couple of memory cycles, which are highly unpredictable from one packet to the next—bad for caching—so in order to work at 10Gbps speeds, the FIB needs to be held in faster and more expensive SRAM. Alternatively, routers may use Content Addressable Memory (CAM) to quickly find information in the FIB. However, CAMs, traditionally used in Ethernet switches, are notoriously power-hungry at high speeds.


Scaling to larger and larger routing tables

For some time now, the community of "inter-domain routing" operators has been concerned about the growth in both the size of the global routing table and the growth in the number of routing updates that routers must process.

This has happened before: in the early 1990s, the number of class B address blocks (with 65,536 addresses each) started running out, so organizations started using a number of class C blocks (256 addresses each). However, this meant that every new organization that connected to the Internet needed four to sixteen routing table slots—one for each class C block—rather than a single one for a class B block.

The routers of the time could only handle a few thousand routes, which quickly became a problem. In 1993, the main router vendors were just implementing a new version of BGP, which doesn't look at the traditional class A/B/C addressing model but can carry address blocks that are any even power of two in size.

Around 1997, there were again problems as buggy BGP implementations caused "flapping" where BGP sessions kept going down and coming back up, using up CPU time in routers around the world. Back then, "flap damping" saved the day.

Today's and tomorrow's problems

A little over a year ago, the Internet Architecture Board (IAB), which guards the architectural underpinnings of the IETF's work, organized a workshop about this topic: the routing and addressing workshop (RAWS). Earlier this year, the report from the workshop was published as RFC 4984. Let me quote a bit:

One surprising outcome of the workshop was the observation made by Tony Li about the relationship between "Moore's Law" [ML] and our ability to build cost-effective, high-performance routers. "Moore's Law" is the empirical observation that the transistor density of integrated circuits, with respect to minimum component cost, doubles roughly every 24 months. A commonly held wisdom is that Moore's law would save the day by ensuring that technology will continue to scale at historical rates that surpass the growth rate of routing information handled by core router hardware. However, Li pointed out that Moore's Law does not apply to building high-end routers as far as the cost is concerned.

However, Moore's law has increased the amount of transistors companies like AMD and Intel can put on a CPU die so much that it's now more efficient to put the cache there as well, so fast SRAMs are no longer a commodity that goes into every PC. Those SRAMs are now a specialty product that isn't particularly cheap or developed particularly fast. So rather than buy commodity chips on the open market, router vendors are now forced to bankroll the development of hardware that supports faster FIB lookups. With the jump to 40Gbps and 100Gbps coming in the next years, this will be a challenge—a challenge that the biggest router vendors can probably meet, though the resulting products won't be cheap.


In the meantime, the RIB is having some trouble of its own. Not only are there many network operators that inject more prefixes (BGP lingo for address blocks) into the global routing system than are strictly necessary, but there are also a few that insist on sending huge numbers of updates every day. Apparently, their links keep going up and down and up again and down again all day.

The design of BGP is such that when this happens on one remote corner of the planet, every other BGP router in the world has to process these updates. Both the size of the table and the number of updates grow at healthy rates, while at the same time DRAM speeds only increase marginally every year. This means that at some point, the route processors can no longer get at the RIB fast enough to perform all the updates in real time.

But didn't we have flap damping for that? Yes and no. It turns out that flap damping can actually be harmful if it's deployed in an uncoordinated fashion. And if there's one thing that's hard to do, it's coordinating (herding cats, in IETF parlance) the network operations people that are in charge of global inter-domain routing.

The source of the problem

It's not entirely clear where the growth of the global routing table comes from. If you dust off your telnet client and connect to route-views.oregon-ix.net, you can log in to a publicly-available Cisco router that gets copies of the BGP data from a large number of ISPs, which you can inspect using the command "show ip bgp." (Also try "show ip bgp

".)

If you do that, you'll often see consecutive address blocks with identical properties, which could just as easily be expressed as a smaller number of larger blocks. So apparently, a large part of the size of the Internet's routing table is essentially pollution from people who can't be bothered to clean up after themselves. In other cases, the blocks ultimately lead to the same destination, but through different intermediate networks. In those cases, the destination in question is doing "traffic engineering," balancing the flow of traffic over the available links towards the Internet that are most favorable.

ISPs tend to inject a number of different blocks into the global routing system because, as they grow, they obtain new address space. However, the number of ISPs is stable or even declining.

There are also organizations that aren't ISPs who have their own presence in BGP. Normally, end-users ride along on the BGP advertisements from their ISPs, but "multihomers," organizations that connect to the 'Net through two or more ISPs, have to do this themselves.

Finally, some organizations are present in BGP just because they can. This way, they get to take their IP addresses with them when they switch ISPs, and avoid the hassles and costs involved with renumbering their network. Both traffic engineering and multihoming are often cited as primary reasons the global routing table keeps growing, but it's anybody's guess to what degree both of these will contribute to future growth.

Solving the problem

So what's an Internet engineer to do when her tables grow by 16 percent a year while RAM speed only grows by 10 percent annually? Two things, mostly: either complain that the previous generation of engineers didn't do a good job, or (if she's part of that generation herself) point out that she proposed the perfect solution a decade ago but nobody listened. That's probably a bit unfair, but it's not without a kernel of truth.

However, both the IETF and its research-focused sibling the Internet Research Task Force have studied the problem as a whole or certain aspects of it over the past decade. When IPv6 was developed, this was seen by many as an opportunity to fix the routing scalability problem as well. However, the argument that you can only make so many changes at once won out—along with the fact that back then there was no easy way to solve the routing issue, either. A few years later, Mike O'Dell wrote up the famous "8+8" or GSE proposal. The idea behind it is to allow routers to rewrite the upper 8 bytes of the 16-byte IPv6 address and hosts only look at the lower 8 bytes. This addresses multihoming, traffic engineering, and provider independent addressing. However, the proposal was never developed any further and suffers from a number of issues.

Early this century, the IETF started the "multi6" working group that was chartered to look at scalable multihoming for IPv6. The fear was that with the increased IPv6 address space, multihoming could become a big problem. The job of multi6 was to explore ways to do multihoming that don't tax the routing system. After many years and numerous proposals, this led to the shim6 working group, which is now about to publish specifications that allow a "shim" layer in the TCP/IP stack to dynamically move ongoing communication from one address to another. Under this plan, a computer might have an address from ISP A and an address from ISP B. When it's communicating over ISP A, but that ISP goes down, shim6 simply moves the communication session to the address from ISP B. Shim6 could be useful for home networks and small offices, but operators of larger networks don't see much to like in this approach.

Currently, the IRTF's Routing Research Group (RRG) is evaluating a number of proposals, with most of them falling under the "jack up" moniker. The jack up class of approaches basically takes all the addresses used by end-users and removes them from the current inter-domain routing system. These addresses are then tunneled through the core of the Internet; devices that sit close to the user/ISP boundary encapsulate packets to provider independent addresses into another IP packet. This IP packet is then sent to an equivalent device close to the user/ISP boundary at the destination's end, where the original packet is decapsulated and sent on to its intended destination.

By hiding the addresses that the users see from the big core routers in this way, it's possible to make those core routers faster and cheaper. However, it does mean it's necessary to deploy encapsulation/decapsulation devices at the edges of ISP networks (or even in end-user networks) everywhere. This would be the first big change in Internet routing in a decade and a half. But if it doesn't work, perhaps some more cat herding will do the trick.

No comments: