In the not-so-Wild West of FCoE standards negotiation, the two big switch juggernauts – Cisco and Brocade – agreed to a compromise that will affect the way that traffic moves across networks. The good news is that the dispute was a major obstacle to FCoE having a chance at becoming a valid contender for networking in the data center, and it is now out of the way. So now we ask, “So what?”
In reality, this is very good news. FCoE shows incredible promise, but it is by no means a “done deal.” If you’re reading this then chances are you have a good idea of what FCoE is. I’m going to assume that you understand the basics of FCoE and talk about the issues facing the data center administration team, though hopefully this will still make sense to those who are new to the protocol.
Over the past 20-odd months I’ve been talking to many, many people about FCoE and what it means to the data center – hundreds of C-level executives, server, storage and network administrators, budget-conscious procurement people, and systems integrators. For good reason they are interested in the potential cost savings that running Fibre Channel over Ethernet can bring, as well as the performance benefits (especially as Ethernet hits 40Gb and 100Gb speeds).
One of the things that they’re concerned with, however, is that Fibre Channel (FC) is known to be safe and reliable, while Ethernet is not. What happens if you take a safe-and-reliable protocol and place it on a traditionally unreliable network?
Ethernet doesn’t have to be unreliable, fortunately. In fact, Ethernet has a little-used feature called “pause” which allows traffic to be treated exactly the same way that Fibre Channel treats it: that is, it is possible to keep traffic moving in the nice, orderly line that it’s supposed to.
Think of it like ducks crossing the road.* In the world of storage networking, you want to make sure that the ducks arrive at the other side in the same order as they started. If they get out of order then the procession stops.
The big problem has been what happens when there are multiple roads to cross. Essentially Cisco wanted the Mama duck to be in the front of the line, while Brocade wanted her to be at the back.
Cisco’s argument was that with the Mama duck in the front, carrying the information about how long the line of baby ducks (i.e., the rest of the bits) will be, the switch would know how to forward the frame and could immediately direct her to the next road, even before the line finished crossing the first one. This would be faster than waiting until the entire line had crossed the road before permitting her to move onward.
Performance boosts are a good thing, to be sure, but Brocade argued that this might introduce some problems. For one thing, it might prevent the switch from effectively buffering the frame (since it would be segmented, with some of it going out while some is still coming in). This can become problematic: the switch must send a message back to the sending switch if something goes wrong, and with the ducks spread across multiple roads it would take extra time to determine whether there is a problem in the first place. So much for the performance boost.
Brocade argued for Mama duck to be in the back, then. When Mama duck arrives across the road the switch will then forward the entire line to the next road. The idea here is that if there is a problem with the frame, then there isn’t a risk of the problem being spread across links to more than one switch. Why is this key?
Well, early adopters and testers of FCoE have noticed that the delay between the time a frame is corrupted and the time the pause notification is sent is significant enough to seriously hinder massive scale-outs. In other words, at this point in time it simply isn’t feasible or recommended to hook multiple switches together and pass traffic across them.
Think about it for a second: if you have our friendly line of ducks spread out over several roads and several switches, who becomes ultimately responsible for handling problems? How far back must our messenger go when there is a problem?
I confess that when I first heard about the controversy I was on Cisco’s side. I thought perhaps Brocade might be stalling a bit in order to catch up with Cisco’s Nexus line, but in conversations I have had with both Cisco and Brocade people I have come to realize that Brocade’s methodology is likely the saner one in the long run.
The simple truth of the matter is that data center customers need to be able to rely on FCoE just as much as they rely on FC, and there’s no way to do that unless we can start to scale the build-outs. Brocade’s method – which was the outcome of this part of the compromise – looks to be the best way of ensuring data integrity in large systems.
Now that this detail has been ironed out we can expect FCoE to start moving beyond the Early Field Trial (EFT) phase.
* Please forgive the imperfect metaphor; for those people who just wanted to know the “so what” behind this news I didn’t see much of a reason to get mired in the inevitable tangent of being too technical.