Gartner on FCoE. Whoa There, Sparky

I’m too cheap to shell out the $195 (that’s $20/page!) for the latest Gartner report on FCoE, but there have been enough reports on the report that it’s not difficult to see the gist is consistent with Gartner’s long-running hate-affair with the nascent technology. I’m the first to admit that FCoE isn’t a panacea for the data center’s myriad array of issues, but Gartner seems to be far too willing to play the convenient role of grumpy old man and needs to be placed into some perspective.

As I said, I’m too cheap to shell out $200 for a 10 page report, but fortunately several trade rags have provided direct quotes of some of the hot topics which allow me to address some of the more interesting points.

Define Your Terms

Storagenewsletter points out that Joe Skorupa, research vice president at Gartner, is a wee bit skeptical about what a converged data center core can really mean, but Skorupa seems to mix and match his terms:

“The industry is abuzz with the promise of a single converged network infrastructure, this time in the data center core,” said Joe Skorupa, research vice president at Gartner. “Alternatively described as Fibre Channel over Ethernet (FCoE), Data Center Ethernet (DCE), or more precisely, Data Center Bridging (DCB), this latest set of developments hopes to succeed where InfiniBand failed in its bid to unify computing, networking and storage networks.”

There are two issues that I have with Gartner’s statement. First, FCoE != DCB. Data Center Bridging (or Converged Enhanced Ethernet, or Data Center Ethernet – pick your poison) refers to a set of Ethernet modifications which, in turn, set the stage for Fibre Channel over Ethernet. In other words, FCoE depends on DCB as its underlying set of standards, but you do not need to have FCoE in order to have or use DCB.

Not all DCB is FCoE

Why is this important? Because understanding the nature of FCoE as it relates to the broader spectrum of Ethernet standards can mitigate confusion. If Gartner (or anyone) conflates the two it makes it easier to make criticisms that apply to one but not the other, and since Gartner begins with this erroneous assumption we have to wonder if the criticisms adequately apply to FCoE specifically, or DCB in general.

The second problem I have with this is the framing. Yes, there is a very positive potential for converged networks in the data center core; if there weren’t, why would the industry be abuzz? But Gartner claims that the “buzz” is about the core “this time,” as if to say that vendors have failed to capitalize on any earlier promises and are now attempting to shift focus elsewhere.

The potential, or promise, for a converged core is still over a year away, and despite bloggers, pundits, and critics claiming that vendors have been all a-gaga about this, I have yet to see an actual vendor make any claims about current promises of FCoE that have been broken.

Learning How to Count

Then we get into the crux of some of the issues:

“The promise that a single converged data center network would require fewer switches and ports doesn’t stand up to scrutiny…This is because as networks grow beyond the capacity of a single switch, ports must be dedicated to interconnecting switches. In large mesh networks, entire switches do nothing but connect switches to one another. As a result, a single converged network actually uses more ports than a separate local area network (LAN) and storage area network (SAN). Additionally, since more equipment is required, maintenance and support costs are unlikely to be reduced.”

It is true that one of the major hassles with large mesh networks is that you can often find yourself dedicating entire switches to ISLs. This is particularly true in FC-land, where you must dedicate ports as E-ports to connect. A notable exception to this rule is my alma mater, QLogic, whose nifty 5800 stackable switch dedicates high-speed ISLs for inter-switch traffic leaving standard ports available for what they were intended.
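To put a rough number on that ISL overhead, here is a minimal sketch. It assumes a full mesh with exactly one link between every pair of switches, and the 24-port switch size is a hypothetical figure chosen only for illustration; real fabrics often use multiple links per pair or partial meshes.

```python
def full_mesh_isl_ports(switches: int) -> int:
    """Ports consumed by ISLs in a full mesh with one link per switch pair.

    There are n*(n-1)/2 links, and each link uses one port on each end,
    so n*(n-1) ports in total are tied up as ISLs.
    """
    return switches * (switches - 1)


def usable_ports(switches: int, ports_per_switch: int) -> int:
    """Ports left over for servers and storage once the ISLs are cabled."""
    return switches * ports_per_switch - full_mesh_isl_ports(switches)


# Hypothetical 24-port switches (the port count is an assumption).
for n in (2, 4, 8):
    print(f"{n} switches: {full_mesh_isl_ports(n)} ISL ports, "
          f"{usable_ports(n, 24)} usable ports")
```

The ISL count grows roughly with the square of the number of switches, which is why dedicated stacking links are attractive.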

This “stackable” mentality is the driving force behind FCoE as well. Both Cisco and Brocade have designed their 50×0 and 8000 switches, respectively, to be Top of Rack (TOR) solutions. It’s an imperfect (okay, damn ugly) metaphor but for our purposes here let’s stick with it.

By using a TOR solution, which in turn leads to an End of Row (EOR) switch chassis, the data center becomes more streamlined, not cluttered. Gartner’s premise that mesh networks occupy more ports in order to sustain the mesh is the genesis for the assertion that a single converged network uses more ports than a separate LAN and SAN.

What Gartner is missing – and is glaringly obvious when talking in real-world applications – is just how many ports are currently being used by LAN and SAN environments. When I was presenting on FCoE back in 2008 and 2009, I would routinely ask my audience just how many NICs they had installed per server. In nearly every presentation (and I did dozens), there were at least 1 or 2 members of the audience who had 16 NICs per server.

That’s 16, folx. 4 to the power of 2. 2 x 2 x 2 x 2. 16 FREAKIN‘ NICs. Dayum!

Mathematically, if you can replace 16 NICs (even with one port each) and 4 HBAs (even with one port each) with 2 CNAs (with 2 ports each), you have port reduction: 20 ports become 4.
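For anyone who wants to plug in their own server configuration, the arithmetic can be sketched in a few lines. The adapter counts are the ones from the example above; the function name is just an illustrative helper.

```python
def port_count(adapters: int, ports_per_adapter: int) -> int:
    """Total ports contributed by a group of identical adapters."""
    return adapters * ports_per_adapter


# Figures from the example above: single-port NICs and HBAs, dual-port CNAs.
legacy_ports = port_count(16, 1) + port_count(4, 1)  # 16 NICs + 4 HBAs
converged_ports = port_count(2, 2)                   # 2 dual-port CNAs

ports_saved = legacy_ports - converged_ports
print(f"{legacy_ports} ports before, {converged_ports} after, "
      f"{ports_saved} saved per server")
```

Every port saved on the server is also a switch port, a cable, and a pair of transceivers saved on the other end.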

Financial Barriers?

Gartner makes the assertion that because more ports are necessary, more equipment is required, and ergo your maintenance and support costs balloon.

All things being equal, this would be true. However, all things are not equal.

For one thing, we’re talking about moving to FCoE as an evolutionary/expansionist approach, rather than Rip-n-Replace. This is a move to 10GbE which isn’t ‘free,’ as single 1GbE infrastructures are often seen. This means that you are buying one 10GbE/FCoE switch as opposed to both Ethernet and Fibre Channel switches. You are purchasing one CNA as opposed to one 10GbE NIC and one 8Gb FC HBA. You’re purchasing one cable as opposed to two. Two SFP+ transceivers per Eth/FC port combo versus 4.

This is asset reduction, and it only gets better when you start figuring out how many NICs and HBAs you’re truly replacing.

Maintenance costs? Should we get into the power and cooling costs you save when you swap out Cat 6a for TwinAx (16 W vs. 0.1 W)? In another post I’ll break it down for you in detail, but the bottom line is that for large DC deployments you’re talking over $70k/year in power reduction just from cabling.
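Until that detailed breakdown, here is a rough sketch of how such an estimate could be set up. It treats the 16 W and 0.1 W figures as per-link power draw, which is my reading of the numbers above; the link count and electricity price are purely hypothetical inputs, and cooling overhead is ignored. It is not meant to reproduce the $70k figure.

```python
HOURS_PER_YEAR = 24 * 365


def annual_cabling_power_cost(links: int, watts_per_link: float,
                              dollars_per_kwh: float) -> float:
    """Yearly electricity cost for a set of links running around the clock."""
    kwh_per_year = links * watts_per_link * HOURS_PER_YEAR / 1000
    return kwh_per_year * dollars_per_kwh


# Hypothetical deployment: 1,000 links at $0.10 per kWh (both are assumptions).
links, price = 1000, 0.10
cat6a_cost = annual_cabling_power_cost(links, 16.0, price)
twinax_cost = annual_cabling_power_cost(links, 0.1, price)

print(f"Cat 6a: ${cat6a_cost:,.2f}/year")
print(f"TwinAx: ${twinax_cost:,.2f}/year")
print(f"Difference: ${cat6a_cost - twinax_cost:,.2f}/year")
```

Swap in your own link count, local power price, and a cooling multiplier to see how the gap scales in your environment.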

Increased Complexity

Gartner asserts that by layering multiple protocols onto a single infrastructure, increased complexity is inevitable:

Gartner also believes that there are significant design and management issues to be addressed. When two networks are overlaid on a single infrastructure, complexity increases significantly. As traffic shares ports, line cards and inter-switch links, avoiding congestion (hot spots) becomes extremely difficult.

This is true. And it’s also not news.

Fibre Channel over Ethernet is, in effect, a method of virtualization of storage traffic. The method by which the links are created involves virtualizing the nodes and ports, which in turn abstracts the links themselves. An entirely new protocol of link discovery, called FIP (Fibre Channel over Ethernet Initialization Protocol), needed to be developed in order to handle not only the addressing scheme but also the link management.

But again, why is this a shock? Is Gartner shocked, shocked I tell you! that VMware needed to increase complexity in order to virtualize hosts onto bare metal hardware, for instance? Is ESX doomed, doomed I tell you! because adding multiple hosts on a single hardware platform is more complex?

Mr. Skorupa said that over time, emerging standards, such as Transparent Interconnection of Lots of Links (TRILL) may make it easier to avoid these hot spots, but mature, standards-compliant implementations are at least two to three years away.

Well, no and yes. TRILL is not a method to handle congestion, it’s a method of maintaining link-states to mitigate temporary loop issues, a way of addressing some of the issues surrounding spanning tree. It’s a topic worth exploring on its own, but what it isn’t is a method of “hot spot handling.”

What is true is that standards-compliant implementations are a ways away. However, it’s not clear what the criticism here really is. Is Gartner complaining that they’re not here yet? Are they claiming that FCoE vendors are saying that it is? Are they saying that because it’s not available now we should stop working on it, pack up our bags and go home?

What’s your point, dude?

It’s not a Debug, it’s a De-feature!

Gartner makes a very interesting claim about debugging problems on a converged network, since the “interactions between LAN and SAN traffic can make root cause analysis more difficult.”

Since many problems are transient in nature, events must be correlated across the two virtual networks, increasing complexity. Should an outage be required for solving a problem or simply for performing maintenance, a downtime window that is acceptable for both environments may be required. This increases complexity and may increase cost, as well.

Wait, what?

When you have an Ethernet problem, what are your troubleshooting steps? Chances are if you’re an Ethernet network admin you have a series of steps to go through (I’m being sarcastic here; there are very well-known and well-tested troubleshooting techniques. Packet sniffing, anyone?). If you’re a Fibre Channel guy, you have your own troubleshooting techniques.

With Fibre Channel over Ethernet – guess what? – you sniff packets and troubleshoot the same. freakin’. way. You use the same tools you have always used because to an admin it is Fibre Channel and Ethernet.

The notion that “problems are transient in nature, events must be correlated across the two virtual networks,” is bizarre in light of the way that FCoE packets are handled in a DCB switch. FCoE traffic is not transient across the link by any means, and to suggest that somehow LAN traffic and SAN traffic intermingle is to imply that Gartner has no clue how PFC works.

Because FCoE is necessarily a virtualized abstraction layer, the traffic does not get “correlated.” Don’t believe me? Take a look at the Ethernet frame and tell me how decisions based upon ethertypes can somehow get ‘confused.’
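To make the ethertype point concrete, here is a minimal sketch of classifying a raw Ethernet frame. The EtherType values are the registered ones (0x8906 for FCoE, 0x8914 for FIP, 0x8100 for an 802.1Q tag). The `build_frame` helper is hypothetical and only fabricates headers for the demo, and putting FCoE on priority 3 is a common convention I am assuming here, not a requirement.

```python
import struct

ETHERTYPE_VLAN = 0x8100  # 802.1Q tag; carries the 3-bit priority PFC acts on
ETHERTYPE_NAMES = {
    0x0800: "IPv4",
    0x86DD: "IPv6",
    0x8906: "FCoE",
    0x8914: "FIP",
}


def classify_frame(frame: bytes):
    """Return (traffic_name, priority) for a raw Ethernet frame.

    priority is the 802.1Q PCP value (0-7), or None for untagged frames.
    """
    if len(frame) < 14:
        raise ValueError("frame shorter than an Ethernet header")
    (ethertype,) = struct.unpack_from("!H", frame, 12)
    priority = None
    if ethertype == ETHERTYPE_VLAN:
        if len(frame) < 18:
            raise ValueError("truncated 802.1Q tag")
        tci, ethertype = struct.unpack_from("!HH", frame, 14)
        priority = tci >> 13  # top three bits of the tag control field
    name = ETHERTYPE_NAMES.get(ethertype, f"other (0x{ethertype:04x})")
    return name, priority


def build_frame(ethertype: int, priority=None, vlan_id: int = 1) -> bytes:
    """Hypothetical helper: a bare header with zeroed MAC addresses."""
    macs = bytes(12)
    if priority is None:
        return macs + struct.pack("!H", ethertype)
    tci = (priority << 13) | vlan_id
    return macs + struct.pack("!HHH", ETHERTYPE_VLAN, tci, ethertype)


print(classify_frame(build_frame(0x8906, priority=3)))  # ('FCoE', 3)
print(classify_frame(build_frame(0x0800)))              # ('IPv4', None)
```

The decision is a fixed-offset lookup on a 16-bit field. A storage frame and a LAN frame never look alike at this level.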

If an outage occurs for solving a problem “or simply performing maintenance,” Gartner appears to be concerned that admins must find an “acceptable” window for both environments. It’s a good thing that outages don’t cause that kind of grief right now, eh?

From the Sublime to the Surreal

How is this for the most bizarre conclusion ever:

“[W]hile the promise that a unified fabric will require fewer switches and ports, resulting in a simpler network that consumes less power and cooling, may go unfulfilled, that doesn’t mean that enterprises should forgo the benefits of a unified network technology.”

There are so many things wrong with this statement it’s difficult to know where to begin.

After saying that the “promise” is not available now, but in some cases only a year (or three) away, now Gartner claims that it will “go unfulfilled.” Way to throw the baby out with the bathwater on this one.

Well, there you have it. It’s not available now, so it will NEVER be available. I might as well just take my bat and ball and go home and never go outside again.

But then, with no explanation whatsoever (and no connection to the logic), Gartner qualifies the claim by saying that enterprises shouldn’t “forgo the benefits” of a unified network technology – those same benefits that it just said would go unfulfilled. Gartner then goes on to support two separate networks in the conclusion of the press release.

In short, it appears that Gartner is attempting to switch back and forth between the circles in the Venn diagram shown above, thus reinforcing the importance of properly defining your terms in the first place.

Missed Opportunities

What’s interesting to me is how Gartner apparently missed the boat on some very real, and very legitimate concerns about FCoE and converged networks.

For one thing, congestion management is something that SAN admins have a right to be concerned about, and while the Ethernet folks are getting it sorted out they’re using terminology that may not be familiar to storage network admins. There’s going to have to be some major clarification.

For another, there is still the question and concern about multi-hop capabilities. Fibre Channel has its own limitations, and FCoE currently does not permit multi-hop configurations. This is scheduled to be ratified in the next standards revision, but is certainly a legitimate concern for those interested in implementing FCoE in any sizable deployment.

Still another is the real comparison between FCoE and 10Gb iSCSI. For those who simply want a wicked fast storage environment, iSCSI has been maturing nicely, showing incredible performance when tweaked properly, and there is no shortage of technical gurus to help companies get to where they want to go.

Additionally, and perhaps most importantly, there is the cultural change required for FCoE deployment. Most glaring is the simple fact that cross-functional planning is required across teams that are traditionally heavily fortified silos. Had Gartner focused on the cultural implications, rather than some very bizarre technical claims, they would have remained on pretty solid ground.

Myopic Strawmen

Gartner misses some of the most important potential benefits of FCoE and the reason why it’s been getting so much buzz.

The operating cost reduction (see the cable example above) and asset reduction (see the NIC and HBA reduction example above) are not the only benefits of FCoE. There are two things, in particular, that are quite interesting about the technology that you don’t hear a lot about (and are worth exploring in detail on their own at a later date).

First, the actual implementation of FCoE, both from a frame and a transmission (PFC) standpoint, is incredibly simple and modular. Its abstracted nature means portability and flexibility, just like any other virtualized environment. The full ramifications of what is possible haven’t even been fully explored, let alone tested. With simple building blocks you can create some very customizable solutions.

Second, there is the quietly looming benefit of Enhanced Transmission Selection (ETS). In a nutshell, PFC – which separates out the pause flow control mechanism into multiple traffic classes – doesn’t provide a way to associate different traffic classes with priority levels or limit bandwidth. ETS is a way to address that, by classifying traffic, queueing traffic, and providing more granular transmission selection.

While detailed description of ETS goes beyond the scope of this particular post, the important point is this: by taking these basic building blocks and the implicit flexibility of delivery, LAN and SAN administrators have an incredibly powerful ability to customize their networks and tune them precisely how they want them to be.
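As a taste of what that tuning looks like, here is a simplified sketch of the bandwidth-sharing idea behind ETS: each traffic class gets a guaranteed share of the link, and whatever a class doesn’t use is handed to the classes that still want more, in proportion to their shares. This is a toy model under my own assumptions, not the actual switch scheduler, and the class names, percentages, and demands are all hypothetical.

```python
def ets_allocate(link_gbps: float, shares: dict, demand: dict) -> dict:
    """Split link bandwidth across traffic classes, ETS-style.

    shares: class -> configured percentage of the link
    demand: class -> offered load in Gbps
    Unused guaranteed bandwidth is redistributed to classes that still
    have traffic to send, weighted by their configured shares.
    """
    alloc = {c: 0.0 for c in shares}
    active = {c for c in shares if demand.get(c, 0) > 0}
    remaining = link_gbps
    while active and remaining > 1e-9:
        total_share = sum(shares[c] for c in active)
        grant = {c: remaining * shares[c] / total_share for c in active}
        # Classes whose leftover demand fits inside their grant are done.
        satisfied = {c for c in active if demand[c] - alloc[c] <= grant[c]}
        if not satisfied:
            # Everyone wants more than their share: the split is final.
            for c in active:
                alloc[c] += grant[c]
            break
        for c in satisfied:
            remaining -= demand[c] - alloc[c]
            alloc[c] = float(demand[c])
        active -= satisfied
    return alloc


# Hypothetical 10GbE link with three classes (all values are assumptions).
shares = {"san": 50, "lan": 30, "ipc": 20}
demand = {"san": 2.0, "lan": 9.0, "ipc": 4.0}
for name, gbps in ets_allocate(10.0, shares, demand).items():
    print(f"{name}: {gbps:.1f} Gbps")
```

In this run the storage class only needs 2 Gbps of its guaranteed 5, so the other two classes split the spare capacity. When storage traffic ramps back up, it gets its full share again.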

Conclusion

Ultimately, anyone who believes that FCoE is the Second Coming needs to stop drinking with Jim Jones. There’s no question that this is not at the level of maturity of either of its underlying technologies, but nor does anyone who is looking at it seriously suggest that it is.

Gartner’s Chicken Little approach seems to be a way of garnering (pun intended) attention with respect to FCoE, but it does so in a way that seems to indicate that either they’re not completely familiar with the technology or they simply feel that it’s important to be contrarian.

FCoE has numerous obstacles to overcome, but to dismiss the “promise” of a technology because that promise has yet to be fulfilled seems disingenuous.

You can follow me on Twitter: @jmichelmetz