The Grand Unification Storage Theory

In Storage, Technology by J Michel Metz


I read with interest an article on The Register about a panel that happened at TECHunplugged (I wish I had known about this; I would have tried to attend in person if possible). The article was entitled “One Storage Protocol to Rule Them All” by Chris Evans, who is someone I like and respect a great deal in the industry.

The key clickbait for me, however, was the interesting tagline, “And why are Fibre Channel fans so, er, stubborn about movements?” Being heavily involved in both Fibre Channel and Fibre Channel over Ethernet (as I write this, I’m on the Board of Directors for the Fibre Channel Industry Association, though this is just my thoughts, not the FCIA’s), I was keen to see just how aligned his thoughts and mine might be.

Overall, I believe Chris’ article is spot-on when it comes to breaking down the various component parts and why there are different ways of addressing those parts inside a data center.

I do think, however, that the article overreaches a bit about the motivations and reasons why people use Fibre Channel (FC) and Fibre Channel over Ethernet (FCoE). Since that was what drew me to the article in the first place, I’ll (mostly) examine that part here.

The Fibre Channel “Problem”

As I said, the article does a great job of identifying the pieces of the puzzle, and even placing the technologies into appropriate pigeonholes:

At the transport layer, modern systems use either Ethernet, Fibre Channel or FICON. FICON is restricted to the IBM mainframe domain and we can set that aside as part of this discussion. As a dedicated transport, Fibre Channel has significant benefits over Ethernet, such as lossless delivery but was designed to be a local protocol and needs extensions to work over wide area networks.

This is true. Unlike Ethernet, Fibre Channel was specifically designed to be a storage protocol. As recently as ten years ago, Ethernet was – by way of comparison – extremely slow and expensive to make work for reliable, high-performance storage. FC, on the other hand, was 100% focused on making sure that storage data was given first priority.

(I say “first” priority because at one point FC was so good, and so fast, that it was further developed as a platform for sending IP traffic as well, but the idea didn’t take hold.)

Embedded in the quote above, however, is a slightly misleading implication: that FC lacked the foresight to be extensible over wide area networks and that this was a flaw in the design. Perhaps that was not the intent, or perhaps I’m just inferring that interpretation, but it does underscore a common misunderstanding of storage networks: storage networks are typically designed to sit as close to the host as possible, so we extend host attachability outward; we don’t “reverse engineer” inward from massively scaled systems.

In any case, it’s a minor nit and not worth expanding upon, except as a context for evaluating this next statement:

In order to try and rationalise protocols and transports, we could focus on Ethernet. However as already mentioned, there are issues with standard Ethernet and that was meant to be addressed with Data Centre Bridging and FCoE. Unfortunately for the proponents of this technology, traditional Fibre Channel remains remarkably stubborn to shift, which is not surprising for any technology that has such a huge investment in it from end users.

I’m going to play “Connect the Dots,” here, so bear with me a moment.

Fibre Channel was designed to be a storage network. That’s all it really does, and it does it very, very well. Its reputation for reliability, availability, and security is well-deserved. It’s an extremely high-performance protocol (and network – it has its own stack from the hardware up through the software layers) that – as of this writing – has a rock-solid track record.


This is because FC is a deterministic network. That is, in a Fibre Channel network we understand the relationships between hosts and storage before they are ever connected. It requires its own way of thinking about storage, data, and networking, and it’s very different from Ethernet’s non-deterministic architecture.

Think about it this way: In an Ethernet system we do not plan the nature of the relationship between devices. We create systems at upper layers to do this for us (DHCP and NAT come to mind), so that all we have to do is make a device addressable and available, and then the services do the rest to organize how they connect and maintain connections.

In Fibre Channel, it’s the opposite – each device has an address, and devices must be explicitly configured to talk to one another (in a process called Zoning). That relationship does not change, and there is no room for arbitrary configurations.
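For a rough sense of what that pre-planned relationship looks like, here is a minimal sketch – the WWPNs and zone names are made up for illustration, and this is not a real fabric API – treating a zone set as a fixed table of which initiators are allowed to reach which targets:

```python
# Minimal sketch of the idea behind FC zoning: connectivity is declared
# up front, and anything not explicitly zoned together simply cannot talk.
# WWPNs and zone names below are invented for illustration.

ZONE_SET = {
    "zone_db_servers": {
        "initiators": {"10:00:00:00:c9:11:22:33"},   # host HBA port
        "targets":    {"50:06:01:60:3b:aa:bb:cc"},   # storage array port
    },
    "zone_backup": {
        "initiators": {"10:00:00:00:c9:44:55:66"},
        "targets":    {"50:06:01:61:3b:dd:ee:ff"},
    },
}

def can_communicate(initiator_wwpn: str, target_wwpn: str) -> bool:
    """Return True only if some zone contains both ports."""
    return any(
        initiator_wwpn in zone["initiators"] and target_wwpn in zone["targets"]
        for zone in ZONE_SET.values()
    )

# The relationship is deterministic: it never changes at runtime unless
# an administrator changes the zone set itself.
print(can_communicate("10:00:00:00:c9:11:22:33", "50:06:01:60:3b:aa:bb:cc"))  # True
print(can_communicate("10:00:00:00:c9:11:22:33", "50:06:01:61:3b:dd:ee:ff"))  # False
```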

This isn’t the place to go into the specifics of how this stuff works; suffice it to say that while it’s not quite as versatile for running arbitrary network topologies, it is extremely reliable and predictable.

Because of this, Fibre Channel was unlike anything we’d seen in Ethernet networks up until around 2009/2010. Before then there really wasn’t much of an issue, because Fibre Channel speeds were up to 8x those of typical Ethernet networks.

Then it all changed.

The Rise of Ethernet

Even though 10GbE began to emerge in the mid-2000s, it wasn’t until 2010 or so that adoption really started to take off. More advances came astonishingly quickly, and soon 40GbE and 100GbE were becoming more common as backbones inside data centers.

With these advances in throughput came some obvious questions – can we use all this bandwidth for everything, not just traditional LAN traffic? Could we possibly place our deterministic storage traffic onto a medium that is, historically, non-deterministic?

If we’re going to do that, logic dictates, we’re going to have to make Ethernet deterministic. That’s why two Ethernet protocols were developed in the IEEE: Enhanced Transmission Selection and Priority Flow Control (for the geeks out there, those are IEEE 802.1Qaz and 802.1Qbb, respectively). In layman’s terms, they carve up the bandwidth for different types of traffic (so that one type doesn’t hog all of the bandwidth and starve the others), and make it possible for traffic to be lossless and deterministic (just like Fibre Channel).

(For the record, both of these protocols are used in non-FC environments too, for instance lossless iSCSI and RoCE.)
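To make the “carving up” idea concrete, here is a toy sketch of the concept – the class names and percentages are invented for illustration, not taken from any standard or vendor profile – showing ETS-style bandwidth shares and a PFC-style no-drop flag per traffic class:

```python
# Toy model of DCB settings on a converged 10GbE link: ETS assigns a
# bandwidth share to each traffic class, PFC marks which classes are
# lossless. Numbers and class names are illustrative only.

LINK_SPEED_GBPS = 10

traffic_classes = {
    # name:        (ets_share_percent, pfc_lossless)
    "fcoe":        (50, True),    # storage gets a guaranteed slice, no drops
    "lan":         (40, False),   # ordinary IP traffic, drops allowed
    "management":  (10, False),
}

# ETS shares must account for the whole link
assert sum(share for share, _ in traffic_classes.values()) == 100

for name, (share, lossless) in traffic_classes.items():
    guaranteed = LINK_SPEED_GBPS * share / 100
    print(f"{name:12s} guaranteed {guaranteed:4.1f} Gb/s, "
          f"{'lossless (PFC pause)' if lossless else 'lossy (best effort)'}")
```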

Fibre Channel over Ethernet (FCoE), then, is the ability to take Fibre Channel traffic, encapsulate it in an Ethernet frame, and send it in its own “lane” with its own deterministic properties, without affecting (or being affected by) other Ethernet traffic. Aside from the Ethernet wire itself and the encapsulation, everything about FCoE is exactly the same – the architecture, design, protocol, management, etc.
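Conceptually, the encapsulation is just a wrapper: the FC frame rides unchanged inside an Ethernet frame with its own Ethertype (0x8906 for FCoE). Here is a deliberately simplified sketch – the real FCoE encapsulation also carries version bits, SOF/EOF delimiters, and padding, and the addresses below are throwaway values:

```python
import struct

FCOE_ETHERTYPE = 0x8906  # Ethertype registered for FCoE

def encapsulate_fcoe(dst_mac: bytes, src_mac: bytes, fc_frame: bytes) -> bytes:
    """Wrap an (opaque) Fibre Channel frame in an Ethernet frame.

    Deliberately simplified: real FCoE also includes a version field,
    reserved bytes, SOF/EOF delimiters, and padding, and the link must
    support "baby jumbo" frames so a full-size FC frame fits.
    """
    eth_header = struct.pack("!6s6sH", dst_mac, src_mac, FCOE_ETHERTYPE)
    return eth_header + fc_frame  # the FC frame itself is untouched

# Example with throwaway MAC addresses and a dummy FC payload:
frame = encapsulate_fcoe(b"\x0e\xfc\x00\x00\x00\x01",
                         b"\x00\x11\x22\x33\x44\x55",
                         b"FC-frame-bytes-go-here")
print(len(frame), frame[:14].hex())
```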

And this is where I must disagree a bit with Chris’ assessment (but not entirely):

Unfortunately for the proponents of this technology, traditional Fibre Channel remains remarkably stubborn to shift, which is not surprising for any technology that has such a huge investment in it from end users. Of course by that I’m not specifically referring to hardware, but also the knowledge and experience in building storage networks that has to be relearned.

FCoE was, and has always been, intended to be inserted into data center architectures where necessary. Like iSCSI, it is a storage service that is enabled on Ethernet networks, but unlike iSCSI, it can be enabled on a link-by-link basis. This is the part that confuses a lot of people (to this day).

And this is the part where I both agree and disagree with Chris’ assessment. Many people (Chris included, I think, though I don’t mean to put words in his mouth) think that unless there is an “all-FCoE” storage network, it doesn’t “count.” That is, as Chris points out in the article, the accusation is that the protocol “wasn’t successful.” This, despite the fact that there are millions of FCoE hosts (which is where the greatest cost benefits of FCoE happen to be). Chris, like many others, only thinks a protocol counts if the storage device happens to be running it – a shortsighted view, IMO.

I think, though, that Chris identifies a major obstacle for a lot of people – the perception that a different skill set is necessary for running FCoE versus FC at the protocol level. Even though this isn’t the case, persuading a culture of people who are used to a fixed approach to “doing” storage is difficult.

The Grand Unification Storage Protocol

At the moment I am observing two specific, but diametrically opposite, trends in storage (at a high level).

On the one hand you have the expansion into massively scalable, object-based systems that are designed to be accessible by any device anywhere in the world. Technology (and physics, not to mention common sense) mandates that this is not going to be the panacea for all storage problems, because it necessitates moving storage devices further away from the hosts that need to access them in order to sustain that scalability.

The second trend is a move towards faster and faster storage. Non-Volatile Memory (NVM) and its corresponding protocol, NVM Express, are making stunning advances in reducing latency. Think your Flash/SSD is fast? You ain’t seen nothing yet. This necessitates moving the storage closer to the hosts to capitalize on that performance.

It makes very little sense to try to place a large-scale, high-latency object protocol on ever-faster storage devices, ad infinitum. If I have a storage protocol that has a tenth of a second latency, it doesn’t matter if I have a device that has a thousandth of a second latency or ten-thousandth of a second latency (so why use the faster, and more expensive, storage?).
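A quick back-of-the-envelope calculation makes the point (the numbers are round figures for illustration only):

```python
# If the protocol/network path costs 100 ms, the device's latency is noise.
protocol_latency_s = 0.1        # a tenth of a second in the protocol/network path
device_a_latency_s = 0.001      # a millisecond-class device
device_b_latency_s = 0.0001     # a device ten times faster (and more expensive)

total_a = protocol_latency_s + device_a_latency_s   # 0.1010 s end to end
total_b = protocol_latency_s + device_b_latency_s   # 0.1001 s end to end

improvement = (total_a - total_b) / total_a * 100
print(f"The 10x faster device improves end-to-end latency by only {improvement:.2f}%")
# ~0.9% -- the advanced media is wasted behind a slow protocol.
```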

As these storage devices continue to get faster, the network becomes a bigger player in performance problems. So, non-deterministic storage networks (like file and object) will become less and less viable. Deterministic storage systems, like Fibre Channel, InfiniBand, and RoCE, are going to be more and more relevant in cases like this.

Bottom Line

Personally, I don’t like the question of “one storage protocol to rule them all,” because it fails to account for the consequences of such a thing. People talk about avoiding “lock-in” all the time, but for some reason the religious wars over the “one true storage protocol” seem immune to that thought process.

It seems to me that it’s a good thing to have many different types of tools in the toolbox. Sure, the Swiss Army Knife is versatile and good to have because it can be useful in many situations, but it’s not a major tool for building a house (or rebuilding Jeeps). Yes, it can do a lot of things, but what are the limitations on how well any one of those things is done?

Storage is the same way. I know that there are efforts to make object storage look and act like Block, but to me (and this is my personal, humble opinion) it’s similar to the Swiss Army Knife example. It is impossible to get NVMe performance (for instance) out of a Block-emulation protocol; the emulation negates any advantage of the advanced media compared to a native Block storage protocol.

I’m a fan of using the right tool for the job, and having the ability to choose different options that suit my needs. I think that while a Single Storage Protocol may seem like a good idea for some people and some uses, it wouldn’t address the broad spectrum of problems that storage and storage networks need to resolve.

Comments

  1. J, nice article. In fact what I meant on that last section wasn’t that you couldn’t have FCoE and FC, just that many storage people are somewhat bigoted in their outlook and there’s always been a bit of contention with the networking teams to the extent that both like to be in control. Unfortunately both have their own design goals that don’t always match up! The stubbornness is of the people as much as anything else. 🙂

    1. This is another point where we completely agree. Most storage conflicts have nothing to do with the technology, it’s a total “Layer 8” problem. Or, in FC terms, a “Layer 5” problem. 😉

