User / kernel splits (Linus Torvalds)

From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: Integration of SCST in the mainstream Linux kernel
Date: Mon, 04 Feb 2008 19:45:49 UTC
Message-ID: <fa.rahSOAYQUY9XcGF6fpobaRwkYEw@ifi.uio.no>

On Mon, 4 Feb 2008, Nicholas A. Bellinger wrote:
>
> While this does not have anything to do directly with the kernel vs.
> user discussion for target mode storage engine, the scaling and latency
> case is easy enough to make if we are talking about scaling TCP for 10
> Gb/sec storage fabrics.

I would like to point out that while I think there is no question that the
basic data transfer engine would perform better in kernel space, there
still *are* questions whether

 - iSCSI is relevant enough for us to even care ...

 - ... and the complexity is actually worth it.

That said, I also tend to believe that trying to split things up between
kernel and user space is often more complex than just keeping things in
one place, because the trade-offs of which part goes where will inevitably
be wrong in *some* area, and then you're really screwed.

So from a purely personal standpoint, I'd like to say that I'm not really
interested in iSCSI (and I don't quite know why I've been cc'd on this
whole discussion) and think that other approaches are potentially *much*
better. So for example, I personally suspect that ATA-over-ethernet is way
better than some crazy SCSI-over-TCP crap, but I'm biased for simple and
low-level, and against those crazy SCSI people to begin with.

So take any utterances of mine with a big pinch of salt.

Historically, the only split that has worked pretty well is "connection
initiation/setup in user space, actual data transfers in kernel space".
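
A small-scale illustration of that split, as a minimal sketch assuming
Linux and Python's os.sendfile (serve_file is a hypothetical name, not
something from this thread): user space chooses the connection and the
file, and the kernel copies the bytes without them passing through a
user-space buffer.

```python
import os
import socket
import tempfile


def serve_file(conn, path):
    """Send the file at `path` over the connected socket `conn`.

    User space did the setup (opened the file, owns the connection);
    each os.sendfile call asks the kernel to move the data itself.
    Returns the number of bytes sent.
    """
    with open(path, "rb") as f:
        size = os.fstat(f.fileno()).st_size
        offset = 0
        while offset < size:
            sent = os.sendfile(conn.fileno(), f.fileno(), offset, size - offset)
            if sent == 0:  # peer went away or nothing more to send
                break
            offset += sent
    return offset


if __name__ == "__main__":
    # A local socket pair stands in for an accepted network connection.
    server_side, client_side = socket.socketpair()
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"block" * 100)
    print(serve_file(server_side, tmp.name), "bytes sent")
    print(len(client_side.recv(4096)), "bytes received")
    os.unlink(tmp.name)
```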

Pure user-space solutions work, but tend to eventually be turned into
kernel-space if they are simple enough and really do have throughput and
latency considerations (eg nfsd), and aren't quite complex and crazy
enough to have a large impedance-matching problem even for basic IO stuff
(eg samba).

And totally pure kernel solutions work only if there are very stable
standards and no major authentication or connection setup issues (eg local
disks).

So just going by what has happened in the past, I'd assume that iSCSI
would eventually turn into "connecting/authentication in user space" with
"data transfers in kernel space". But only if it really does end up
mattering enough. We had a totally user-space NFS daemon for a long time,
and it was perfectly fine until people really started caring.

			Linus

From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: Integration of SCST in the mainstream Linux kernel
Date: Mon, 04 Feb 2008 21:26:34 UTC
Message-ID: <fa.ABx1L+sQBrH6Niu0n8lJbmJavlQ@ifi.uio.no>

On Mon, 4 Feb 2008, J. Bruce Fields wrote:
>
> I'd assumed the move was primarily because of the difficulty of getting
> correct semantics on a shared filesystem

.. not even shared. It was hard to get correct semantics full stop.

Which is a traditional problem. The thing is, the kernel always has some
internal state, and it's hard to expose all the semantics that the kernel
knows about to user space.

So no, performance is not the only reason to move to kernel space. It can
easily be things like needing direct access to internal data queues (for an
iSCSI target, this could be things like barriers or just tagged commands -
yes, you can probably emulate things like that without access to the
actual IO queues, but are you sure the semantics will be entirely right?).

The kernel/userland boundary is not just a performance boundary, it's an
abstraction boundary too, and these kinds of protocols tend to break
abstractions. NFS broke it by having "file handles" (which is not
something that really exists in user space, and is almost impossible to
emulate correctly), and I bet the same thing happens when emulating a SCSI
target in user space.

Maybe not. I _really_ haven't looked into iSCSI, I'm just guessing there
would be things like ordering issues.

		Linus

From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: Integration of SCST in the mainstream Linux kernel
Date: Mon, 04 Feb 2008 23:46:33 UTC
Message-ID: <fa.T+gwJ8m4ThZES5sRcmScA6IEseg@ifi.uio.no>

On Mon, 4 Feb 2008, Jeff Garzik wrote:
>
> Well, speaking as a complete nutter who just finished the bare bones of an
> NFSv4 userland server[1]...  it depends on your approach.

You definitely are a complete nutter ;)

> If the userland server is the _only_ one accessing the data[2] -- i.e. the
> database server model where ls(1) shows a couple multi-gigabyte files or a raw
> partition -- then it's easy to get all the semantics right, including file
> handles.  You're not racing with local kernel fileserving.

It's not really simple in general even then. The problems come with file
handles, and two big issues in particular:

 - handling a reboot (of the server) without impacting the client really
   does need a "look up by file handle" operation (which you can do by
   logging the pathname to filehandle translation, but it certainly gets
   problematic).

 - non-Unix-like filesystems don't necessarily have a stable "st_ino"
   field (ie it may change over a rename or have no meaning whatsoever,
   things like that), and that makes trying to generate a filehandle
   really interesting for them.
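
Both issues can be seen in a minimal sketch, assuming a user-space
server that has only pathname-based calls available and builds its
handles from (st_dev, st_ino). The names make_handle and HandleTable and
the 16-byte handle layout are hypothetical, chosen only for
illustration.

```python
import os
import struct
import tempfile


def make_handle(path):
    """Pack (st_dev, st_ino) into an opaque 16-byte handle.

    This is only as stable as st_ino is on the underlying filesystem.
    """
    st = os.stat(path)
    return struct.pack("!QQ", st.st_dev, st.st_ino)


class HandleTable:
    """In-memory map from handle back to a pathname.

    With only pathname-based calls, the server has to remember this
    translation itself, and the table is lost when the server restarts
    unless it is logged somewhere persistent.
    """

    def __init__(self):
        self._paths = {}

    def register(self, path):
        handle = make_handle(path)
        self._paths[handle] = path
        return handle

    def lookup(self, handle):
        return self._paths.get(handle)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "a.txt")
        open(path, "w").close()

        table = HandleTable()
        handle = table.register(path)
        print("resolved:", table.lookup(handle))

        # Simulated server reboot: the client still holds `handle`,
        # but a fresh table has no way to turn it back into a file.
        print("after restart:", HandleTable().lookup(handle))
```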

I do agree that it's possible - we obviously _did_ have a user-level NFSD
for a long while, after all - but it's quite painful if you want to handle
things well. Only allowing access through the NFSD certainly helps a lot,
but still doesn't make it quite as trivial as you claim ;)

Of course, I think you can make NFSv4 use volatile filehandles instead
of the traditional long-lived ones, and that really should avoid almost
all of the problems with doing an NFSv4 server in user space. However, I'd
expect there to be clients that don't do the whole volatile thing, or
support the file handle becoming stale only at certain well-defined points
(ie after renames, not at random reboot times).

			Linus

From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: Integration of SCST in the mainstream Linux kernel
Date: Tue, 05 Feb 2008 01:22:08 UTC
Message-ID: <fa.Jo27DKicRj6+aw6F4hfwe2LUt44@ifi.uio.no>

On Mon, 4 Feb 2008, Jeff Garzik wrote:
>
> Both of these are easily handled if the server is 100% in charge of managing
> the filesystem _metadata_ and data.  That's what I meant by complete control.
>
> i.e. it's not ext3 or reiserfs or vfat, it's a block device or 1000GB file
> managed by a userland process.

Oh ok.

Yes, if you bring the filesystem into user mode too, then the problems go
away - because now your NFSD can interact directly with the filesystem
without any kernel/usermode abstraction layer rules in between. So that
has all the same properties as moving NFSD entirely into the kernel.

			Linus

From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: Integration of SCST in the mainstream Linux kernel
Date: Mon, 04 Feb 2008 23:28:56 UTC
Message-ID: <fa.qW+Q6DHa4vjGsltD2iWBEx9wjlM@ifi.uio.no>

On Mon, 4 Feb 2008, Jeff Garzik wrote:
>
> For years I have been hoping that someone will invent a simple protocol (w/
> strong auth) that can transit ATA and SCSI commands and responses. Heck, it
> would be almost trivial if the kernel had a TLS/SSL implementation.

Why would you want authorization? If you don't use IP (just ethernet
framing), then 99% of the time the solution is to just trust the subnet.

So most people would never want TLS/SSL, and the ones that *do* want it
would probably also want IP routing, so you'd actually be better off with
a separate higher-level bridging protocol rather than have TLS/SSL as part
of the actual packet protocol.

So don't add complexity. The beauty of ATA-over-ethernet is exactly that
it's simple and straightforward.

(Simple and straightforward is also nice for actually creating devices
that are the targets of this. I just *bet* that an iSCSI target device
probably needs two orders of magnitude more CPU power than a simple AoE
thing that can probably be done in an FPGA with no real software at all).
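
To give a sense of how little framing is involved, here is a minimal
sketch of packing the common AoE header that follows the Ethernet
header. The field layout (version/flags, error, major, minor, command,
tag) and the 0x88A2 EtherType are stated as recalled from the published
AoE specification and should be checked against it before being relied
on; pack_aoe_header is a hypothetical helper name.

```python
import struct

AOE_ETHERTYPE = 0x88A2  # EtherType used for AoE frames (per the AoE spec)


def pack_aoe_header(major, minor, command, tag, version=1, flags=0, error=0):
    """Build the 10-byte AoE header that follows the Ethernet header.

    Assumed layout, in network byte order:
      4 bits version | 4 bits flags, 8 bits error,
      16 bits major (shelf), 8 bits minor (slot),
      8 bits command, 32 bits tag (echoed back in the response).
    """
    ver_flags = ((version & 0xF) << 4) | (flags & 0xF)
    return struct.pack("!BBHBBI", ver_flags, error, major, minor, command, tag)


if __name__ == "__main__":
    header = pack_aoe_header(major=1, minor=2, command=0, tag=0xDEADBEEF)
    print(len(header), header.hex())
```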

Whatever. We have now officially gotten totally off topic ;)

		Linus

From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: Integration of SCST in the mainstream Linux kernel
Date: Tue, 05 Feb 2008 00:25:53 UTC
Message-ID: <fa.VQ9hecSLDLK+viijtCbIxcn+Gu0@ifi.uio.no>

On Mon, 4 Feb 2008, Matt Mackall wrote:
>
> But ATAoE is boring because it's not IP. Which means no routing,
> firewalls, tunnels, congestion control, etc.

The thing is, that's often an advantage. Not just for performance.

> NBD and iSCSI (for all its hideous growths) can take advantage of these
> things.

.. and all this could equally well be done by a simple bridging protocol
(completely independently of any AoE code).

The thing is, iSCSI does things at the wrong level. It *forces* people to
use the complex protocols, when it's known that a lot of people don't
want them.

Which is why these AoE and FCoE things keep popping up.

It's easy to bridge ethernet and add a new layer on top of AoE if you need
it. In comparison, it's *impossible* to remove an unnecessary layer from
iSCSI.

This is why "simple and low-level is good". It's always possible to build
on top of low-level protocols, while it's generally never possible to
simplify overly complex ones.

		Linus
