From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [GIT PULL] PM updates for 2.6.33
Date: Sat, 05 Dec 2009 21:58:49 UTC
Message-ID: <fa.93gwJDm9u2I+pC+qEZxbNuu0oHo@ifi.uio.no>
On Sat, 5 Dec 2009, Linus Torvalds wrote:
>
> In other words - I'm not pulling this crazy thing. You'd better explain
> why it was done that way, when we already have done the same things better
> before in different ways.
I get the feeling that all the crazy infrastructure was due to worrying
about the suspend/resume topology.
But the reason we don't worry about that during init is that it doesn't
really tend to matter. Most slow operations are the things that aren't
topology-aware, ie things like spinning up/down disks etc, that really
could be done as a separate phase instead.
For example, is there really any reason why resume doesn't look exactly
like the init sequence? Drivers that do slow things can start async work
to do them, and then at the end of the resume sequence we just do a "wait
for all the async work", exactly like we do for the current init
sequences.
And yes, for the suspend sequence we obviously need to do any async work
(and wait for it) before we actually shut down the controllers, but that
would be _way_ more natural to do by just introducing a "pre-suspend" hook
that walks the device tree and does any async stuff. And then just wait
for the async stuff to finish before doing the suspend, and perhaps again
before doing late_suspend (maybe somebody wants to do async stuff at the
second stage too).
Then, because we need a way to undo things if things go wrong in the
middle (and because it's also nice to be symmetric), we'd probably want to
introduce that kind of "post_resume()" callback that allows you to have a
separate async wakeup thing for resume time too.
What are actually the expensive/slow things during suspend/resume? Am I
wrong when I say it's things like disk spinup/spindown (and USB discovery,
which needs USB-level support anyway, since it can involve devices that we
didn't even know about before discovery started)?
Linus
From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [GIT PULL] PM updates for 2.6.33
Date: Sun, 06 Dec 2009 00:48:50 UTC
Message-ID: <fa.NgV8Cfr7CUj82YVh4P+T8lPbaH4@ifi.uio.no>
On Sun, 6 Dec 2009, Rafael J. Wysocki wrote:
>
> The approach you're suggesting would require modifying individual drivers which
> I just wanted to avoid.
In the init path, we had the reverse worry - not wanting to make
everything (where "everything" can be some subsystem like just the set of
PCI drivers, of course - not really "everything" in an absolute sense)
async, and then having to sort things out with the random driver that
couldn't handle it.
And there were _lots_ of drivers that couldn't handle it, because they
knew they got woken up serially. The ATA layer needed to know about
asynchronous things, because sometimes those independent devices aren't so
independent at all. Which is why I don't think your approach is safe.
Just to take an example of the whole "independent devices are not
necessarily independent" thing - things like multi-port PCMCIA controllers
generally show up as multiple PCI devices. But they are _not_ independent,
and they actually share some registers. Resuming them asynchronously might
well be ok, but maybe it's not. Who knows?
In contrast, a device driver can generally know that certain _parts_ of
the initialization are safe. As an example of that, I think the libata
layer does all the port enumeration synchronously, but then once the ports
have been identified, it does the rest async.
That's the kind of decision we can sanely make when we do the async part
as a "drivers may choose to do certain parts asynchronously". Doing it at
a higher level sounds like a problem to me.
> If you don't like that, we'll have to take the longer route, although
> I'm afraid that will take lots of time and we won't be able to exploit
> the entire possible parallelism this way.
Sure. But I'd rather do the safe thing. Especially since there are likely
just a few cases that really take a long time.
> During suspend we actually know what the dependences between the devices
> are and we can use that information to do more things in parallel. For
> instance, in the majority of cases (I've yet to find a counterexample), the
> entire suspend callbacks of "leaf" PCI devices may be run in parallel with each
> other.
See above. That's simply not at all guaranteed to be true.
And when it isn't true (ie different PCI leaf devices end up having subtle
dependencies), now you need to start doing hacky things.
I'd much rather have the individual drivers say "I can do this part in
parallel", and not force it on them. Because it is definitely _not_
guaranteed that PCI devices can do parallel resume and suspend.
> Yes, we can do that, but I'm afraid that the majority of drivers won't use the
> new hooks (people generally seem to be too reluctant to modify their
> suspend/resume callbacks for fear of breaking things).
See above - I don't think this is a "majority" issue. I think it's a
"let's figure out the problem spots, and fix _those_". IOW, get 2% of the
coverage, and get 95% of the advantage.
> Disk spinup/spindown takes time, but also some ACPI devices resume slowly,
We actually saw that when we did async init. And it was horrible. There's
nothing that says that the ACPI stuff necessarily even _can_ run in
parallel.
I think we currently only do the ACPI battery ops asynchronously.
Linus
From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [GIT PULL] PM updates for 2.6.33
Date: Sun, 06 Dec 2009 02:05:34 UTC
Message-ID: <fa.YWacREkvBPJq5D9zOsV5wuehBAw@ifi.uio.no>
On Sun, 6 Dec 2009, Rafael J. Wysocki wrote:
>
> While the current settings are probably unsafe (like enabling PCI devices
> to be suspended asynchronously by default if there are not any direct
> dependences between them), there are provisions to make everything safe, if
> we have enough information (which also is needed to put the required logic into
> the drivers).
I disagree.
Think of a situation that we already handle pretty poorly: USB mass
storage devices over a suspend/resume.
> The device tree represents a good deal of the dependences
> between devices and the other dependences may be represented as PM links
> enforcing specific ordering of the PM callbacks.
The device tree means nothing at all, because it may need to be entirely
rebuilt at resume time.
Optimally, what we _should_ be doing (and aren't) for suspend/resume of
USB is to just tear down the whole topology and rebuild it and re-connect
the things like mass storage devices. IOW, there would be no device tree
to describe the topology, because we're finding it anew. And it's one of
the things we _would_ want to do asynchronously with other things.
We don't want to build up some irrelevant PM links and callbacks. We don't
want to have some completely made-up new infrastructure for something that
we _already_ want to handle totally differently for init time.
IOW, I argue very strongly against making up something PM-specific, when
there really doesn't seem to be much of an advantage. We're much better
off trying to share the init code than making up something new.
> I'd say if there's a worry that the same register may be accessed concurrently
> from two different code paths, there should be some locking in place.
Yeah. And I wish ACPI didn't exist at all. We don't know.
And we want to _limit_ our exposure to these things.
Linus
From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [GIT PULL] PM updates for 2.6.33
Date: Sun, 06 Dec 2009 00:52:54 UTC
Message-ID: <fa.eYp5PXjYEkovybUyZV1YpWhD3Qg@ifi.uio.no>
On Sun, 6 Dec 2009, Rafael J. Wysocki wrote:
>
> > It all looks terminally broken: you force async suspend for all PCI
> > drivers, even when it makes no sense.
>
> I'm not exactly sure what you're referring to. The async suspend is not
> forced, it just tells the PM core that it can execute PCI suspend/resume
> callbacks in parallel as long as the devices in question don't depend on each
> other.
That's exactly what I mean by forcing async suspend/resume.
You don't know the ordering rules for PCI devices. Multi-function PCI
devices commonly share registers - they're on the same chip, after all.
And even when the _hardware_ is totally independent, we often have
discovery rules and want to initialize in order because different drivers
will do things like unregister entirely on suspend, and then re-register
on resume.
Imagine the mess when two ethernet devices randomly end up coming up with
different names (eth0/eth1) depending on subtle timing issues.
THAT is why we do things in order. Asynchronous programming is _hard_.
Just deciding that "all PCI devices can always be resumed and suspended
asynchronously" is a much MUCH bigger decision than you seem to have
even realized.
Linus
From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [GIT PULL] PM updates for 2.6.33
Date: Sun, 06 Dec 2009 01:50:32 UTC
Message-ID: <fa.viykiFq7bzWbbDVDfRyGa5Nqkk8@ifi.uio.no>
On Sun, 6 Dec 2009, Rafael J. Wysocki wrote:
>
> > Multi-function PCI devices commonly share registers - they're on the same
> > chip, after all. And even when the _hardware_ is totally independent, we
> > often have discovery rules and want to initialize in order because different
> > drivers will do things like unregister entirely on suspend, and then
> > re-register on resume.
>
> Do any of the PCI drivers do that?
It used to be common at least for ethernet - there were a number of
drivers that essentially did the same thing on suspend/resume and on
module unload/reload.
The point is, I don't know. And neither do you. It's much safer to just do
drivers one by one, and not touch drivers that people don't test.
Linus
From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [GIT PULL] PM updates for 2.6.33
Date: Mon, 07 Dec 2009 05:58:12 UTC
Message-ID: <fa.ZWO+mHQuUcvzrim2em/vuVIqTbw@ifi.uio.no>
On Mon, 7 Dec 2009, Zhang Rui wrote:
>
> Hi, Linus,
> can you please look at this patch set and see if the idea is right?
> http://marc.info/?l=linux-kernel&m=124840449826386&w=2
> http://marc.info/?l=linux-acpi&m=124840456826456&w=2
> http://marc.info/?l=linux-acpi&m=124840456926459&w=2
> http://marc.info/?l=linux-acpi&m=124840457026468&w=2
> http://marc.info/?l=linux-acpi&m=124840457126471&w=2
So I'm not entirely sure about that patch-set, but the thing I like about
it is how drivers really sign up to it one by one, rather than having all
PCI devices automatically signed up for async behavior.
That said, the thing I don't like about it is some of the same thing I
don't necessarily like about the series in Rafael's tree either: it looks
rather over-designed with the whole infrastructure for async device logic
(your patch in http://marc.info/?l=linux-acpi&m=124840456926459&w=2). How
would you explain that whole async_dev_register() logic in simple terms to
somebody else?
(I think yours is simpler than the one in the PM tree, but I dunno. I've
not really compared the two).
So let me explain my dislike by trying to outline some conceptually simple
thing that doesn't have any call-backs, doesn't have any "classes",
doesn't require registration etc. It just allows drivers at any level to
decide to do some things (not necessarily everything) asynchronously.
Here's the outline:
- first off: drivers that don't know that they nest clearly don't do
anything asynchronous. No "PCI devices can be done in parallel" crap,
because they really can't - not in the general case. So just forget
about that kind of logic entirely: it's just wrong.
- the 'suspend' thing is a depth-first tree walk. As we suspend a node,
we first suspend the child nodes, and then we suspend the node itself.
Everybody agrees about that, right?
- Trivial "async rule": the tree is walked synchronously, but as we walk
it, any point in the tree may decide to do some or all of its suspend
asynchronously. For example, when we hit a disk node, the disk driver
may just decide that (a) it knows that the disk is an independent thing
and (b) it's hierarchical wrt its parent so (c) it can do the disk
suspend asynchronously.
- To protect against a parent node being suspended before any async child
work has completed, the child suspend - before it kicks off the actual
async work - just needs to take a read-lock on the parent (read-lock,
because you may have multiple children sharing a parent, and they don't
lock each other out). Then the only thing the asynchronous code needs
to do is to release the read lock when it is done.
- Now, the rule just becomes that the parent has to take a write lock on
itself when it suspends itself. That will automatically block until
all children are done.
Doesn't the above sound _simple_? Doesn't that sound like it should just
obviously do the right thing? It sounds like something you can explain as
a locking rule without having any complex semantic registration or
anything at all.
Now, the problem remains that when you walk the device tree starting off
all these potentially asynchronous events, you don't want to do that
serialization part (the "parent suspend") as you walk the tree - because
then you would only ever do one single level asynchronously. Which is why
I suggested splitting the suspend into a "pre-suspend" phase (and a
"post-resume" one). Because then the tree walk goes from
  # single depth-first thing
  suspend(root)
  {
      for_each_child(root) {
          // This may take the parent lock for
          // reading if it does something async
          suspend(child);
      }
      // This serializes with any async children
      write_lock(root->lock);
      suspend_one_node(root);
      write_unlock(root->lock);
  }
to
  # Phase one: walk the tree synchronously, starting any
  # async work on the leaves
  suspend_prepare(root)
  {
      for_each_child(root) {
          // This may take the parent lock for
          // reading if it does something async
          suspend_prepare(child);
      }
      suspend_prepare_one_node(root);
  }

  # Phase two: walk the tree synchronously, waiting for
  # and finishing the suspend
  suspend(root)
  {
      for_each_child(root) {
          suspend(child);
      }
      // This serializes with any async children started in phase 1
      write_lock(root->lock);
      suspend_one_node(root);
      write_unlock(root->lock);
  }
and I really think this should work.
The advantage: untouched drivers don't change ANY SEMANTICS AT ALL. If
they don't have a 'suspend_prepare()' function, then they still see that
exact same sequence of 'suspend()' calls. In fact, even if they have
children that _do_ have drivers that have that async phase, they'll never
know, because that simple write-semaphore trivially guarantees that
whether there was async work or not, it will be completed by the time we
call 'suspend()'.
And drivers that want to do things asynchronously don't need to register
or worry: all they do is literally
- move their 'suspend()' function to 'suspend_prepare()' instead
- add a
down_read(dev->parent->lock);
async_run(mysuspend, dev);
to the point that they want to be asynchronous (which may be _all_ of
it or just some slow part). The 'mysuspend' part would be the async
part.
- add a
up_read(dev->parent->lock);
to the end of their asynchronous 'mysuspend()' function, so that when
the child has finished suspending, the parent down_write() will finally
succeed.
Doesn't that all sound pretty simple? And it has a very clear architecture
that is pretty easy to explain to anybody who knows about traversing trees
depth-first.
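As a sanity check, the rule can be modeled in ordinary userspace code - one
parent, one asynchronous child - with the obvious caveat that this is just a
sketch: a POSIX rwlock can't be released from a different thread than the one
that took it, so a counting semaphore per parent expresses the same ordering
instead of the rwsem, and pthreads stand in for the async machinery. The
device names and the sleep() standing in for a slow spindown are made up:

  /*
   * Userspace model only; build with: cc -pthread suspend.c
   * "root", "disk" and the delay are invented for illustration.
   */
  #include <pthread.h>
  #include <semaphore.h>
  #include <stdio.h>
  #include <unistd.h>

  struct device {
      const char *name;
      struct device *parent;
      sem_t children_done;    /* posted once per finished async child */
      int async_children;     /* how many posts the parent must wait for */
  };

  static struct device root = { .name = "root" };
  static struct device disk = { .name = "disk", .parent = &root };

  /* The slow part of the child's suspend, run "asynchronously". */
  static void *disk_suspend_async(void *data)
  {
      struct device *dev = data;

      sleep(1);                                   /* pretend the spindown is slow */
      printf("%s suspended\n", dev->name);
      sem_post(&dev->parent->children_done);      /* the up_read(parent) analogue */
      return NULL;
  }

  int main(void)
  {
      pthread_t tid;
      int i;

      sem_init(&root.children_done, 0, 0);

      /* suspend_prepare(disk): pin the parent, then kick off the async part. */
      root.async_children++;                      /* the down_read(parent) analogue */
      pthread_create(&tid, NULL, disk_suspend_async, &disk);

      /* suspend(root): blocks until every async child has finished. */
      for (i = 0; i < root.async_children; i++)   /* the down_write(self) analogue */
          sem_wait(&root.children_done);
      printf("%s suspended (all children done)\n", root.name);

      pthread_join(tid, NULL);
      return 0;
  }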
No complex concepts. No change to existing tested drivers. No callbacks,
no flags, no nothing. And a pretty simple way for a driver to decide: I'll
do my suspends asynchronously (without parent drivers really ever even
having to know about it).
I dunno. Maybe I'm overlooking something, but the above is much closer to
what I think would be worth doing.
Linus
From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [GIT PULL] PM updates for 2.6.33
Date: Mon, 07 Dec 2009 06:15:50 UTC
Message-ID: <fa.l/rozVfpYczisjfnk2m7ogMr9I8@ifi.uio.no>
On Sun, 6 Dec 2009, Linus Torvalds wrote:
>
> And drivers that want to do things asynchronously don't need to register
> or worry: all they do is literally [...]
Side note: for specific bus implementations, you obviously don't have to
even expose the choice. Things like the whole "suspend_late" and
"resume_early" phases don't make sense for USB devices, and the USB core
layer doesn't even expose those to the various USB drivers.
The same is true of the suspend_prepare/suspend split I'm proposing: I
suspect that for something like USB, it would make the most sense to just do
normal node suspend in suspend_prepare, which would do everything
asynchronously. Only USB hub devices would get involved at the later
'suspend()' phase.
So I'm not suggesting that "all drivers" would necessarily even need
changing in order to take advantage of asynchronous behavior.
You could change just the _core_ USB layer to do everything
automatically for USB devices, and now USB devices would automatically
suspend asynchronously not because the generic device layer knows about
it, but because the USB bus layer chose to do that "async_run()" on the
leaf node suspend functions (or rather: a helper function that calls the
leaf-node suspend, and then does the 'up_read()' call on the parent
lock: the actual USB drivers would never know about any of this).
Linus
From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [GIT PULL] PM updates for 2.6.33
Date: Mon, 07 Dec 2009 05:21:10 UTC
Message-ID: <fa.EBsEUqzFEDao6nMOu7OSLTGhHJc@ifi.uio.no>
On Sun, 6 Dec 2009, Alan Stern wrote:
>
> That's ridiculous. Having gone to all the trouble of building a device
> tree, one which is presumably still almost entirely correct, why go to
> all the trouble of tearing it down only to rebuild it again? (Note:
> I'm talking about resume-from-RAM here, not resume-from-hibernation.)
Hey, I can believe that it's worth keeping the USB device tree, and just
validating it instead. However:
> If I understand correctly, what you're suggesting is impractical. You
> would have each driver responsible for resuming the devices it
> registers.
The thing is, for 99% of all devices, we really _really_ don't care.
Especially PCI devices.
Your average laptop will have something like ten PCI devices on it, and
99% of those have no delays at all outside of the millisecond-level ones
that it takes for power management register writes etc to take place.
So what I'm suggesting is to NOT DO ANY ASYNC RESUME AT ALL by default.
Because async device management is _hard_, and results in various nasty
ordering problems that are timing-dependent etc. And it's totally
pointless for almost all cases.
This is why I think it is so crazy to try to create those idiotic "this
device depends on that other" lists etc - it's adding serious conceptual
complexity for something that nobody cares about, and that just allows for
non-deterministic behavior that we don't even want.
> So consider this suggestion: Let's define PM groups.
Let's not.
I can imagine that doing USB resume specially is worth it, since USB is
fundamentally a pretty slow bus. But USB is also a fairly clear hierarchy,
so there is no point in PM groups or any other information outside of the
pure topology.
But there is absolutely zero point in doing that for devices in general.
PCI drivers simply do not want concurrent initialization. The upsides are
basically zero (win a few msecs?) and the downsides are the pointless
complexity. We don't do PCI discovery asynchronously either - for all the
same reasons.
Now, a PCI driver may then implement a bus that is slow (ie SCSI, ATA,
USB), and that bus may itself then want to do something else. If it really
is a good idea to add the whole hierarchical model to USB suspend/resume,
I can live with that, but that is absolutely no excuse for then doing it
for cases where the hierarchy is (a) known to be broken (ie the whole PCI
multifunction thing, but also things like motherboard power management
devices) and (b) doesn't involve the same kind of slow bus issues.
Linus
From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [GIT PULL] PM updates for 2.6.33
Date: Mon, 07 Dec 2009 16:32:26 UTC
Message-ID: <fa.UWYnx+DfUZ4UMzo+uvk3Gaq/D8E@ifi.uio.no>
On Mon, 7 Dec 2009, Alan Stern wrote:
> >
> > I dunno. Maybe I'm overlooking something, but the above is much closer to
> > what I think would be worth doing.
>
> You're overlooking resume. It's more difficult than suspend. The
> issue is that a child can't start its async part until the parent's
> synchronous part is finished.
No, I haven't overlooked resume at all. I just assumed that it was
obvious. It's the exact same thing, except in reverse (the locking ends
up being slightly different, but the changes are actually fairly
straightforward).
And by reverse, I mean that you walk the tree in the reverse order too,
exactly like we already do - on suspend we walk it children-first, on
resume we walk it parents-first (small detail: we actually just walk a
simple linked list, but the list is topologically ordered, so walking it
forwards/backwards is topologically the same thing as doing that
depth-first search).
> So for example, suppose the device listing contains P, C, Q, where C is
> a child of P, Q is unrelated, and P has a long-lasting asynchronous
> requirement. The resume process will stall upon reaching C, waiting
> for P to finish. Thus even though P and Q might be able to resume in
> parallel, they won't get the chance.
No. The resume process does EXACTLY THE SAME THING as I outlined for
suspend, but just all in reverse. So now the resume process becomes the
same two-phase thing:
  # Phase one
  resume(root)
  {
      // This can do things asynchronously if it wants,
      // but needs to take the write lock on itself until
      // it is done if it does
      resume_one_node(root);
      for_each_child(root)
          resume(child);
  }

  # Phase two
  post_resume(root)
  {
      post_resume_one_node(root);
      for_each_child(root)
          post_resume(child);
  }
Notice? It's _exactly_ the same thing as suspend - except all turned
around. We do the nodes before the children ("walk the list backwards"),
and we also do the locking the other way around (ie on suspend we'd lock
the _parent_ if we wanted to do async stuff - to keep it around - but on
resume we lock _ourselves_, so that the children can have something to
wait on. Also note how we take a _write_ lock rather than a read lock).
(And again, I've only written it out in email, I've not tested it or
thought about it all that deeply, so you'll excuse any stupid thinkos.)
Now, for something like PCI, I'd suggest (once more) leaving all drivers
totally unchanged, and you end up with the exact same behavior as we had
before (no real change to the whole resume ordering, and everything is
synchronous so there is no relevant locking).
But how would the USB layer do this?
Simple: all the normal leaf devices would have their resume callback be
called at "post_resume()" time (exactly the reverse of the suspend phase:
we suspend early, and we resume late - it's all a mirror image). And I'd
suggest that the USB layer do it all totally asynchronously, except again
turned around the other way.
Remember how on suspend, the suspend of a leaf device ended up being an
issue of asynchronously calling a function that did the suspend, and then
released the read-lock of the parent. Resume is the same, except now we'd
actually want to take the parent read-lock asynchronously too, so you'd do
  down_write(leaf->lock);
  async_schedule(usb_node_resume, leaf);

where that function simply does

  usb_node_resume(node)
  {
      /* Wait for the parent to have resumed completely */
      down_read(node->parent->lock);
      node->resume(node);
      up_read(node->parent->lock);
      up_write(node->lock);
  }
and you're all done. Once more the ordering and the locking take care of
any need to serialize - there are no data structures to keep track of.
And what about USB hubs? They get resumed in the first phase (again,
exactly the mirror image of the suspend), and the only thing they need to
do is that _exact_ same thing above:
  down_write(hub->lock);
  async_schedule(usb_node_resume, hub);
- Ta-daa! All done.
Notice? It's really pretty straightforward, and there are _zero_ new
concepts. And again, no callbacks, no nothing. Just the obvious mirror
image of what happened when suspending. We do everything with simple async
calls. And none of the tree walking actually blocks (yes, we do a
"down_write()" on the nodes as we schedule the resume code, but it won't
be a blocking one, since that is the first time we encounter that node:
the blocking will come later when the async threads actually need to wait
for things).
Again, I do not guarantee that I've dotted every i, and crossed every t.
It's just that I'm pretty sure that we really don't need any fancy
"infrastructure" for something this simple. And I really much prefer
"conceptually simple high-level model" over a model of "keep track of all
the relationships and have some complex model of devices".
So let's just look at your example:
> So for example, suppose the device listing contains P, C, Q, where C is
> a child of P, Q is unrelated, and P has a long-lasting asynchronous
> requirement.
The tree is:
  ... -> P -> C
      -> Q
and with what I suggest, during phase one, P will asynchronously start the
resume. As part of its async resume it will have to wait for its parents,
of course, but all of that happens in a separate context, and the tree
traversal goes on.
And during phase #1, C and Q won't do anything at all. We _could_ do them
during this phase, and it would actually all work out fine, but we
wouldn't want to do that for a simple reason: we _want_ the pre_suspend
and post_resume phases to be total mirror images, because if we end up
doing error handling for the pre-suspend case, then the post-resume phase
would be the "fixup" for it, so we actually want leaf things to happen
during phase #2 - not because it would screw up locking or ordering, but
because of other issues.
When we hit phase #2, we then do C and Q, and do the same thing - we have
an async call that does the read-lock on the parent to make sure it's
all resumed, and then we resume C and Q. And they'll automatically resume
in parallel (unless C is waiting for P, of course, in which case P and Q
end up resuming in parallel, and C ends up waiting).
Now, the above just takes care of the inter-device ordering. There are
unrelated semantics we want to give, like "all devices will have resumed
before we start waking up user space". Those are unrelated to the
topological requirements, of course, and are not a requirement imposed by
the device tree, but by our _other_ semantics (IOW, in this respect it's
kind of like how we wanted pre-suspend and post-resume to be mirror images
for other outside reasons).
So we'd actually have a "phase #3", but that phase wouldn't be visible to
the devices themselves, it would be a
  # Phase three: make sure everything is resumed
  for_each_device() {
      read_lock(dev->lock);
      read_unlock(dev->lock);
  }
but as you can see, there are no actual device callbacks involved. It would
be just the core device layer saying "ok, now I'm going to wait for all
the devices to have finished their resume".
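As an illustration only, the same resume ordering can be modeled in
userspace. A POSIX rwlock can't be released from a different thread, so each
device here carries a small hand-rolled completion instead of the rwsem; it
expresses the same "parent first, then wait for everybody" ordering. The
topology, the names and the delays below are invented:

  /*
   * Userspace model only; build with: cc -pthread resume.c
   * Plain threads stand in for async_schedule().
   */
  #include <pthread.h>
  #include <stdio.h>
  #include <unistd.h>

  struct completion {
      pthread_mutex_t lock;
      pthread_cond_t cond;
      int done;
  };

  static void init_completion(struct completion *c)
  {
      pthread_mutex_init(&c->lock, NULL);
      pthread_cond_init(&c->cond, NULL);
      c->done = 0;
  }

  static void complete_all(struct completion *c)
  {
      pthread_mutex_lock(&c->lock);
      c->done = 1;
      pthread_cond_broadcast(&c->cond);
      pthread_mutex_unlock(&c->lock);
  }

  static void wait_for_completion(struct completion *c)
  {
      pthread_mutex_lock(&c->lock);
      while (!c->done)
          pthread_cond_wait(&c->cond, &c->lock);
      pthread_mutex_unlock(&c->lock);
  }

  #define NDEV 3

  struct device {
      const char *name;
      struct device *parent;
      struct completion resumed;
      unsigned int delay;     /* pretend resume time, in seconds */
  };

  static struct device hub   = { .name = "hub",   .parent = NULL, .delay = 1 };
  static struct device disk  = { .name = "disk",  .parent = &hub, .delay = 2 };
  static struct device mouse = { .name = "mouse", .parent = &hub, .delay = 1 };
  static struct device *devices[NDEV] = { &hub, &disk, &mouse }; /* parents first */

  static void *node_resume(void *data)
  {
      struct device *dev = data;

      if (dev->parent)                /* wait for the parent to have resumed */
          wait_for_completion(&dev->parent->resumed);
      sleep(dev->delay);              /* the slow, asynchronous part */
      printf("%s resumed\n", dev->name);
      complete_all(&dev->resumed);    /* children (and phase three) may go on */
      return NULL;
  }

  int main(void)
  {
      pthread_t tid[NDEV];
      int i;

      for (i = 0; i < NDEV; i++)
          init_completion(&devices[i]->resumed);

      /* Walk the list parents-first, kicking every resume off asynchronously. */
      for (i = 0; i < NDEV; i++)
          pthread_create(&tid[i], NULL, node_resume, devices[i]);

      /* "Phase three": wait for every device before waking up user space. */
      for (i = 0; i < NDEV; i++)
          wait_for_completion(&devices[i]->resumed);
      printf("all devices resumed\n");

      for (i = 0; i < NDEV; i++)
          pthread_join(tid[i], NULL);
      return 0;
  }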
Linus
From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [GIT PULL] PM updates for 2.6.33
Date: Mon, 07 Dec 2009 16:56:11 UTC
Message-ID: <fa.oL3WWlFODyL7Yh3gPm/JupNyAqQ@ifi.uio.no>
On Mon, 7 Dec 2009, Linus Torvalds wrote:
>
> And during phase #1, C and Q won't do anything at all. We _could_ do them
> during this phase, and it would actually all work out fine, but we
> wouldn't want to do that for a simple reason: we _want_ the pre_suspend
> and post_resume phases to be total mirror images, because if we end up
> doing error handling for the pre-suspend case, then the post-resume phase
> would be the "fixup" for it, so we actually want leaf things to happen
> during phase #2 - not because it would screw up locking or ordering, but
> because of other issues.
Ho humm.
This part made me think. Since I started mulling over the fact that we
could do the resume thing in a single phase (and really only wanted the
second phase in order to be a mirror image to the suspend), I started
thinking that we could perhaps do even the suspend with a single phase,
and avoid introducing that pre-suspend/post-resume phase at all.
And now that I think about it, we can do that by simply changing the
locking just a tiny bit.
I originally envisioned that two-phase suspend because I was thinking that
the first phase would start off the suspend, and the second phase would
finish it, but we can actually do it all with a single phase that does
both. So starting with just the regular depth-first post-ordering that is
a suspend:
  suspend(root)
  {
      for_each_child(root)
          suspend(child);
      suspend_one_node(root);
  }
the rule would be that for something like USB that wants to do the suspend
asynchronously, the node suspend routine would do
  usb_node_suspend(node)
  {
      // Make sure parent doesn't suspend: this will not block,
      // because we'll call the 'suspend' function for all nodes
      // before we call it for the parent.
      down_read(node->parent->lock);
      // Do the part that may block asynchronously
      async_schedule(do_usb_node_suspend, node);
  }

  do_usb_node_suspend(node)
  {
      // Start our suspend. This will block if we have any
      // children that are still busy suspending (they will
      // have done a down_read() in their suspend).
      down_write(node->lock);
      node->suspend(node);
      up_write(node->lock);
      // This lets our parent continue
      up_read(node->parent->lock);
  }
and it looks like we don't even need a second phase at all.
IOW, I think USB could do this on its own right now, with no extra
infrastructure from the device layer AT ALL, except for one small thing:
that new "rwsem" lock in the device data structure, and then we'd need the
"wait for everybody to have completed" loop, ie
  for_each_dev(dev) {
      down_write(dev->lock);
      up_write(dev->lock);
  }
thing at the end of the suspend loop (same thing as I mentioned about
resuming).
So I think even that whole two-phase thing was unnecessarily complicated.
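For illustration, the single-phase scheme can also be written as a rough
userspace model: the "walk" never blocks, every node's slow work runs in its
own thread and waits for its children, and the core just waits for everybody
at the end. Counting semaphores stand in for the rwsem and the async
machinery (a POSIX rwlock can't be released by another thread), and the
topology and delays are made up:

  /* Userspace model only; build with: cc -pthread single_phase.c */
  #include <pthread.h>
  #include <semaphore.h>
  #include <stdio.h>
  #include <unistd.h>

  #define NDEV 3

  struct device {
      const char *name;
      struct device *parent;
      int nchildren;          /* set up before anything runs, like down_read(parent) */
      sem_t children_done;    /* posted once by each finished child */
  };

  static sem_t all_done;      /* the final "wait for everybody" loop */

  static struct device root = { "root", NULL,  2 };
  static struct device disk = { "disk", &root, 0 };
  static struct device nic  = { "nic",  &root, 0 };
  static struct device *devices[NDEV] = { &disk, &nic, &root };  /* children first */

  static void *do_node_suspend(void *data)
  {
      struct device *dev = data;
      int i;

      /* Like down_write(dev->lock): wait until all children have finished. */
      for (i = 0; i < dev->nchildren; i++)
          sem_wait(&dev->children_done);

      sleep(1);               /* the slow part of this made-up suspend */
      printf("%s suspended\n", dev->name);

      if (dev->parent)        /* like up_read(parent->lock) */
          sem_post(&dev->parent->children_done);
      sem_post(&all_done);
      return NULL;
  }

  int main(void)
  {
      pthread_t tid[NDEV];
      int i;

      sem_init(&all_done, 0, 0);
      for (i = 0; i < NDEV; i++)
          sem_init(&devices[i]->children_done, 0, 0);

      /* The suspend "walk": children before parents, and it never blocks. */
      for (i = 0; i < NDEV; i++)
          pthread_create(&tid[i], NULL, do_node_suspend, devices[i]);

      /* End of the suspend loop: wait for every device to have finished. */
      for (i = 0; i < NDEV; i++)
          sem_wait(&all_done);
      printf("everything suspended\n");

      for (i = 0; i < NDEV; i++)
          pthread_join(tid[i], NULL);
      return 0;
  }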
Linus
From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [GIT PULL] PM updates for 2.6.33
Date: Mon, 07 Dec 2009 18:06:34 UTC
Message-ID: <fa.IgXPMdbNJ1RQTeK7PxloFFqO7d4@ifi.uio.no>
On Mon, 7 Dec 2009, Alan Stern wrote:
>
> Okay, I think I've got it. But you're wrong about one thing: Resume
> isn't _exactly_ the reverse of suspend.
Yeah, no. But I think I made it much closer by getting rid of pre-suspend
and post-resume (see my follow-up to the email you quoted).
And yeah, I started thinking along those lines exactly because it wasn't
as clean a mirror image as I thought it should be able to be.
> A non-async driver (i.e., most of them) would ignore the pre- pass
> entirely and do all its work in the second pass.
See my second email, where I think I can get rid of the whole second pass
thing. I think you'll agree that it's an even nicer mirror image.
> There's some question about what to do if a suspend or resume fails. A
> bunch of async threads will have been launched for other devices, but
> now there won't be anything waiting for them. It's not clear how this
> should be handled.
I think the rule for "suspend fails" is very simple: you can't fail in the
async codepath. There's no sane way to return errors, and trying to would
be too complex anyway. What would you do?
In fact, even though we _can_ fail in the synchronous path, I personally
consider a device driver that ever fails its suspend to be terminally
broken. We're practically always better off suspending and simply turning
off the power than saying "uh, I failed the suspend".
I've occasionally hit a few drivers that caused suspend failures, and each
and every time it was a driver bug, and the right thing to do was to just
ignore the error and suspend anyway - returning an error code and trying
to undo the suspend is not what anybody ever really wants, even if our
model _allows_ for it.
(And the rule for "resume fails" is even simpler: there's nothing we can
really do if something fails to resume - and that's true whether the
failure is synchronous or asynchronous. The device is dead. Try to reset
it, or remove it from the device tree. Tough).
Linus
From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [GIT PULL] PM updates for 2.6.33
Date: Mon, 07 Dec 2009 20:49:11 UTC
Message-ID: <fa.GhVjInDfqsIQPnpGZ6xV9n9AOcg@ifi.uio.no>
On Mon, 7 Dec 2009, Alan Stern wrote:
>
> Yes, I like this approach better and better.
>
> There is still a problem. In your code outlines, you have presented a
> classic depth-first (suspend) or depth-last (resume) tree algorithm.
Yes, I did that because that clarifies the locking rules (ie "we traverse
parent nodes last/first"), not because it was actually relevant to
anything else.
And the whole pre-order vs post-order distinction is important, and really only shows
up when you show the pseudo-code as a tree walk.
> But that's not how the PM core works. Instead it maintains dpm_list, a
> list of all devices in order of registration.
Right. I did mention that in a couple of the asides - I'm well aware that
we don't actually traverse the tree as a tree.
But the "traverse list forward" is logically the same thing as doing
a pre-order DFS, while going backwards is equivalent to doing a post-order
DFS, since all we really care about is the whole "parent first" or
"children first" part of the ordering.
So I wanted to show the logic in pseudo-code using the tree walk (because
that explains the logic _conceptually_ much better), but the actual code
would just do the list traversal.
> The consequence is that there's no way to hand off an entire subtree to
> an async thread. And as a result, your single-pass algorithm runs into
> the kind of "stall" problem I described before.
No, look again. There's no stall in the thing, because all it really
depends on (for the suspend path) is that it sees all children before
the parent (because the child will do a "down_read()" on the parent node
and that should not stall), and for the resume path it depends on seeing
the parent node before any children (because the parent node does that
"down_write()" on its own node).
Everything else is _entirely_ asynchronous, including all the other locks
it takes. So there are no stalls (except, of course, if we then hit limits
on numbers of outstanding async work and refuse to create too many
outstanding async things, but that's a separate issue, and intentional, of
course).
You're right that my first one (two-phase suspend) had a stall situation.
Linus
From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [GIT PULL] PM updates for 2.6.33
Date: Mon, 07 Dec 2009 21:42:21 UTC
Message-ID: <fa.c0b7y6jgBPOEnA5b92OGHh3cf/s@ifi.uio.no>
On Mon, 7 Dec 2009, Alan Stern wrote:
>
> It only seems that way because you didn't take into account devices
> that suspend synchronously but whose children suspend asynchronously.
But why would I care? If somebody suspends synchronously, then that's what
he wants.
> A synchronous suspend routine for a device with async child suspends
> would have to look just like your usb_node_suspend():
Sure. But that sounds like a "Doctor, it hurts when I do this" situation.
Don't do that.
Make the USB host controller do its suspend asynchronously. We don't
suspend PCI bridges anyway, iirc (but I didn't actually check). And at
worst, we can make the PCI _bridges_ know about async suspends, and solve
it that way - without actually making any normal PCI drivers do it.
Linus
From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [GIT PULL] PM updates for 2.6.33
Date: Mon, 07 Dec 2009 22:07:20 UTC
Message-ID: <fa.R1ZAPMoAFx4EcDrj2ltynHQMcQg@ifi.uio.no>
On Mon, 7 Dec 2009, Alan Stern wrote:
> >
> > Make the USB host controller do its suspend asynchronously. We don't
> > suspend PCI bridges anyway, iirc (but I didn't actually check). And at
> > worst, we can make the PCI _bridges_ know about async suspends, and solve
> > it that way - without actually making any normal PCI drivers do it.
>
> This sounds suspiciously like pushing the problem up a level and
> hoping it will go away. (Sometimes that even works.)
The "we don't suspend bridges anyway" is definitely a "hoping it will go
away" issue. I think we did suspend bridges for a short while during the
PM switch-over some time ago, and it worked most of the time, and then on
some machines it just didn't work at all. Probably because ACPI ends up
touching registers behind bridges that we closed down etc.
So PCI bridges are kind of special. Right now we don't touch them, and if
we ever do, that will be another issue.
> In the end it isn't a very big issue. Using one vs. two passes in
> dpm_suspend() is pretty unimportant.
I also suspect that even if you do the USB host controller suspend
synchronously, doing the actual USB devices asynchronously would still
help - even if it's only "asynchronously per bus" thing.
So in fact, it's probably a good first step to start off doing only the
USB devices, not the controller.
Linus
From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [GIT PULL] PM updates for 2.6.33
Date: Mon, 07 Dec 2009 22:17:26 UTC
Message-ID: <fa.TvJ1oiH7X3J2tF8Qqa4zixHTjhI@ifi.uio.no>
On Mon, 7 Dec 2009, Rafael J. Wysocki wrote:
>
> BTW, I still don't quite understand why not to put the parent's down_write
> operation into the core. It's not going to hurt for the "synchronous" devices
> and the "asynchronous" ones will need to do it anyway.
That's what I started out doing (see the first pseudo-code with the two
phases). But it _does_ actually hurt.
Because it will hurt exactly for the "multiple hubs" case: if you have two
USB hubs in parallel (and the case that Alan pointed out about a USB host
bridge is the exact same deal), then you want to be able to suspend and
resume those two independent hubs in parallel too.
But if you do the "down_write()" synchronously in the core, that means
that you are also stopping the whole "traverse the tree" thing - so now
you aren't handling the hubs in parallel even if you are handling all the
devices _behind_ them asynchronously.
This "serialize while traversing the tree" was what I was initially trying
to avoid with the two-phase approach, but that I realized (after writing
the resume path) that I could avoid much better by just moving the parents
down_write into the asynchronous path.
> Also it looks like that's something to do unconditionally for all nodes
> having children, because the parent need not know if the children do async
> operations.
True, and that was (again) the first iteration. But see above: in order to
allow way more concurrency, you don't want to introduce the false
dependency between the write-lock and the traversal of the tree (or, as
Alan points out - just a list - but that doesn't really change anything)
that is introduced by taking the lock synchronously.
So by moving the write-lock to the asynchronous work that also shuts down
the parent, you avoid that whole unnecessary serialization. But that means
that you can't do the lock in generic code.
Unless you want to do _all_ of the async logic in generic code and
re-introduce the "dev->async_suspend" flag. I would be ok with that now
that the infrastructure seems so simple.
Linus
From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [GIT PULL] PM updates for 2.6.33
Date: Mon, 07 Dec 2009 16:38:42 UTC
Message-ID: <fa.g2p6xhh5Ddtqd9BeEJnvL+7l8AQ@ifi.uio.no>
On Mon, 7 Dec 2009, Rafael J. Wysocki wrote:
>
> > The advantage: untouched drivers don't change ANY SEMANTICS AT ALL.
>
> This also was true for my patchset.
That's simply not true.
You set async_suspend on every single PCI driver. I object very heavily to
it.
You also introduce this whole big "callback when ready", and
"non-topological PM dependency chain" thing. Which I also object to.
Notice how with the simpler "lock parent" model, you _can_ actually encode
non-topological dependencies, but you do it by simply read-locking
whatever other independent device you want. So if an architecture has some
system devices that have odd rules, that architecture can simply encode
those rules in its suspend() functions.
It doesn't need to expose it to the device layer - because the device
layer won't even care. The code will just automatically "do the right
thing" without even having that notion of PM dependencies at any other
level than the driver that knows about it.
No registration, no callbacks, no nothing.
> In my patchset the drivers didn't need to do all that stuff. The only thing
> they needed, if they wanted their suspend/resume to be executed
> asynchronously, was to set the async_suspend flag.
In my patchset, the drivers don't need to either.
The _only_ thing that would do this is something like the USB layer. We're
talking ten lines of code or so. And you get rid of all the PM
dependencies and all the infrastructure - because the model is so simple
that it doesn't need any.
(Well, except for the infrastructure to run things asynchronously, but
that was kind of my point from the very beginning: we can just re-use all
that existing async infrastructure. We already have that).
Linus
From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [GIT PULL] PM updates for 2.6.33
Date: Mon, 07 Dec 2009 20:56:45 UTC
Message-ID: <fa.zUd+PZO4lfM7e9ymaBhIW/UcV2k@ifi.uio.no>
On Mon, 7 Dec 2009, Rafael J. Wysocki wrote:
>
> So I guess the only thing we need at the core level is to call
> async_synchronize_full() after every stage of suspend/resume, right?
Yes and no.
Yes in the sense that _if_ everybody always uses "async_schedule()" (or
whatever the call is named - I've really only written pseudo-code and
haven't even tried to look up the details), then the only thing you need
to do is async_synchronize_full().
But one of the nice things about using just the trivial rwlock model and
letting any async users just depend on that is that we could easily just
depend entirely on those device locks, and allow drivers to do async
shutdowns other ways too.
For example, I could imagine some driver just doing an async suspend (or
resume) that gets completed in an interrupt context, rather than being
done by 'async_schedule()' at all.
So in many ways it's nicer to serialize by just doing
  serialize_all_PM_events()
  {
      for_each_device() {
          down_write(dev->lock);
          up_write(dev->lock);
      }
  }
rather than depend on something like async_synchronize_full() that
obviously waits for all async events, but doesn't have the capability to
wait for any other event that some random driver might be using.
[ That "down+up" is kind of stupid, but I don't think we have a "wait for
unlocked" rwsem operation. We could add one, and it would be cheaper for
the case where the device never did anything async at all, and didn't
really need to dirty that cacheline by doing that write lock/unlock
pair. ]
But that really isn't a big deal. I think it would be perfectly ok to also
just say "if you do any async PM, you need to use 'async_schedule()'
because that's all we're going to wait for". It's probably perfectly fine.
Linus
From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: Async resume patch (was: Re: [GIT PULL] PM updates for 2.6.33)
Date: Tue, 08 Dec 2009 22:33:03 UTC
Message-ID: <fa.YbtcOFIsqsRV/kArb8BaxqA7p4M@ifi.uio.no>
On Tue, 8 Dec 2009, Alan Stern wrote:
>
> Suppose A and B are unrelated devices and we need to impose the
> off-tree constraint that A suspends after B.
Ah. Ok, I can imagine the off-tree constraints, but part of my "keep it
simple" was to simply not do them. If there are constraints that aren't
in the topology of the tree, then I simply don't think that async is worth
it in the first place.
> You misunderstand. The suspend algorithm will look like this:
>
> dpm_suspend()
> {
> list_for_each_entry_reverse(dpm_list, dev) {
> down_write(dev->lock);
> async_schedule(device_suspend, dev);
> }
> }
>
> device_suspend(dev)
> {
> device_for_each_child(dev, child) {
> down_read(child->lock);
> up_read(child->lock);
> }
> dev->suspend(dev); /* May do off-tree down+up pairs */
> up_write(dev->lock);
> }
Ok, so the above I think works (and see my previous email: I think
completions would be workable there too).
It's just that I think the "looping over children" is ugly, when I think
that by doing it the other way around you can make the code simpler and
only depend on the PM device list and a simple parent pointer access.
I also think that you are wrong that the above somehow protects against
non-topological dependencies. If the device you want to delay your own
suspend for is after you in the list, the down_read() on that
may succeed simply because it hasn't even done its down_write() yet and
you got scheduled early.
But I guess you could do that by walking the list twice (first to lock
them all, then to actually call the suspend function). That whole
two-phase thing, except the first phase _only_ locks, and doesn't do any
callbacks.
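A toy userspace sketch of that two-pass idea, with semaphores standing in for
the kernel primitives and A, B and the delay invented: pass one only sets up
the per-device state (the analogue of taking all the locks), pass two starts
the suspends, and then A's off-tree wait on B works no matter where B sits in
the list:

  /* Userspace model only; build with: cc -pthread offtree.c */
  #include <pthread.h>
  #include <semaphore.h>
  #include <stdio.h>
  #include <unistd.h>

  struct device {
      const char *name;
      sem_t suspended;        /* posted once the device has suspended */
  };

  static struct device A = { "A" };
  static struct device B = { "B" };

  static void *suspend_A(void *data)
  {
      (void)data;
      sem_wait(&B.suspended); /* off-tree constraint: A suspends after B */
      printf("A suspended (after B)\n");
      sem_post(&A.suspended); /* nothing waits on this here, but a real core would */
      return NULL;
  }

  static void *suspend_B(void *data)
  {
      (void)data;
      sleep(1);               /* pretend B is slow */
      printf("B suspended\n");
      sem_post(&B.suspended);
      return NULL;
  }

  int main(void)
  {
      pthread_t ta, tb;

      /* Pass one: only set up the state; don't call any suspend yet. */
      sem_init(&A.suspended, 0, 0);
      sem_init(&B.suspended, 0, 0);

      /* Pass two: kick off the suspends in list order, A before B. */
      pthread_create(&ta, NULL, suspend_A, &A);
      pthread_create(&tb, NULL, suspend_B, &B);

      /* Joining the threads stands in for the final "wait for everybody". */
      pthread_join(ta, NULL);
      pthread_join(tb, NULL);
      return 0;
  }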
Linus
From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: Async resume patch (was: Re: [GIT PULL] PM updates for 2.6.33)
Date: Wed, 09 Dec 2009 15:39:03 UTC
Message-ID: <fa.3PI1UP5YCPhzcu33WqGqJeYIAqE@ifi.uio.no>
On Wed, 9 Dec 2009, Alan Stern wrote:
>
> In fact these considerations already affect the USB resume operations,
> even without asynchronous resume. The code relies on the fact that the
> PCI layer registers sibling devices on a slot in order of increasing
> function number. There's no guarantee this will remain true in the
> future (it may already be wrong AFAIK), so putting in some explicit
> list manipulation is the prudent thing to do.
I do think we want to keep the slot ordering.
One of the silent issues that the device management code has always had is
the whole notion of naming stability. Now, udev and various fancy naming
schemes solve that at a higher level, but it is still the case that we
_really_ want basic things like your PCI controllers to show up in stable
order.
For example, it is _very_ inconvenient if things like PCI probing ends up
allocating different bus numbers (or resource allocations) across reboots
even if the hardware hasn't been changed. Just from a debuggability
standpoint, that just ends up being a total disaster.
For example, we continually hit odd special cases where PCI resource
allocation has some unexplained problem because there is some motherboard
resource that is hidden and invisible to our allocator. They are rare in
the sense that it's usually just a couple of odd laptops or something, but
they are not rare in the sense that pretty much _every_ single time we
change some resource allocation logic, we find one or two machines that
have some issue.
Things like that would be total disasters if the core device layer then
ended up also not having well-defined ordering. This is why I don't want
to do asynchronous PCI device probing, for example (ie we probe the
hardware synchronously, the PCI driver sets it all up synchronously, and
the asynchronous portion is the non-PCI part if any - things like PHY
detection, disk spinup etc).
So async things are fine, but they have _huge_ disadvantages, and I'll
personally take reliability and a stable serial algorithm over an async
one as far as possible.
That's partly why I really did suggest that we do the async stuff purely in
the USB layer, rather than try to put it deeper in the device layer. And
if we do support it "natively" in the device layer like Rafael's latest
patch, I still think we should be very very nervous about making devices
async unless there is a measured - and very noticeable - advantage.
So I really don't want to push things any further than absolutely
necessary. I do not think that something like "embedded audio" is a reason
for async, for example.
Linus
From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: Async resume patch (was: Re: [GIT PULL] PM updates for 2.6.33)
Date: Wed, 09 Dec 2009 16:58:38 UTC
Message-ID: <fa.iQg+HD/LBj3OMR0jNOnEsYsrYa4@ifi.uio.no>
On Wed, 9 Dec 2009, Mark Brown wrote:
> > How long does it take to bring down the entire embedded audio
> > subsystem? And how critical is the timing for typical systems?
>
> Worst case is about a second for both resume and suspend which means two
> seconds total but it's very hardware dependent.
I would seriously suggest just looking at the code itself.
Maybe the code is just plain sh*t? If we're talking embedded audio, we're
generally talking SoC chips (maybe some external audio daughtercard), and
quite frankly, it sounds to me like you're just wasting your own time.
There is no way that kind of hardware really needs that much time.
We should not design the device infrastructure for crap coding.
Now, I can easily see one-second delays in code that simply has never been
thought about or cared about it. We used to have things like that in the
serial code where just probing for non-existent serial ports took half a
second per port because there was a timeout.
But christ, using that as an argument for "we should do things
asynchronously" sounds like a crazy idea. Why not just take a hard look at
the driver in question, asking hard questions like "does it really need to
do something horrible like that"?
Because bad coding is much more likely to be the real reason.
Linus
From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: Async resume patch (was: Re: [GIT PULL] PM updates for 2.6.33)
Date: Wed, 09 Dec 2009 17:57:40 UTC
Message-ID: <fa.yp8Zv5EbaavzZAx2oR41bvpoGH4@ifi.uio.no>
On Wed, 9 Dec 2009, Mark Brown wrote:
>
> The problem comes when you've got audio outputs referenced to something
> other than ground which used to happen because no negative supplies were
> available in these systems. To bring these up from cold you need to
> bring the outputs up to the reference level but if you do that by just
> turning on the power you get an audible (often loud) noise in the output
> from the square(ish) waveform that results which users don't find
> acceptable.
Ouch. A second still sounds way too long - but whatever.
However, it sounds like the nice way to do that isn't by doing it
synchronously in the suspend/resume code itself, but by simply ramping it
down (and up) from a timer. It would be asynchronous, but not because the
suspend itself is in any way asynchronous.
Done right, it might even result in a nice volume fade of the sound (ie if
the hw allows for it, stop the actual sound engine late on suspend, and
start it early on resume, so that sound works _while_ the whole reference
volume rampdown/up is going on).
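As a sketch of that idea (not any real codec driver): the resume path just
arms a background ramp and returns. set_output_level() is a made-up stand-in
for whatever register write the hardware needs, and a plain thread plus
usleep() stands in for what a driver would more likely do with delayed work
or an hrtimer; the step count and period are invented numbers:

  /* Userspace sketch only; build with: cc -pthread ramp.c */
  #include <pthread.h>
  #include <stdio.h>
  #include <unistd.h>

  #define RAMP_STEPS  20
  #define STEP_USEC   (50 * 1000)     /* 20 steps x 50ms = ~1s, all off the CPU */

  /* Hypothetical hardware access; a real driver would write a codec register. */
  static void set_output_level(int percent)
  {
      printf("output level -> %d%%\n", percent);
  }

  static void *ramp_up(void *unused)
  {
      int i;

      (void)unused;
      for (i = 1; i <= RAMP_STEPS; i++) {
          set_output_level(i * 100 / RAMP_STEPS);
          usleep(STEP_USEC);          /* in a driver: delayed work or an hrtimer */
      }
      return NULL;
  }

  static pthread_t ramp_thread;

  /* The resume callback itself stays fast: it only kicks off the ramp. */
  static int codec_resume(void)
  {
      return pthread_create(&ramp_thread, NULL, ramp_up, NULL);
  }

  int main(void)
  {
      codec_resume();                     /* "resume" returns immediately... */
      printf("resume callback returned\n");
      pthread_join(ramp_thread, NULL);    /* ...while the ramp finishes on its own */
      return 0;
  }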
Linus
From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: Async resume patch (was: Re: [GIT PULL] PM updates for 2.6.33)
Date: Wed, 09 Dec 2009 17:20:07 UTC
Message-ID: <fa.9e50EhuFPTwo+CDPv5fXigzDvM4@ifi.uio.no>
On Wed, 9 Dec 2009, Alan Stern wrote:
>
> If audio _is_ being played when a suspend occurs, users probably don't
> mind audible artifacts. In fact, they probably expect some.
I'd say it's physically impossible not to get them. If you're really
suspending your audio hardware, it _will_ be quiet ;)
I suspect somebody is draining existing queues or something, or just
probing for an external analog part. Neither of which is really sensible
or absolutely required in an embedded suspend/resume kind of situation.
Especially for STR, just "leave all the data structures around, and just
stop the DMA engine" is often a perfectly fine solution - but drivers
don't do it, exactly because we've often had the mentality that you
re-initialize everything under the sun.
I can see _why_ a driver would do that ("we re-use the same code that we
use on close/open or module unload/reload"), but it doesn't change the
fact that it's stupid to do if you worry about latency.
And yeah, turning it async might hide the problem. But the code word there
is "hide" rather than "fix".
Linus
From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: Async suspend-resume patch w/ completions (was: Re: Async
Date: Fri, 11 Dec 2009 22:32:09 UTC
Message-ID: <fa.6Du8tygHKMRICpKPowqo531CyCk@ifi.uio.no>
On Fri, 11 Dec 2009, Rafael J. Wysocki wrote:
>
> But fine, say we use the approach based on rwsems and consider suspend and
> the inner lock. We acquire it using down_write(), because we want to wait for
> multiple other drivers. Now, in fact we could do literally
>
> down_write(dev->power.rwsem);
> up_write(dev->power.rwsem);
>
> because the lock doesn't really protect anything from anyone. What it does is
> to prevent _us_ from doing something too early. To me, personally, it's not a
> usual use of locks.
I agree that it's fairly unusual, but on the other hand, it's unusual only
because you contrived it to be.
If you instead do
down_write(dev->power.rwsem);
.. do the actual suspend ..
up_write(dev->power.rwsem);
it doesn't look odd any more, does it? And while you don't _need_ to hold
the power lock over the suspend call, it actually does make sense, and
gives you some nicer guarantees.
For an example of the kinds of guarantees it would give you - I think that
you might actually be able to do a partial suspend and then a resume
without any other locks, and you'd know that just the per-device locking
would already guarantee that we never try to resume a device before it
has finished its asynchronous suspend.
Think about it.
In the completion model, the "async_synchronize_full()" will synchronize
all async work, and as a result you think that you don't need that level
of robustness from the locking itself.
But think about it this way: if you could abort a failed suspend, and
start resuming devices immediately, without doing that
"async_synchronize_full()" in between - simply because you know that the
node locking itself will just "do the right thing".
To me, that's a sign of a _good_ design. Using a rwsem is simply just more
robust and natural for the problem in question. Exactly because it's a
real lock.
> > Don't try to make up problems. The _only_ subsystem we know wants this is
> > USB, and we know USB is purely a tree.
>
> Not really.
>
> I've already said it once, but let me repeat. Some device objects have those
> ACPI "shadow" device objects that represent the ACPI view of given "physical"
> device and have their own suspend and resume routines. It turns out that
> these ACPI "shadow" devices have to be suspended after their "physical"
> counterparts and resumed before them, or else things break really badly.
> I don't know the reason for that, I only verified it experimentally (I also
> don't like that design, but I didn't invent it and I have to live with it at
> least for now). So if we don't enforce these constraints doing async
> suspend and resume, we won't be able to handle _any_ devices with those
> ACPI "shadow" things asynchronously. Ever. [That includes the majority
> PCI devices, at least the "planar" ones (which is unfortunate, but that's how
> it goes).]
So?
First off, you're wrong. It's not "ever". I'm happy to add complexity
later, I just don't want to start out with a complex model. Adding
complexity too early "just because we might need it" is the wrong thing to
do.
Secondly, I repeat: we don't want to do those PCI devices asynchronously
anyway. You're again digging yourself deeper by just continually bringing
up this total non-issue. I realize you did it for testing, but I'm serious
when I say that we should limit these things as much as possible, rather
than see it as an opportunity to do crazy things.
Solve the problem at hand _first_. Solve it as simply as you can. And hope
that you never ever will need anything more complex.
Linus
From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: Async suspend-resume patch w/ completions (was: Re: Async
Date: Fri, 11 Dec 2009 23:54:06 UTC
Message-ID: <fa.Om96XHLfunS7RE1ZWibO6M9SZME@ifi.uio.no>
On Sat, 12 Dec 2009, Rafael J. Wysocki wrote:
>
> Below is a patch I've just tested, but there's a lockdep problem in it I don't
> know how to solve. Namely, lockdep is apparently unhappy with us not releasing
> the lock taken in device_suspend() and it complains we take it twice in a row
> (which we do, but for another device). I need to use down_read_non_owner()
> to make it shut up and then I also need to use up_read_non_owner() in
> __device_suspend(),
Ok, that I admit is actually a problem.
Ok, ok, I'll accept that completion() version, even though I think it's
inferior.
Linus
From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: Async suspend-resume patch w/ completions (was: Re: Async
Date: Sun, 13 Dec 2009 01:19:22 UTC
Message-ID: <fa.UcCoXLW980fjRB0kgCToqOS9eAE@ifi.uio.no>
On Sat, 12 Dec 2009, Rafael J. Wysocki wrote:
>
> I'd like to put it into my tree in this form, if you don't mind.
This version still has a major problem, which is not related to
completions vs rwsems, but simply to the fact that you wanted to do this
at the generic device layer level rather than do it at the actual
low-level suspend/resume level.
Namely that there's no apparent sane way to say "don't wait for children".
PCI bridges that don't suspend at all - or any other device that only
suspends in the 'suspend_late()' thing, for that matter - don't have any
reason what-so-ever to wait for children, since they aren't actually
suspending in the first place. But you make them wait regardless, which
then serializes things unnecessarily (for example, two unrelated USB
controllers).
And no, making _everything_ be async is _not_ the answer.
Linus
From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: Async suspend-resume patch w/ completions (was: Re: Async
Date: Mon, 14 Dec 2009 18:22:34 UTC
Message-ID: <fa.9Ml0pig3QidmuuuMYK7dyjTusIw@ifi.uio.no>
On Sat, 12 Dec 2009, Rafael J. Wysocki wrote:
>
> One solution that we have discussed on linux-pm is to start a bunch of async
> threads searching for async devices that can be suspended and suspending
> them (assuming suspend is considered) out of order with respect to dpm_list.
Ok, guys, stop the crazy.
That's another of those "ok, that's just totally stupid and clearly too
complex" ideas that I would never pull.
I should seriously suggest that people just stop discussing architectural
details on the pm list if they all end up being this level of crazy.
The sane thing to do is to just totally ignore the async layer on PCI
bridges and other things that only have a late-suspend/early-resume thing.
No need for the above kind of obviously idiotic crap.
However, my point was really that we wouldn't even have _needed_ that kind
of special case if we had just decided to let the subsystems do it. But
whatever. At worst, the PCI layer can even just mark such devices with
just late/early suspend/resume as being asynchronous, even though that
ends up resulting in some totally pointless async work that doesn't do
anything.
But please guys - rein in the crazy ideas on the pm list. It's not like
our suspend/resume has gotten so stable as to be boring, and we want it to
become unreliable again.
Linus
From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: Async suspend-resume patch w/ completions (was: Re: Async
Date: Mon, 14 Dec 2009 22:42:29 UTC
Message-ID: <fa.ihrUDIRPGXY4ujEXnCVmo/kZUEA@ifi.uio.no>
On Mon, 14 Dec 2009, Rafael J. Wysocki wrote:
> OK, what about a two-pass approach in which the first pass only inits the
> completions and starts async threads for leaf "async" devices? I think leaf
> devices are most likely to take much time to suspend, so this will give us
> a chance to save quite some time.
Why?
Really.
Again, stop making it harder than it needs to be.
Why do you make up these crazy schemes that are way more complex than they
need to be?
Here's an untested one-liner that has a 10-line comment.
I agree it is ugly, but it is ugly exactly because the generic device
layer _forces_ us to wait for children even when we don't want to. With
this, that unnecessary wait is now done asynchronously.
I'd rather do it some other way - perhaps having an explicit flag that
says "don't wait for children because I'm not going to suspend myself
until 'suspend_late' _anyway_". But at least this is _simple_.
Linus
---
drivers/pci/probe.c | 11 +++++++++++
1 files changed, 11 insertions(+), 0 deletions(-)
diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index 98ffb2d..4e0ad7b 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -437,6 +437,17 @@ static struct pci_bus *pci_alloc_child_bus(struct pci_bus *parent,
}
bridge->subordinate = child;
+ /*
+ * We don't really suspend PCI buses asynchronously.
+ *
+ * However, since we don't actually suspend them at all until
+ * the late phase, we might as well lie to the device layer
+ * and allow it to do our no-op not-suspend asynchronously, so that
+ * we end up not synchronizing with any of our child devices
+ * that might want to be asynchronous.
+ */
+ bridge->dev.power.async_suspend = 1;
+
return child;
}
From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: Async suspend-resume patch w/ completions (was: Re: Async
Date: Mon, 14 Dec 2009 22:43:57 UTC
Message-ID: <fa.UjJUewd87YG3xl6JFCl6/JewkMo@ifi.uio.no>
On Mon, 14 Dec 2009, Linus Torvalds wrote:
>
> Here's an untested one-liner that has a 10-line comment.
Btw, when I say "untested", in this case I mean that it isn't even
compile-tested. I haven't merged your other patches yet, so in my tree
that 'async_suspend' flag doesn't even exist, and the patch I sent out
definitely doesn't compile.
But it _might_ compile (and perhaps even work) in your tree.
Linus
From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: Async suspend-resume patch w/ completions (was: Re: Async
Date: Tue, 15 Dec 2009 00:10:34 UTC
Message-ID: <fa.BdeZplbNf4XU5oIm8XgBfKrdUaM@ifi.uio.no>
On Tue, 15 Dec 2009, Rafael J. Wysocki wrote:
>
> Because the PCI bridges are not the only case where it matters (I'd say they
> are really a corner case). Basically, any two async devices separated by a
> series of sync ones are likely not to be suspended (or resumed) in parallel
> with each other, because the parent is usually next to its children in dpm_list.
Give a real example that matters.
Really.
How hard can it be to understand: KISS. Keep It Simple, Stupid.
I get really tired of this whole stupid async discussion, because you're
overdesigning it.
To a first approximation, THE ONLY THING THAT MATTERS IS USB.
Linus
From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: Async suspend-resume patch w/ completions (was: Re: Async
Date: Tue, 15 Dec 2009 00:12:09 UTC
Message-ID: <fa.07gTJ9xkaFd9Cp2rz8QK2ysEhaM@ifi.uio.no>
On Mon, 14 Dec 2009, Linus Torvalds wrote:
>
> I get really tired of this whole stupid async discussion, because you're
> overdesigning it.
Btw, this is important. I'm not going to pull even your _current_ async
stuff if you can't show that you fundamentally UNDERSTAND this fact.
Stop making up idiotic complex interfaces. Look at my one-liner patch, and
realize that it gets you 99% there - the 99% that matters.
Linus
From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: Async suspend-resume patch w/ completions (was: Re: Async
Date: Tue, 15 Dec 2009 15:32:33 UTC
Message-ID: <fa.2pE1+7I6xyYPjoNR+BatIWhukVk@ifi.uio.no>
On Tue, 15 Dec 2009, Rafael J. Wysocki wrote:
>
> What fact? The only thing that matters is USB? For resume, it is. For
> suspend, it clearly isn't.
For suspend, the only other case we've seen has been the keyboard and
mouse controller, which has exactly the same "we can special case it with
a single 'let's do _this_ device asynchronously'". Again, it may not be
pretty, but it sure is simple.
Much simpler than talking about some generic infrastructure changes and
about doing "let's do leaves of the tree separately" schemes.
And that's why I'm _soo_ unhappy with you, and am insulting you. Because
you keep on making the same mistake over and over - overdesigning.
Overdesigning is a SIN. It's the archetypal example of what I call "bad
taste". I get really upset when a subsystem maintainer starts
overdesigning things.
Linus
From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: Async suspend-resume patch w/ completions (was: Re: Async
Date: Tue, 15 Dec 2009 15:31:32 UTC
Message-ID: <fa.YQdqD0S9EP71JfMvY3tg15THOz0@ifi.uio.no>
On Tue, 15 Dec 2009, Rafael J. Wysocki wrote:
> >
> > Give a real example that matters.
>
> I'll try. Let -> denote child-parent relationships and assume dpm_list looks
> like this:
No.
I mean something real - something like
- if you run on a non-PC with two USB buses behind non-PCI controllers.
- device xyz.
> If this applies to _resume_ only, then I agree, but the Arjan's data clearly
> show that serio devices take much more time to suspend than USB.
I mean in general - something where you actually have hard data that some
device really needs anything more than my one-liner, and really _needs_
some complex infrastructure.
Not "let's imagine a case like xyz".
> But if we only talk about resume, the PCI bridges don't really matter,
> because they are resumed before all devices that depend on them, so they don't
> really need to wait for anyone anyway.
But that's my _point_. That's the whole point of the one-liner patch. Read
the comment above that one-liner.
My whole point was that by doing the whole "wait for children" in generic
code, you also made devices - such as PCI bridges - have to wait for
children, even though they don't need to, and don't want to.
So I suggested an admittedly ugly hack to take care of it - rather than
some complex infrastructure.
Linus
From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: Async suspend-resume patch w/ completions (was: Re: Async
Date: Tue, 15 Dec 2009 16:29:25 UTC
Message-ID: <fa.ozcif3gKltCjaIYtD8M1Yd0Bl00@ifi.uio.no>
On Tue, 15 Dec 2009, Alan Stern wrote:
>
> It doesn't feel like an ugly hack to me. It seems like exactly the
> Right Thing To Do: Make as many devices as possible use async
> suspend/resume.
The reason it's an ugly hack is that it's actually not a simple decision to
make. The devil is in the details:
> The only reason we don't make every device async is because we don't
> know whether it's safe. In the case of PCI bridges we _do_ know --
> because they don't have any work to do outside of
> late_suspend/early_resume -- and so they _should_ be async.
That's the theory, yes. And it was worth the comment to spell out that
theory. But..
It's a very subtle theory, and it's not necessarily always 100% true. For
example, a cardbus bridge is strictly speaking very much a PCI bridge, but
for cardbus bridges we _do_ have a suspend/resume function.
And perhaps worse than that, cardbus bridges are one of the canonical
examples where two different PCI devices actually share registers. It's
quite common that some of the control registers are shared across the two
subfunctions of a two-slot cardbus controller (and we generally don't even
have full docs for them!)
> The same goes for devices that don't have suspend or resume methods.
Yes and no.
Again, the "async_suspend" flag is done at the generic device layer, but
99% of all suspend/resume methods are _not_ done at that level: they are
bus-specific functions, where the bus has a generic suspend-resume
function that it exposes to the generic device layer, and that knows about
the bus-specific rules.
So if you are a PCI device (to take just that example - but it's true of
just about all other buses too), and you don't have any suspend or resume
methods, it's actually impossible to see that fact from the generic device
layer.
And even when you know it's PCI, our rules are actually not simple at all.
Our rules for PCI devices (and this strictly speaking is true for bridges
too) are rather complex:
- do we have _any_ legacy PM support (ie the "direct" driver
suspend/resume functions in the driver ops, rather than having a
"struct dev_pm_ops" pointer)? If so, call "->suspend()"
- If not - do we have that "dev_pm_ops" thing? If so, call it.
- If not - just disable the device entirely _UNLESS_ you're a PCI bridge.
Notice? The way things are set up, if you have no suspend routine, you'll
not get suspended, but you will get disabled.
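Paraphrased as code, those rules are roughly the following (a sketch of
the shape of the logic only, not the actual drivers/pci/pci-driver.c
code):

#include <linux/pci.h>
#include <linux/pm.h>

/* Rough paraphrase of the three rules above. */
static int sketch_pci_suspend(struct device *dev)
{
	struct pci_dev *pci_dev = to_pci_dev(dev);
	struct pci_driver *drv = pci_dev->driver;

	/* 1. Legacy PM support in the PCI driver? Call ->suspend(). */
	if (drv && drv->suspend)
		return drv->suspend(pci_dev, PMSG_SUSPEND);

	/* 2. New-style dev_pm_ops? Call that instead. */
	if (drv && drv->driver.pm && drv->driver.pm->suspend)
		return drv->driver.pm->suspend(dev);

	/*
	 * 3. No PM callbacks at all: disable the device - unless it is
	 *    a bridge (i.e. it has a subordinate bus).
	 */
	if (!pci_dev->subordinate)
		pci_disable_device(pci_dev);

	return 0;
}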
So it's _not_ actually safe to asynchronously suspend a PCI device if that
device has no driver or no suspend routines - because even in the absence
of a driver and suspend routines, we'll still at least disable it. And if
there is some subtle dependency on that device that isn't obvious (say, it
might be used indirectly for some ACPI thing), then that async suspend is
the wrong thing to do.
Subtle? Hell yes.
So the whole thing about "we can do PCI bridges asynchronously because
they are obviously no-op" is kind of true - except for the "obviously"
part. It's not obvious at all. It's rather subtle.
As an example of this kind of subtlety - iirc PCIE bridges used to have
suspend and resume bugs when we initially switched over to the "new world"
suspend/resume exactly because they actually did things at "suspend" time
(rather than suspend_late), and that broke devices behind them (this was
not related to async, of course, but the point is that even when you look
like a PCI bridge, you might be doing odd things).
So just saying "let's do it asynchronously" is _not_ always guaranteed to
be the right thing at all. It's _probably_ safe for at least regular PCI
bridges. Cardbus bridges? Probably not, but since most modern laptops have
just a single slot - and people who have multiple slots seldom use them
all - most people will probably never see the problems that it _could_
introduce.
And PCIE bridges? Should be safe these days, but it wasn't quite as
obvious, because a PCIE bridge actually has a driver unlike a regular
plain PCI-PCI bridge.
Subtle, subtle.
> There remains a separate question: Should async devices also be forced
> to wait for their children? I don't see why not. For PCI bridges it
> won't make any significant difference. As long as the async code
> doesn't have to do anything, who cares when it runs?
That's why I just set the "async_resume = 1" thing.
But there might actually be reasons why we care. Like the fact that we
actually throttle the amount of parallel work we do in async_schedule().
So doing even a "no-op" asynchronously isn't actually a no-op: while it is
pending (and those things can be pending for a long time, since they have
to wait for those slow devices underneath them), it can cause _other_
async work - that isn't necessarily a no-op at all - to be then done
synchronously.
Now, admittedly our async throttling limits are high enough that the above
kind of detail will probably never ever really matter (default 256 worker
threads etc). But it's an example of how practice is different from theory
- in _theory_ it doesn't make any difference if you wait for something
asynchronously, but in practice it could make a difference under some
circumstances.
Linus
From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: Async suspend-resume patch w/ completions (was: Re: Async
Date: Tue, 15 Dec 2009 18:58:32 UTC
Message-ID: <fa.RN6gKhOo/3vcD1MoN5Iolf7S/SE@ifi.uio.no>
On Tue, 15 Dec 2009, Linus Torvalds wrote:
>
> And even when you know it's PCI, our rules are actually not simple at all.
> Our rules for PCI devices (and this strictly speaking is true for bridges
> too) are rather complex:
>
> - do we have _any_ legacy PM support (ie the "direct" driver
> suspend/resume functions in the driver ops, rather than having a
> "struct dev_pm_ops" pointer)? If so, call "->suspend()"
>
> - If not - do we have that "dev_pm_ops" thing? If so, call it.
>
> - If not - just disable the device entirely _UNLESS_ you're a PCI bridge.
>
> Notice? The way things are set up, if you have no suspend routine, you'll
> not get suspended, but you will get disabled.
Side note - what I think might be a clean solution for PCI at least is to
do something like the following:
- move that "disable the device entirely" thing to suspend_late, rather
than the earlier suspend phase. Now PCI devices without drivers or PM
will not be touched at all in the first suspend phase.
- initialize all PCI devices to have 'async_suspend = 1' on discovery
- whenever we bind a driver to the PCI device, we'd then look at whether
that driver implements suspend/resume callbacks (legacy or new), and
clear the async_suspend bit if so.
That way we'd have the same old synchronous behavior for all PCI suspend
and resume events (unless the driver itself then sets the async_suspend
bit at device init time, which it could do, of course), while still always
doing async "no-op" events.
That would avoid the ugly one-liner that just "knows" that PCI bridges are
special and don't do anything at suspend time (even though they aren't
really - a PCI bridge _could_ have a driver associated with it that does
something that might not be happy being asynchronous).
Linus
From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: Async suspend-resume patch w/ completions (was: Re: Async
Date: Tue, 15 Dec 2009 22:16:58 UTC
Message-ID: <fa.DRjrScDA324FQgBS8CvqplCz0mo@ifi.uio.no>
On Tue, 15 Dec 2009, Alan Stern wrote:
>
> Okay. This obviously implies that if/when cardbus bridges are
> converted to async suspend/resume, the driver should make sure that the
> lower-numbered devices wait for their sibling higher-numbered devices
> to suspend (and vice versa for resume). Awkward though it may be.
Yes. However, this is an excellent case where the whole "the device layer
does things asynchronously" model is really rather awkward.
For cardbus, the nicest model really would be for the _driver_ to decide
to do some things asynchronously, after having done some other things
synchronously (to make sure of ordering).
That said, I think we are ok for at least Yenta resume, because the really
ordering-critical stuff we tend to do at "resume_early", which wouldn't be
asynchronous anyway.
But for an idea of what I'm talking about, look at the o2micro stuff in
drivers/pcmcia/o2micro.h, and notice how it does certain things only for
the "PCI_FUNC(..devfn) == 0" case.
So I suspect that we _can_ just do cardbus bridges asynchronously too, but
it really needs some care. I suspect to a first approximation we would
want to do the easy cases first, and ignore cardbus as being "known to
possibly have issues".
> > Subtle? Hell yes.
>
> I don't disagree. However the subtlety lies mainly in the matter of
> non-obvious dependencies.
Yes. But we don't necessarily even _know_ those dependencies.
The Cardbus ones I know about, but really only because I wrote much of
that code initially when converting cardbus to look like the PCI bridge it
largely is. But how many other cases like that do we have that we have
perhaps never even hit, because we've never done anything out of order.
> The ACPI relations are definitely something to worry about. It would
> be a good idea, at an early stage, to add those dependencies
> explicitly. I don't know enough about them to say more; perhaps Rafael
> does.
Quite frankly, I would really not want to do ACPI first at all.
We already handle batteries specially, but any random system device? Don't
touch it, is my suggestion. There are just too many ways it can fail. Don't
tell me that things "should work" - we know for a fact that BIOS tables
almost always have every single bug they could possibly have.
> > And PCIE bridges? Should be safe these days, but it wasn't quite as
> > obvious, because a PCIE bridge actually has a driver unlike a regular
> > plain PCI-PCI bridge.
> >
> > Subtle, subtle.
>
> Indeed. Perhaps you were too hasty in suggesting that PCI bridges
> should be async.
Oh, yes. I would suggest that first we do _nothing_ async except for
within just a single USB tree, and perhaps some individual drivers like
the PS/2 keyboard controller (and do even that perhaps only for the PC
version, which we know is on the southbridge and not anywhere else).
If that ends up meaning that we block due to PCI bridges, so be it. I
really would prefer baby steps over anything more complete.
Linus
From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: Async suspend-resume patch w/ completions (was: Re: Async
Date: Sat, 19 Dec 2009 23:46:48 UTC
Message-ID: <fa.uVAh4VOHCtmRd4B/DQjqsyaLZiM@ifi.uio.no>
On Sun, 20 Dec 2009, Rafael J. Wysocki wrote:
>
> OK, so this means we can just forget about suspending/resuming i8042
> asynchronously, which is a pity, because that gave us some real suspend
> speedup on my test systems.
No. What it means is that you shouldn't try to come up with these idiotic
scenarios just trying to make trouble for yourself, and using it as an
excuse for crap.
I suggest you try to treat the i8042 controller async, and see if it is
problematic. If it isn't, don't do that then. But we actually have no real
reason to believe that it would be problematic, at least on a PC where the
actual logic is on the SB (presumably behind the LPC controller).
Why would it be?
The fact that PnP and ACPI enumerates those devices has exactly _what_ to
do with anything?
Linus
From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: Async suspend-resume patch w/ completions (was: Re: Async
Date: Sat, 19 Dec 2009 23:47:24 UTC
Message-ID: <fa.tRNI/r54BEU3090r6xHE+Cv0bRk@ifi.uio.no>
On Sat, 19 Dec 2009, Linus Torvalds wrote:
>
> I suggest you try to treat the i8042 controller async, and see if it is
> problematic. If it isn't, don't do that then.
I obviously meant: "If it _is_ problematic, don't do that then". "Is", not
"isn't".
Linus
From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: Async suspend-resume patch w/ completions (was: Re: Async
Date: Sun, 20 Dec 2009 00:09:44 UTC
Message-ID: <fa.bbQlUW82QN+unVgXJRU9ZChTkas@ifi.uio.no>
On Sun, 20 Dec 2009, Rafael J. Wysocki wrote:
> >
> > Why would it be?
>
> The embedded controller may depend on it.
Again, I say "why?"
Anything can be true. That doesn't _make_ everything true. There's no real
reason why PnP/ACPI suspend/resume should really care.
We can try it. Not for 2.6.33, but by the 34 merge window maybe we'll have
a patch-series that is ready to be tested, and that aggressively tries to
do the devices that matter asynchronously.
So instead of you trying to make up some idiotic cross-device worries,
just see if those worries have any actual background in reality. So far I
haven't actually heard anything but "in theory, anything is possible",
which is such a truism that it's not even worth voicing.
That said, I still get the feeling that we'd be even better off simply
trying to avoid the whole keyboard reset entirely. Apparently we do it for
a few HP laptops. It's entirely possible that we'd be better off simply
not _doing_ the slow thing in the first place.
For example, we may be _much_ better off doing that whole keyboard reset
at resume time than at suspend time. That's what we do when we probe
things on initialization - and the resume-time keyboard code is actually
already asynchronous - it does that atkbd_reconnect asynchronously by
queuing it as an event.
So again, all these problems may not at all be fundamental problems: the
keyboard driver does certain things, but there is no guarantee that it
_needs_ to do those things. Turning the driver async may be totally the
wrong thing to do, when we could potentially fix latency problems at the
driver level instead.
Linus