From: mash@mash.engr.sgi.com (John R. Mashey)
Newsgroups: comp.arch
Subject: Re: TOP 10 MISTAKES IN COMPUTER ARCHITECTURE
Date: 8 Aug 1998 06:38:07 GMT
Keywords: reduce
Having a zero register is a modest, but useful technique, which is
why so many recent architectures support it, because it generally
*simplifies* hardware, given the chain of assumptions usually already
made in ISAs before this decision gets made.
1) Contrary to some of the fantasies posted here, a zero register has
minimal implementation cost. It is *not* implemented as bunches of
gates that do special case checking for every instruction. It is
simply implemented to act as a "register" that delivers zeroes when read,
and ignores anything that is written. This is, of course, trivial
hardware, adds no gate delays to anything, and requires no special-casing
of logic. Recommendation: if this doesn't make sense, it would be
covered in an appropriate mid-level undergraduate course on digital
design, like Stanford CS112/EE182:
http://cs-class.stanford.edu/class/cs112/
or perhaps, just get Hennessy & Patterson's
Computer Organization and Design
Note that special-casing move or clear, for example, to avoid a trip through
an ALU, is just not something designers care about in recent designs.
(I think the datapath section of H&P above discusses such stuff.)
It is more likely to cost hardware to special-case this.
2) Start with the following assumptions, which one may or may not agree with,
but which are common:
a) 32-bit instructions (or, 1-size for fast decode)
b) many 3-operand instructions (for the usual reasons)
c) base-displacement addressing, at least.
Having a zero register:
a) Eliminates special-casing in the address unit to cause basereg 0
to supply a zero during addressing [i.e., like S/360, where the use
of R0 differs between normal usage and address computation].
This offers a simple way to get direct addressing for a modest range of
addresses ... which has actually been very useful for the kernel
and certain graphics libraries. This affects the 20+
loads+stores.
b) Does supply a few unary ALU operations for free, at the cost of
allocating one register. In MIPS, some of the ones that fall out
are:
    3-op    unary
    ADD     mov (there are various ways to get mov)
    ADDI    load-immediate (sign-extend)
    DADD    64-bit mov
    DADDI   64-bit load-immediate (sign-extend)
    DSUB    64-bit negate
    DSUBU   64-bit negate (unsigned)
    NOR     bitwise-not
    ORI     load-immediate (some values that ADDI(U) can't, like 0x8???.)
    SUB     32-bit negate
    SUBU    32-bit negate (unsigned)
Most of these are not a big deal, although the various ways to get
load-immediate are fairly useful. There are multiple redundant ways
to get mov and clear ... but it is actually less hardware to allow
such things, than it is to special-case them away.
c) x = 0 ... is a fairly common construct, since it is a common
flag value, and this means all of the stores can just use zero as a
source with no other overhead. This gets used in bzero's, but there,
the savings are minimal.
d) Finally, an extremely important case (for MIPS, anyway) that
falls out very cleanly is the following set of branch instructions:
BEQ, BEQL, BNE, BNEL
all of which have 2 register fields and use the immediate field for
a branch displacement. Comparisons to zero are *quite* common,
and this straightforwardly avoids any need to special-case.
3) This has been sufficiently useful that I've often wished we'd done
something for floating point 0.0...
4) It is simple, trivial, and safe to have a zero register.
Among other things, it would be irritating to have to waste a
cycle on every kernel entry to clear the register to make sure some
user process had not trashed it.
5) It is not a huge win to obtain the unary operations, but it does save
a few opcodes that are frequently used enough that one would have had to
add them ... and there is always pressure on opcode space. Although not
on all such ISAs, it is definitely nice not to have to special-case the
branches.
6) One might reach different conclusions for variable-size instruction
encodings, where one may well provide operations limited to 2-operands
for density. S/360 does that, so does MIPS-16. If there are only 8 or
16 integer registers, one would think harder about dedicating one to zero.
Nevertheless, designers chose to do this because it made sense, and
saved hardware, given the 32-bit instruction, 32-integer register,
3-operand formats already chosen.
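To make points 1) and 2b)-2d) concrete, here is a minimal software sketch of a
register file whose register 0 reads as zero and ignores writes, plus a few
unary operations obtained from ordinary 3-operand instructions. It is only a
behavioral model, not a description of any real implementation: the class
RegFile and the helper names (mov, li, neg, not_, beqz_taken) are invented for
illustration, and the unsigned instruction variants are used so that overflow
traps can be left out.

```python
MASK32 = 0xFFFFFFFF


class RegFile:
    """32 registers; register 0 delivers zero when read and ignores writes."""

    def __init__(self):
        self.regs = [0] * 32

    def read(self, n):
        return 0 if n == 0 else self.regs[n]

    def write(self, n, value):
        if n != 0:                      # writes to register 0 are dropped
            self.regs[n] = value & MASK32


def sign_extend16(imm):
    """Sign-extend a 16-bit immediate field."""
    imm &= 0xFFFF
    return imm - 0x10000 if imm & 0x8000 else imm


# Ordinary instructions: none of them knows anything about register 0.
def addu(rf, rd, rs, rt):
    rf.write(rd, rf.read(rs) + rf.read(rt))


def subu(rf, rd, rs, rt):
    rf.write(rd, rf.read(rs) - rf.read(rt))


def nor(rf, rd, rs, rt):
    rf.write(rd, ~(rf.read(rs) | rf.read(rt)))


def addiu(rf, rt, rs, imm):
    rf.write(rt, rf.read(rs) + sign_extend16(imm))


def ori(rf, rt, rs, imm):
    rf.write(rt, rf.read(rs) | (imm & 0xFFFF))   # immediate is zero-extended


def beq_taken(rf, rs, rt):
    return rf.read(rs) == rf.read(rt)


# Unary operations that fall out by naming register 0 as one operand.
def mov(rf, rd, rs):
    addu(rf, rd, rs, 0)


def li(rf, rd, imm):
    addiu(rf, rd, 0, imm)


def neg(rf, rd, rs):
    subu(rf, rd, 0, rs)


def not_(rf, rd, rs):
    nor(rf, rd, rs, 0)


def beqz_taken(rf, rs):
    return beq_taken(rf, rs, 0)
```

Note that li with an immediate of 0x8000 yields 0xFFFF8000 because of the sign
extension, while ori from register 0 with the same immediate yields 0x00008000,
which is the "values that ADDI(U) can't" case in the table above.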
-john mashey DISCLAIMER: <generic disclaimer: I speak for me only...>
EMAIL: mash@sgi.com DDD: 650-933-3090 FAX: 650-969-6289
USPS: Silicon Graphics/Cray Research 6L-005,
2011 N. Shoreline Blvd, Mountain View, CA 94043-1389
From: mash@mash.engr.sgi.com (John R. Mashey)
Newsgroups: comp.arch
Subject: Re: TOP 10 MISTAKES IN COMPUTER ARCHITECTURE
Date: 8 Aug 1998 19:52:21 GMT
Keywords: reduce
In article <6qi0u9$33o@senator-bedfellow.MIT.EDU>, jfc@mit.edu (John F
Carr) writes:
|> Organization: Massachvsetts Institvte of Technology
|>
|> In article <6qgrof$6gk$1@murrow.corp.sgi.com>,
|> John R. Mashey <mash@mash.engr.sgi.com> wrote:
|> >Having a zero register is a modest, but useful technique, which is
|> >why so many recent architectures support it, because it generally
|> >*simplifies* hardware, given the chain of assumptions usually already
|> >made in ISAs before this decision gets made.
|> The tradeoffs have changed since the 1980s. Is a zero register a good
|> idea in a new ISA?
There's only one new ISA: IA64, so we'll see if they did it :-)
|> Extra instructions which would not be required with a zero register
|> don't impact the design as much. The original MIPS architecture had
|> very simple instruction formats and it would have been hard to find the
|> opcode space for the extra unary instructions. A decade later, MIPS IV
|> scatters opcode bits around in formerly reserved fields and the
|> simplicity is gone.
Yep, that's why one fights hard to avoid adding things that will be
there forever, but are more usefully avoided.
|> The control/bypass logic is a lot more complex on modern chips. There are
|> many more places where a register number must be tested and the zero
|> register special-cased. On the SuperSPARC, which has a short but complex
|> pipeline including same-cycle result forwarding, a load-double instruction
|> writing %g1 causes the %g0 (zero) register to become non-zero for the next
|> two cycles. (One isn't supposed to write ldd %g1, so this isn't
|> necessarily a bug, but other SPARC implementations don't affect %g0 in
|> this situation.)
Bug in that SPARC, probably (I don't know) due to oddity that ldd is
rare instruction (in SPARC) that modifies 2 registers.
But back to the bypass stuff:
a) R2000s had the typical bypass logic of simple bypassed systems.
b) The topic is discussed in Hennessy&Patterson C O & D, p480...
which works through pipeline forwarding issues in detail for MIPS,
including the explicit zero-check ...
c) The logic already has to do register-number comparisons to set
up the MUXes before the ALU inputs, i.e., these are two 5-bit
compares of the input registers of one instruction to the
output of the previous one, and if equal, select from the
forwarding network, if unequal, take the register.
d) The zero register simply requires that, in parallel with the
register-number decoders is a zero-detect on the result register,
[which can't be any slower than the comparators, and is a small
number of gates], and that its result is combined with the
register comparison to control the input MUXes.
e) Throughout all of this discussion, we need to remember that
digital logic is *not* C code. If the same check appears in
numerous places in a pseudo-code specification of the logic,
that doesn't mean there must be numerous instantiations of the
logic; more likely, there is one instantiation and some wires.
All this is a handful of gates, and is well-described in an
undergraduate text-book. Newer chips are more complex, but this particular
check is basically nothing compared to all of the other stuff that's going
on, and is a cheap way to avoid burning opcodes.
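As a rough illustration of c) and d), the select signal for one ALU input MUX
can be written as a single boolean expression. This is a software sketch of the
logic only, with invented names (forward_select, alu_input) and a one-deep
forwarding path assumed; in hardware the register compare and the zero-detect
are evaluated in parallel, not in sequence.

```python
def forward_select(src_reg, prev_dest_reg):
    """True if the ALU input MUX should take the bypassed result."""
    same_register = (src_reg == prev_dest_reg)   # the 5-bit comparator
    dest_is_zero = (prev_dest_reg == 0)          # zero-detect on the result register
    return same_register and not dest_is_zero


def alu_input(regfile, src_reg, prev_dest_reg, prev_result):
    """Pick the operand: forwarded value or register-file value.

    regfile is a plain list whose entry 0 is assumed to hold 0.
    """
    if forward_select(src_reg, prev_dest_reg):
        return prev_result
    return regfile[src_reg]
```

The only zero-register-specific term is dest_is_zero: a previous instruction
that "wrote" register 0 never gets its result forwarded, so a later read of
register 0 still sees zero.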
One more time: zero registers are *modestly* useful, cost very little logic to
implement, make sense when given the appropriate set of predecessor
decisions, and save opcodes that people would be very tempted to add
otherwise, and therefore usually end up saving hardware. They don't fit
some ISAs, but they fit others.
From: mash@mash.engr.sgi.com (John R. Mashey)
Newsgroups: comp.arch
Subject: Re: TOP 10 MISTAKES IN COMPUTER ARCHITECTURE [really: zero registers]
Date: 11 Aug 1998 22:53:20 GMT
In article <35CCDA92.33965996@gmx.de>, Bernd Paysan <bernd.paysan@gmx.de> writes:
(apparently not having yet received the posting:
From: mash@mash.engr.sgi.com (John R. Mashey)
Subject: Re: TOP 10 MISTAKES IN COMPUTER ARCHITECTURE
Date: Sat, 8 Aug 98 12:52:21 1998
...
All this is a handful of gates, and is well-described in an
undergraduate text-book. Newer chips are more complex, but this particular
check is basically nothing compared to all of the other stuff that's going
on, and is a cheap way to avoid burning opcodes.
...
And I still think this...
|> The constraints of the original MIPS are quite outdated. A register file
|> in current CPUs is ways more complicated than it was back then. All
|> high-performance implementations now have register renaming. All are
|> OOO, thus perform scheduling on the fly. All have fairly complex bypass
|> logic (e.g. the 21264 must bypass 4 results into the next cycle, and
|> since it isn't possible to read back a written value until 2 or 3 clocks
|> later, the bypass logic is much more complex). And R0=0 is a special
|> case for bypassing (and as John Carr reported, there is one example,
|> where the implementers forgot one case).
I have some passing familiarity with the R10000, which most people in this
newsgroup know is a 4-issue, superscalar, O-O-O, speculative-execution,
register-renaming CPU. To reemphasize what was said before:
a) It wasn't a problem in the R2000.
b) It wasn't a problem in the R10000.
and
c) In some sense, it is even less of a problem in modern CPUs,
as logic is much cheaper, i.e., zero-register-check logic is
a smaller fraction, especially given the large bunch of comparators
needed for all of the dependency checks in wide-issue superscalars.
and
d) In many years of talking to chip designers, and getting pushback
on features, or complaints about the unnecessary implementation pain,
*this* one is one I've never once heard anyone complain about,
at Hot Chips or Microprocessor Forum dinners, or in the bars around
the valley, or in email, or elsewhere. That doesn't prove that
this isn't an issue, but it's a datapoint. The fact that there is
an example where people goofed doesn't bother me much: there are
hordes of places where people have goofed.
To get some facts, rather than opinions, I appeal to *designers* of
commercial CPUs that use renaming:
If you have a zero register:
1) Was it a major hassle, die size cost, or increase to critical path?
2) Were there special-case zero-register bugs in the first stepping?
|> I won't argue that implementing the common case isn't trivial. It is.
|> That's a typical mistake in design: the common case always looks fairly
|> easy. It's the exceptions that makes things awful. You won't see the
|> common-case gates in the transistor count of the chip, but you will see
|> the exceptions. And as somebody who feels responsible for careful
|> testing, I vote to avoid exceptions. They are difficult as hell to test.
It is good to avoid exceptions. Personally, I doubt this one would make
anyone's top-100 list of testing worries...
|> I also want to comment on your list of opcodes provided with R0=0. Given
|> you have SUBR instead of SUB, you only have to add one instruction, and
|> that's load-immediate. Since load-immediate has a special property for
|> data flow analysis (it's truly independent), there is value to add it to
|> the instruction set. Note that then it can use a larger value, since it
|> can share the format of the long jump instruction (one reg, large
|> immediate, also truly independent).
Counts, and some of the others, although of minor use,
are of use, and the branches are of definite use, since you
get both compares to zero and compares of two registers.
The property mentioned for load-immediate sounds good, but is already
subsumed by the kind of implementations found in current CPUs, i.e.,
as soon as you've done the register rename, you know it has no
input dependencies (it is *not* truly independent, because it usually
modifies its output register). It *cannot* use the MIPS long-jump format,
which provides no register specifiers ...
I'll say it again: this kind of feature is a modestly-good thing for
some ISAs, and as far as I know, has never caused any great hassle for
MIPS implementors. [Now: integer multiply & divide, load...left/right,
the original FP register arrangement, and sometimes, branch delays ... :-)]
Not every feature is a great breakthrough: the devil is in the details,
and sometimes the best you can do is get them modestly good.
From: mash@mash.engr.sgi.com (John R. Mashey)
Newsgroups: comp.arch
Subject: Re: TOP 10 MISTAKES IN COMPUTER ARCHITECTURE
Date: 13 Aug 1998 19:45:13 GMT
In article <35D1893C.4C92@bellatlantic.net>, "Jeffrey S. Dutky"
<dutky@bellatlantic.net> writes:
|> which has a few problems: 1) we need four instructions instead
|> of only two, so this is twice as slow as it needs to be, and
|> 2) all four operations are dependant, meaning that can't be
|> easily overlapped. While the same operations on an architecture
|> with r0==0 would be written as
|>
|> opA r0,r1,r2; discard result
|> opB r3,r4,r0; operate on value zero
While I agree with you that having a zero register is useful,
having looked at large numbers of lines of generated code for machines
that have them:
a) It is very rare to execute an ALU operation that actually does
something and put the result in zero, because the only reason to
do so is to investigate side-effects, like overflows, and it
is perfectly plausible to use a scratch register as a target most
of the time. [There are some places, like in kernel interrupt
handlers, where scratch registers are in short supply].
b) Item a) is worded carefully, since, for example, NOP on MIPS is
coded as a shift of zero with the result returned to zero ... but it doesn't
actually do anything; it is just convenient.
c) Some ISAs have made use of zero as a target of load
instructions, to turn them into prefetches, which is fairly
elegant. Others have separate prefetch instructions.
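The NOP in b) can be checked with a few lines that pack the standard MIPS
R-type fields (opcode, rs, rt, rd, shamt, funct). The function names here are
made up for the sketch; the field layout and the fact that SLL uses opcode 0
and function code 0 are the standard MIPS encoding.

```python
def encode_rtype(opcode, rs, rt, rd, shamt, funct):
    """Pack a MIPS R-type instruction: 6/5/5/5/5/6-bit fields, MSB first."""
    return ((opcode << 26) | (rs << 21) | (rt << 16) |
            (rd << 11) | (shamt << 6) | funct)


SPECIAL = 0     # major opcode for register-register operations
FUNCT_SLL = 0   # function code for shift-left-logical


def encode_sll(rd, rt, shamt):
    """sll rd, rt, shamt (the rs field is unused and left as 0)."""
    return encode_rtype(SPECIAL, 0, rt, rd, shamt, FUNCT_SLL)


# Shift register 0 left by 0 and put the result in register 0.
nop_word = encode_sll(0, 0, 0)
```

Every field is zero, so nop_word is the all-zero word: the instruction shifts
zero by zero and "writes" the result to a register that ignores writes.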
From: mash@mash.engr.sgi.com (John R. Mashey)
Newsgroups: comp.arch
Subject: Re: TOP 10 MISTAKES IN COMPUTER ARCHITECTURE
Date: 14 Aug 1998 22:22:35 GMT
In article <slrn6t7klq.7j8.mike@ducky.net>, mike@ducky.net (Mike Haertel)
writes:
|> In article <6qvumm$an2$1@murrow.corp.sgi.com>, John R. Mashey wrote:
|> > c) Some ISAs have made use of zero as a target of load
|> > instructions, to turn them into prefetches, which is fairly
|> > elegant. Others have separate prefetch instructions.
Note: MIPS uses separate prefetch instructions, for various reasons.
I was willing to call this elegant in that:
(a) Given that one has a zero register.
(b) Opcode formats likely offer natural load and store instructions
that use the zero register. From inspection of much code,
store[type] zero somewhere is fairly frequent.
(c) On the other hand, the instructions:
load[type] zero,somewhere
usually have natural encodings, but of course do not modify zero.
(d) Hence, overloading this to mean prefetch is elegant in the sense
that:
(a) It uses an encoding that is already likely to be there.
It saves burning an opcode.
(b) As discussed earlier, bypass/rename logic makes sure
that loads into zero don't get bypassed, and that
zero never gets renamed anyway.
(c) It even makes intuitive sense, in that
(d) It may make CPU evolution a little easier:
CPU ISA #1 didn't give any thought to this.
load zero,address
fetches the data, and discards it ... but of course,
compilers generally don't generate it, and the only
conceivable use is as an address probe, and in most
cases, one could use the same address, and load any
unused register, and all would be well.
By CPU ISA #2, somebody says: "need prefetch".
If you add a new opcode, unless it was reserved in
ISA #1 as "unused, but ignored", the new opcode
will get trapped by the older CPUs, which means
you may end up having to generate 2 flavors of code,
which makes ISVs unhappy.
If you do the load zero trick, and if you only generate
code where the address is legal, then the code
will run fine on both flavors of CPUs. Of course,
if people really had been using it as an address probe,
it will not have the expected effect on CPU #2.
Of course, if #2 has a bunch of new instructions you
want, you may end up generating separate code anyway.
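The compatibility argument can be sketched with a toy model of the two CPU
generations. Everything here is hypothetical: the function names, the string
opcodes, the dictionary used as memory and the set used as a cache are all
invented, and CPU #2 is assumed to treat a load to zero as a pure hint that
never traps.

```python
class ReservedInstruction(Exception):
    """Raised when a CPU sees an opcode it does not implement."""


class AddressError(Exception):
    """Raised when a load touches an address that is not mapped."""


def cpu1_execute(op, rd, addr, memory):
    """ISA #1: no prefetch; a load to register 0 fetches and discards."""
    if op == "load":
        if addr not in memory:
            raise AddressError(hex(addr))
        value = memory[addr]
        return None if rd == 0 else value
    raise ReservedInstruction(op)       # a new opcode traps on the old CPU


def cpu2_execute(op, rd, addr, memory, cache):
    """ISA #2: a load to register 0 (or explicit prefetch) is only a hint."""
    if op == "prefetch" or (op == "load" and rd == 0):
        if addr in memory:
            cache.add(addr)             # warm the cache if the address is legal
        return None                     # never traps
    if op == "load":
        if addr not in memory:
            raise AddressError(hex(addr))
        cache.add(addr)
        return memory[addr]
    raise ReservedInstruction(op)
```

With a legal address, ("load", 0, addr) runs on both models, while "prefetch"
raises ReservedInstruction on the older one. With an illegal address, the older
model raises AddressError and the newer one silently ignores it, which is the
address-probe caveat above.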
|> So I guess this is "elegant" the same way doing jumps via
|> "move to pc" (pdp-11?) is "elegant". It may look elegant in
|> concept, but any realistic implementation needs special cases
|> to recognize it, and then if you're going to do special cases
|> you would have been better off choosing a different instruction
|> encoding in the first place.
Again, MIPS did it as a separate instruction, and it may be cleaner,
but I'll also bet this is another one of those special cases where, in
the implementations that have it, the number of gates is pretty small.
After all, load xx,address and prefetch address
both need to provide: address, plus:
a) address-checking required, trap if bad versus: ignore
b) Dependency on load result, or just a hint, can ignore.
From: mash@mash.engr.sgi.com (John R. Mashey)
Newsgroups: comp.arch
Subject: Re: TOP 10 MISTAKES IN COMPUTER ARCHITECTURE
Date: 15 Aug 1998 04:49:19 GMT
In article <6r2lur$jni@gurney.reilly.home>, reilly@zeta.org.au (Andrew
Reilly) writes:
|> I think you've managed to save a total of two opcodes (load
|> immediate and prefetch) so far. See Bernd Paysan's article that
|> covers the other cases that have been brought up so far.
|>
|> That's really worth losing a general-purpose register for?
Please read what I posted carefully, and compare it with what Bernd posted.
Bernd (and you apparently) focus on arguments on the marginality of
features that I stated were of modest use, and somehow keep missing the
ones that I stated were important. I stated they were important
because I've looked at many thousands of lines of generated code for
systems that have a zero register, and because I've looked at
instruction count summaries of billions of cycles, and finally,
because I've been involved in some of the discussions of tradeoffs,
both internally, and with other people who do this.
One more time:
1) In some ISAs, the feature can save a few opcodes, and opcodes,
especially in 32-bit-fixed-size-instructions, can get precious.
2) Put another way, there is a tradeoff between adding opcodes
and reducing path-lengths, and for some features, it is almost
impossible to say whether a feature is a good addition or not
without understanding the rest of the ISA already committed.
In ISA #1, feature A may yield 1% improvement by whatever metrics
one uses, but in ISA #2, the same feature may yield 0.1%, because
a combination of several other features already gets most of
the cases. [Informal discussions in bars in Cupertino, or at
Hot Chips tables sometimes gets into such things.]
3) In the MIPS ISA, as I said, one major value comes from
path-length reduction & code-size reduction because
store zero doesn't need a register clear. On IRIX,
try dis /unix|grep 's[bwhd](tab)zero' to see examples.
A cursory grepping finds that, by static count, 1+% of the
instructions in IRIX are stores of zero.
4) The other one I said was important was the set of
branch-equal branch-not-equal instructions, where the zero
register gives nice, non-special comparisons, so that one
can get compare-and-branch of two registers, and compare-to-zero,
both of which are frequent enough to be interesting.
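The static count in 3) is easy to reproduce on any disassembly listing. The
sketch below applies the same pattern as the grep above to a short listing;
the listing itself is made up purely for illustration (real dis output carries
addresses and encodings as well), and the function name is invented.

```python
import re

# Same pattern as the grep: sb/sw/sh/sd, a tab, then the zero register.
ZERO_STORE = re.compile(r"s[bwhd]\tzero")


def zero_store_fraction(lines):
    """Return (stores of zero, total instructions) for a listing."""
    total = 0
    zero_stores = 0
    for line in lines:
        if not line.strip():
            continue                    # skip blank lines
        total += 1
        if ZERO_STORE.search(line):
            zero_stores += 1
    return zero_stores, total


# Illustrative listing only; not taken from any real kernel.
SAMPLE = [
    "sw\tzero,0(a0)",
    "lw\tv0,4(a0)",
    "sb\tzero,8(a1)",
    "addu\tv0,v0,a1",
    "sw\tv0,12(a0)",
    "beq\tv0,zero,0x40",
    "sd\tzero,16(sp)",
]
```

Here zero_store_fraction(SAMPLE) counts three stores of zero out of seven
instructions; the branch that compares against zero is not counted, since it is
the other use described in 4).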
|> That's really worth losing a general-purpose register for?
If you have 32: yes, which is why most people designing such have done it.
If you have 8, I'd say no, at least from experience working on 68K
code generators and looking at that code.
With 16: maybe, maybe not: none of the machines I've worked on that had
16 had zero registers.
From: mash@mash.engr.sgi.com (John R. Mashey)
Newsgroups: comp.arch
Subject: Re: TOP 10 MISTAKES IN COMPUTER ARCHITECTURE
Date: 17 Aug 1998 01:51:16 GMT
In article <SCOTT.98Aug14174728@slave.doubleu.com>,
scott@nospam.doubleu.com (Scott Hess) writes:
|> [...] only generate code where
|> the address is legal, then the code will run fine on both flavors
|> of CPUs. Of course, if people really had been using it as an
|> address probe, it will not have the expected effect on CPU #2. Of
|> course, if #2 has a bunch of new instructions you want, you may end
|> up generating separate code anyway.
|>
|> Perhaps I'm not clear on the distinction between "prefetch, no trap"
|> and "probe for valid address" uses of loading from an address to R0.
|> Why can't it do both?
It could, and without looking at the manuals, I'd guess that
plenty of different combinations have been done; it kind of depends on the
accumulated ISA up to that point, plus expected implementations.
(a) An explicit prefetch instruction (not load zero), could be defined as:
(a) Evaluate the address, for sure, and if it causes a TLBmiss
or protection trap, the trap must be taken (or at least, if this
is an o-o-o machine, the trap is taken if you actually get there).
(b) The prefetch is an optional hint, and the CPU is free to
discard it at any stage, even before checking the address,
and the prefetch never causes traps of any sort.
It is clear that if what you really want is to check an address,
for the side-effect of causing a trap, (a) works, but (b) does not.
In both cases, actually loading a cache line is optional.
(b) load zero,address, if the way prefetch is implemented,
would likely have to be defined as either (a) or (b),
but it is hard to have the same bit pattern mean both.
Note: hardware designers *really* like (b), for instance, because if
load/store queue entries are a scarce resource at some point, prefetches
can just be thrown away.
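The two possible definitions can be put side by side in a small sketch. The
names (prefetch_checked, prefetch_hint, TLBMiss) and the set-based model of
mapped pages and cache contents are assumptions made for illustration only.

```python
class TLBMiss(Exception):
    """Stands in for a TLB-miss or protection trap."""


def prefetch_checked(addr, mapped, cache):
    """Definition (a): the address is always evaluated; a bad one traps."""
    if addr not in mapped:
        raise TLBMiss(hex(addr))
    cache.add(addr)     # loading the line is optional; this sketch always does


def prefetch_hint(addr, mapped, cache, queue_free):
    """Definition (b): a pure hint that may be dropped and never traps.

    Returns True if the line was actually brought into the cache.
    """
    if queue_free == 0 or addr not in mapped:
        return False    # dropped: no queue entry, or the address is bad
    cache.add(addr)
    return True
```

Only prefetch_checked is usable as an address probe, and only prefetch_hint
can be discarded when load/store queue entries run out, which is why one bit
pattern cannot easily serve both purposes.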
|> The sequence that comes to mind is where you test a tag and only
|> access the memory if the tag indicates it's valid - as someone
|> implied, you couldn't pull the prefetch above the tag test unless it's
|> non-trapping. Otherwise you could get a fault which wouldn't have
|> happened if the prefetch weren't present.
Yes.
From: mash@mash.engr.sgi.com (John R. Mashey)
Newsgroups: comp.arch
Subject: Re: TOP 10 MISTAKES IN COMPUTER ARCHITECTURE
Date: 19 Aug 1998 18:02:22 GMT
Thank goodness, a knowledgeable posting, a welcome change to this thread...
In article <1998081903553700.XAA02501@ladder01.news.aol.com>,
mitchalsup@aol.com (MitchAlsup) writes:
|> And when you have a 6-bit opcode, you find that you run
Note: Mr. Keane labeled as "ridiculous and no one would do it" the idea of
6-bit opcodes. The main opcode in, for example, both MIPS and SPARC
(and others) is in fact 6-bits, although of course some main opcodes have
sub-opcodes, as for register-register operations.
|> Yes--this is EXACTLY why r0=0 is so useful, simpler operations
|> are degenerate cases of the standard (complexity) operations.
|> Since we HAVE to design the data path to handle the standard
|> cases at full operating speed, these simple operations fall out
|> for free (or even better--see above).
This is a *really* good point, as it bears on one of the most common
fallacies that comes up again and again in comp.arch, i.e.,
acting like hardware is software, when it isn't. This most often happens
when people offer opinions on difficulty of implementation without
understanding the typical implementation methods.
HARDWARE IS NOT SOFTWARE!
A lot of computations & tests, that in software, look like a lot of code,
and might be worth optimizing for special-cases, are done by parallel
hardware that has to be there, and special-casing it only makes it worse,
or doesn't help.
Instantiations of this fallacy that have shown up in this thread include:
1) The idea that there is a lot of hardware to do special-case checks for
register zero somehow wired into a lot of instructions. This is like thinking
from a C model of an instruction set that has a lot of in-line code.
2) The idea that there are big savings from avoiding a simple ALU op
in favor of special-case-detecting a MOVE operation ... when, in many
designs, trying to go around the ALU only adds wires and logic.
(Yes, there are a few cases where this might not be true, but in general,
given that any sensible design must make ALUs fast, trying to go around them
isn't something people care about very often.)
3) The idea that in general, there is a big savings in dependency-check hardware
by having an explicit load-immediate instruction rather than an
add-immediate, in an out-of-order, register-renaming design.
I.e., if you do:
LI reg1,immed
via
ADDI reg1,reg2,immed where reg2 == zero
the dependency-check logic *must* exist for the general case, and it must be
fast, and as Mitch describes, with minimal logic identifies the ADDI as
having no register input dependencies.
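A toy version of that dependency check, in software form, looks like the sketch
below. It assumes a simple map from architectural register number to the tag of
the in-flight instruction that will produce it; the function name and the tag
strings are hypothetical, and a real renamer does these lookups with parallel
comparators rather than a loop.

```python
def input_dependencies(src_regs, pending_writers):
    """Return the tags of in-flight instructions this one must wait for.

    pending_writers maps an architectural register number to the tag of
    the instruction that will write it. Register 0 is never renamed and
    always reads as zero, so it never contributes a dependency.
    """
    return {pending_writers[r]
            for r in src_regs
            if r != 0 and r in pending_writers}


# Register 2 is still being computed by an earlier instruction.
in_flight = {2: "insn7"}

# ADDI reg1,reg2,immed has one register source.
deps_addi = input_dependencies([2], in_flight)

# ADDI reg1,zero,immed (i.e. LI reg1,immed) has only the zero register.
deps_li = input_dependencies([0], in_flight)
```

The general-case check already reports no input dependencies for the
ADDI-from-zero form, so a separate LI opcode would not remove any of this
logic.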
From: mash@mash.engr.sgi.com (John R. Mashey)
Newsgroups: comp.arch
Subject: Re: TOP 10 MISTAKES IN COMPUTER ARCHITECTURE
Date: 27 Aug 1998 17:53:21 GMT
In article <487BC60C4F%rw@shadow.org.uk>, Rich Walker <rw@shadow.org.uk> writes:
|> In message <6s0cl1$8tb$1@ocean.cup.hp.com>
|> morrell@cup.hp.com (Michael Morrell) wrote:
|>
|> > John R. Mashey (mash@mash.engr.sgi.com) wrote:
|> > > 3) This has been sufficiently useful that I've often wished we'd done
|> > > something for floating point 0.0...
|> >
|> > I don't know about other architectures, but PA-RISC defines f0 as 0.0.
|>
|> And ARM provides 8 fp constants (0.0, 1.0, 0.5, 2.0, 3.0, 4.0, 5.0, 10.0)
|> which is about right for the ARM architecture.
|>
|> Mind you, it's still not a great advantage, as the ARM FP macrocell isn't the
|> world's fastest FP core...
Say some more: I don't have the FP details of ARM handy, but I thought
there were only 8 FP registers, which would make it unlikely that they
were dedicated to 8 constants...
Note the distinction between:
1) A register is hardwired to a constant, and participates in the usual
set of operations just like any other register.
2) Immediate operands for some instruction formats
a) Integers are straightforward, since small integer constants get
heavy use, and are straightforwardly expanded to full-width
integers, even with sign-extension, by replicating the sign bit.
b) Floating-point is trickier, since few instruction formats have
the space for general FP constants, and therefore a plausible
approach (which sounds like what ARM has done), is to use a small
number of bits to select among a few common FP constants ... which
is more work than sign-extension, but may be useful.
[In MIPS, I've several times wanted an FP Load-Immediate, for example].
3) Of course, choosing a small set of constants for special treatment
requires serious study of the static and dynamic program behavior,
also with reference to compiler technology, i.e., if a program spends
a lot of time in a loop that has:
for (i = 0; i < N; i++) {
...
a[i] = b[i] + constant*c[i];
}
In order of increasing speed, to get the constant
(1) One could execute a series of integer/FP Operations to create the
constant, depending on the CPU & the constant. This uses
instructions, but no data memory.
(2) One could have built a literal pool somewhere, and just load
the value into a register. This is often less instructions, but uses
data memory, and a memory reference, which may or may not be a
cache miss.
(3) One can have a floating-load-immediate, which is like (1), but
speeds up common cases, at the cost of an opcode or two.
(4) One could have immediate versions of FP operations, i.e.,
like multiply-immediate, which is fast, but may burn opcodes,
or may not.
The tricky part is that making such choices isn't obvious, because it
is also affected by such things as:
number of FP registers
optimization, global register allocation
In the sample code fragment above, a good compiler, with enough registers,
will have materialized the constant once in the setup for the loop,
and if you think that your FP code is dominated by such loops, then
any extra hardware for FP constants is probably a waste. If your code is
dominated by small functions with lots of branches, many constants,
and few low-level loops, then more of the time is going towards constant setup,
and it might be worth some hardware to help.
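The difference between 2a) and 2b) above can be shown in a few lines: integer
immediates expand by replicating the sign bit, while FP immediates need a
small lookup. The constant values are the eight quoted earlier in this thread;
the 3-bit selector and the order of the table are assumptions for the sketch,
not a statement of the actual ARM encoding.

```python
def sign_extend(value, bits):
    """Expand a bits-wide field to a full integer by replicating the sign bit."""
    sign = 1 << (bits - 1)
    value &= (1 << bits) - 1
    return (value ^ sign) - sign


# The eight FP constants mentioned in the thread; the ordering is assumed.
FP_IMMEDIATES = (0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 0.5, 10.0)


def fp_immediate(selector):
    """Map a 3-bit selector field to one of the eight FP constants."""
    return FP_IMMEDIATES[selector & 0x7]
```

sign_extend handles every value of the field with one rule, whereas
fp_immediate can only ever deliver the eight values someone chose in advance,
which is why the choice of constants needs the program-behavior study
described in 3).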
From: mash@mash.engr.sgi.com (John R. Mashey)
Newsgroups: comp.arch
Subject: Re: TOP 10 MISTAKES IN COMPUTER ARCHITECTURE
Date: 28 Aug 1998 01:56:02 GMT
In article <6s4dl4$sip$1@news.ox.ac.uk>, Thomas Womack
<mert0236@sable.ox.ac.uk> writes:
|> Organization: Oxford University, England
|>
|> John R. Mashey <mash@mash.engr.sgi.com> wrote:
|> : In article <487BC60C4F%rw@shadow.org.uk>, Rich Walker <rw@shadow.org.uk> writes:
|> : |> And ARM provides 8 fp constants (0.0, 1.0, 0.5, 2.0, 3.0, 4.0, 5.0, 10.0)
|> : |> which is about right for the ARM architecture.
|> : |>
|> : |> Mind you, it's still not a great advantage, as the ARM FP macrocell isn't the
|> : |> world's fastest FP core...
|>
|> : Say some more: I don't have the FP details of ARM handy, but I thought
|> : there were only 8 FP registers, which would make it unlikely that they
|> : were dedicated to 8 constants...
|>
|> No; effectively, there are 16 FP registers eight of which are devoted to
|> constants (if I recall some very old documents correctly). So it's the
|> second of your options.
I'm back home where my ARM book (van Someren & Atack) is:
it dates from 1994, and says that:
1) It has 8 fp registers, f0-f7.
2) "indicates that the argument may be either a valid floating-point
register or an immediate operand ... small constant from the
following list: 0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 0.5, 10.0."