United States Patent6609189
Kuszmaul , ; et al.August 19, 2003

Title

Cycle segmented prefix circuits

Abstract

The poor scalability of existing superscalar processors has been of great concern to the computer engineering community. In particular, the critical-path delays of many components in existing implementations grow quadratically with the issue width and the window size. This patent presents a novel way to reimplement these components and reduce their critical-path delay growth. It then describes an entire processor microarchitecture, called the Ultrascalar processor, that has better critical-path delay growth than existing superscalars. Most of our scalable designs are based on a single circuit, a cyclic segmented parallel prefix (cspp). We observe that processor components typically operate on a wrap-around sequence of instructions, computing some associative property of that sequence. For example, to assign an ALU to the oldest requesting instruction, each instruction in the instruction sequence must be told whether any preceding instructions are requesting an ALU. Similarly, to read an argument register, an instruction must somehow communicate with the most recent preceding instruction that wrote that register. A cspp circuit can implement such functions by computing for each instruction within a wrap-around instruction sequence the accumulative result of applying some associative operator to all the preceding instructions. A cspp circuit has a critical path gate delay logarithmic in the length of the instruction sequence. Depending on its associative operation and its layout, a cspp circuit can have a critical path wire delay sublinear in the length of the instruction sequence.


Inventors:Kuszmaul; Bradley C. (Woodbridge, CT), Henry-Kuszmaul; Dana Sue  (Woodbridge, CT)
Assignee:Yale University (New Haven, CT)
Appl. No.:267827
Filed:March 12, 1999

Current U.S. Class:712/23 712/3 708/421 708/632 712/12 712/221 
Current International Class:G06F 9/38 (20060101)
Field of Search:712/1,2,3,7,8,9,16,23,221,12 702/300,201,10,36,23,19,18,17,25,27,26,38,212,129,29,35,34 708/682,505,506,421,503,517,504,521,632,534 341/67,76,93 375/359 710/12,21,33,38,51,72,111,130,243,125,126

U.S. Patent Documents
3716851February 1973Neumann
4594655June 1986Hao et al.
5333135July 1994Wendorf
5390298February 1995Kuszmaul et al.
5544325August 1996Denny et al.
5560025September 1996Gupta et al.
5590283December 1996Hills et al.
5708841January 1998Popescu et al.
5822778October 1998Dutton et al.
5999961December 1999Manohar et al.
6151295November 2000Ma et al.
Other References
F Thomson Leighton, Introduction to parallel algorithms and architectures : arrays, trees, hypercubes, M. K. Publishers, CA, Sep. 1991,p32-44.
Primary Examiner: Pan; Daniel H.
Attorney, Agent or Firm:Pennie & Edmonds LLP

Parent Case Text



CROSS REFERENCE TO RELATED APPLICATIONS

This application claims the benefit of Provisional Application Serial No. 60/077,669, filed Mar. 12, 1998, and Provisional Application Serial No. 60/108,318, filed Nov. 13, 1998, both of which are incorporated herein by reference in their entireties.

Claims


What is claimed is:
1. A circuit comprising: a plurality of multiple input multiplexers and multiple input function generators connected in series, each multiplexer having one input that is an identity function and an output that is selected by a segment bit, and each function generator having one input that is output of the preceding multiplexer and an output that is an input to the next multiplexer, and the output of the last function generator is connected as an input to the first multiplexer.

2. A circuit comprising: a plurality of modules connected in a tree structure, each module comprising: at least first and second function generators, first and second multiplexers and an OR gate, a first input and a first output connecting the module to an output and an input, respectively, of a module closer to the root of the tree structure, a second input and a second output connecting the module to an output and an input, respectively, of a module farther from the root of the tree structure and a third input and a third output connecting the module to an output and an input, respectively, of another module farther from the root of the tree structure, a fourth and a fifth input providing segment bits to the module from modules farther from the root and a fourth output providing a segment bit to a module closer to the root, said first and second inputs of each module being connected as inputs to the first function generator of that module and said second and third inputs being connected as inputs to the second function generator of that module, the first multiplexer having a first input that is an output of the first function generator and a second input that is the second input to that module and the second multiplexer having a first input that is an output of the second function generator and a second input that is the third input to that module, the first and second multiplexers having outputs that are selected by segment bits on the fourth and fifth inputs, the first output of each module being derived from the second multiplexer and one of the second and third outputs being derived from the first multiplexer, and the OR gate generating a logical OR of the fourth and fifth inputs and providing it to the fourth output, and the first output of at least one module closest to the root of the tree structure being connected to the first input of a module closest to the root of the tree structure.

3. A prefix computation circuit comprising: a plurality of input terminals to receive input data, a circuit configuration to perform prefix computation based on said input data, and a plurality of output terminals to provide output data based on said prefix computation, wherein the prefix computation is cyclic and segmented.

4. The circuit of claim 3 wherein the prefix computation is performed serially.

5. The circuit of claim 3 wherein the prefix computation is performed in parallel.

6. The circuit of claim 3 wherein there are N input terminals.

7. A prefix computation circuit comprising: a plurality of N input terminals to receive input data, a circuit configuration to perform prefix computation based on said input data, and a plurality of output terminals to provide output data based on said prefix computation, wherein the prefix computation is cyclic and segmented, and wherein the longest path through the circuit configuration is O (log N) circuit elements.

8. A prefix computation circuit comprising: a plurality of input terminals to receive input data, a circuit configuration to perform prefix computation based on said input data, and a plurality of output terminals to provide output data based on said prefix computation, wherein the prefix computation is cyclic and segmented, and wherein the circuit configuration computes a cyclic segmented prefix using a plurality of modules in which at least one module is connected to each of a plurality of other modules via a child connection comprising an output connected to an input of a child module, a first input connected to a first output of the same child module, and a second input connected to a second output of the same child module, and the child connections are ordered so that each child connection has a predecessor child connection whereby if the second input from a child connection's predecessor is true, then the output of the child connection is an identity value, and if the second input from a child connection's predecessor is false, then the output of the child connection is the same value as a combination, by an associative function, of the first input of the predecessor child connection and the output of the predecessor child connection.

9. A circuit comprising: a plurality of child modules and at least one cyclic prefix module, having a plurality of child connections, each child connection comprising: an output connected to an input of a child module, a first input connected to a first output of the same child module, and a second input connected to a second output of the same child module, the child connections being ordered so that each child connection has a predecessor child connection; the cyclic prefix module further comprising a circuit which computes a cyclic segmented prefix, whereby if the second input from a child connection's predecessor is true, then the output of the child connection is an identity value, and if the second input from a child connection's predecessor is false, then the output of the child connection is the same value as a combination, by an associative function, of the first input of the predecessor child connection and the output of the predecessor child connection.

10. A circuit, as claimed in claim 9, in which there are N child modules and the cyclic prefix module comprises a circuit in which the longest path through the circuit is O(log N) circuit elements.

11. A circuit, as claimed in claim 9, wherein each child module comprises an input which encodes an input value, a first output which encodes an output value and a second output which encodes a boolean value.

12. A circuit, as claimed in claim 9, wherein each of the child connections is identified by a different number from 1 to N, and for each leaf connection numbered I, if I is less than N the child connection is called the predecessor of child connection number I+1, and if I equals N the child connection is called the predecessor of child connection number 1.

13. A circuit as claimed in claim 9, in which one of the at least one cyclic prefix modules has a root connection, the root connection comprising an output, and an input, the root output of the module connected to the root input of the module, whereby the module performs a segmented parallel prefix computation, in which if the second input of child connection N is true, then the root output is the identity value, and if the second input of child connection N is false, then the root output is the same value as a combination, by the associative function, of the first input of child connection N with the first input of child connection 1, and the output of child connection 1 is the same value as the root input, and for each child connection other than child connection 1, if the second input from the child connection's predecessor is true, then the output of the child connection is the identity value, and if the second input from the child connection's predecessor is false, then the output of the child connection is the same value as a combination, by the associative function, of the output of the child connection's predecessor and the input of the child connection's predecessor.

14. A circuit, as claimed in claim 13, organized as a tree-structured circuit in which the longest path through the circuit is O(log N) circuit elements.

15. A circuit comprising: a plurality of leaf modules, each leaf module comprising: an input which encodes an input value, a first output which encodes an output value, and a second output which encodes a boolean value; a cyclic prefix module, having a plurality of leaf connections, each leaf connection comprising: an output connected to the input of a leaf module, a first input connected to the first output of the same leaf module, and a second input connected to the second output of the same leaf module, there being a total of N leaf connections, each leaf connection being identified by a different number from 1 to N and for each leaf connection numbered I, if I is less than N the leaf connection being called the predecessor of leaf connection number I+1, and if I equals N the leaf connection being called the predecessor of leaf connection number 1; the cyclic prefix module further comprising a circuit which computes a cyclic segmented prefix, whereby if the second input from a leaf connection's predecessor is true, then the output of the leaf connection is an identity value, and if the second input from a leaf connection's predecessor is false, then the output of the leaf connection is the same value as a combination, by an associative function, of the first input of the leaf connection's predecessor and the output of the leaf connection's predecessor.

16. A circuit, as claimed in claim 15, organized as a tree-structured circuit in which the longest path through the circuit is O(log N) circuit elements.

17. A circuit as claimed in claim 15, in which one of the cyclic prefix modules has a root connection, the root connection comprising an output and an input, the root output of the module connected to the root input of the module, whereby the module performs a segmented parallel prefix computation, in which if the second input of child connection N is true, then the root output is the identity value, and if the second input of child connection N is false, then the root output is the same value as a combination, by the associative function, of the first input of child connection N with the first input of child connection 1, and the output of child connection 1 is the same value as the root input, and for each child connection other than child connection 1, if the second input from the child connection's predecessor is true, then the output of the child connection is the identity value, and if the second input from the child connection's predecessor is false, then the output of the child connection is the same value as a combination, by the associative function, of the output of the child connection's predecessor and the input of the child connection's predecessor.

18. A circuit, as claimed in claim 17, organized as a tree-structured circuit in which the longest path through the circuit is O(log N) circuit elements.

Description

1. BACKGROUND OF THE INVENTION

This invention draws from two different areas: parallel-prefix circuits and superscalar-processor circuits. Parallel-prefix circuits have, in the past, been most often used for parallel computation in machine such as the Connection Machine CM-5
supercomputer. (See, for example, [25, 18, 16, 7, 2].) Throughout this patent, numbers enclosed in square brackets refer to the references cited in Section 4.4 below, each of which is incorporated by reference herein.

Superscalar processor circuits are used to implement processors that exploit instruction-level parallelism (ILP) using out-of order execution. Instruction-level parallelism is the parallelism that can be found in a serial instruction stream because certain of the serial chain of operations are actually independent. One strategy to exploit ILP is to execute the instructions in a different order from that specified by the original serial execution, hence "Out-of-order execution". A mechanism for out-of-order execution was first described in [44].

The rest of this section first describes parallel-prefix circuits and then describes today's superscalar-processor circuits.

Notation

We use the following notation: O: The "big-Oh" notation is used to indicate how fast a function grows, ignoring constant factors. Intuitively, we write "g(x) is O(f(x))" if g(x) grows no faster than f(x), for large x ignoring a constant multiplicative factor. .OMEGA.: The "big-Omega" notation is used similarly. Intuitively, we write "g(x) is .OMEGA.(f(x))" if g(x) grows no slower than f(x), for large x ignoring a constant multiplicative factor. .THETA.: The "big-Theta" notation is the intersection of big-Oh and big-Omega. g(x) is .THETA.(f(x)) exactly when g(x) is O(f(x)) and g(x) is .OMEGA.(f(x)). Intuitively, this means that g(x) grows at the same speed as f(x) for large x ignoring a constant multiplicative factor. See [5] for a complete and exact treatment of the O, .OMEGA., and .THETA. notation. lg I is the logarithm, base 2, of I. log I is the logarithm of I in the natural base e.

Thus, lg I=log.sub.2 I. Note that choice of the base of a logarithm makes only a difference of a constant factor. E.g., lg I.apprxeq.0.693 log I, which means that lg I is .THETA.(log I) and log I is .THETA.(lg I). In this document we generally use the log base two and we use binary trees. There may be engineering reasons to use a trees of higher degree because the base of the log changes which gives a constant-fold change in the performance. Our designs work for any base and any degree of the tree, including trees of mixed degree. Ceiling: We write .left brkt-top.x.right brkt-top. (pronounced "the ceiling of x") to be the smallest integer greater than or equal to x. append lists: If a and b are two lists, then {a,b} is the concatenation of the two lists. The set of integers x such that a.ltoreq.x.ltoreq.b is denoted by [a . . . , b]. Base numerals: When we write 0010.sub.2 the string "0010" should be interpreted as a base two number. Thus 1010.sub.2 =12.sub.8 =10.sub.10 =10.

1.1 Parallel Prefix

This section contains a tutorial on parallel-prefix circuits. First we define the prefix problem, then show how to implement it to run fast. Parallel-prefix circuit design is a technique that can often convert linear-time circuits into logarithmic-time circuits. (See, for example, [5] for a discussion of log-depth parallel-prefix circuits. Segmented parallel prefix circuits are the subject of an exercise of [5], and were implemented in the CM-5 supercomputer [25, 18, 16, 7].)

The prefix problem is as follows. One is given an associative operator .times. with an identity value, I. Given some inputs x.sub.0, x.sub.1, . . . , x.sub.n-1 we need to compute y.sub.0, y.sub.1, . . . , y.sub.n as: y.sub.i =x.sub.0.times. x.sub.1.times. . . . .times.x.sub.i-1, where y.sub.0 is defined to be the identity value for the .times.. (For example, if .times. is addition (and the identity for addition is 0), then y.sub.i =.SIGMA..sub.j=0.sup.i-1 x.sub.j.)

Sometimes one wants special initial and final values. One can formulate the prefix problem as having an initial value z that is passed to the circuit. In this case we have y.sub.i =z.times.x.sub.0.times.x.sub.1 . . . .times.x.sub.i-1. This can be viewed as the earlier case simply by renumbering the subscripts so that we have ##EQU1##

and then performing a parallel prefix on the x' values. Similarly, one would like to get a final output value w from the circuit which is defined to be w=z{circle around (.times.)}x.sub.0.times.x.sub.1 . . . .times.x.sub.n.

Again, this can be implemented by the earlier case by manipulating subscripts. We simply extend the subscript range to n+1 and compute w as y.sub.n+1.

The natural and easy thing to do is to compute the y.sub.i 's serially. First one computes each y.sub.i-1 and uses that to compute y.sub.i as ##EQU2##

FIG. 1 shows a circuit 10 that computes the prefix operation in linear time. Circuit 10 comprises a plurality of function generators 15, each having two inputs and one output. Each output is connected as one of the inputs to the next function generator and the other input is an x value. It is easy to see that prefix can be computed in time linear in n. It is surprising to many people that the prefix problem can be solved in time logarithmic in n by using a circuit structure known as parallel prefix. The next three sections review the parallel prefix circuit.

1.1.1 Log-Time Parallel Prefix

Before reviewing the construction of parallel-prefix circuits in general, we present an example. FIG. 2 shows a parallel-prefix circuit 20 that takes eight inputs, x.sub.0, x.sub.1, . . . , x.sub.7, and computes their prefix sums ##EQU3##

Circuit 20 comprises fourteen two-input adders 25 connected in a tree-like structure by signal wires 27 as shown. The inputs x.sub.i are provided at the bottom of the circuit, and the outputs y.sub.i come out the bottom, with output y.sub.i coming out just to the left of where input x.sub.i goes in. The identity (zero) goes in at the top and the sum of all the values (y.sub.8) comes out at the top. The critical-path length of this circuit is logarithmic in the number of inputs. This circuit can be laid out in VLSI using an H-tree layout [23] with a resulting area of about A=O(n.sup.2 b.sup.2) where b is the number of bits in the result y.sub.n. The resulting wire delay is about O(A). We can further optimize the parallel-prefix sum circuit of FIG. 2. If we use a redundant representation (such as the carry-save adder as used in Wallace-tree multipliers), with a single final sum at the end, we can perform the entire parallel-prefix sum in only O(log n) gate delays as opposed to O(log.sup.2 n). Furthermore, often the width of the data values is smaller at the inputs than at the outputs (for example, when the inputs x, to a sum are only one bit each, but the output is log n bits, a case which will come up later in this patent), then we can carefully size the ALUs so that they take just the right number of bits as inputs and produce the right number of bits as outputs, which will save area and power. One important special case is when the x.sub.i 's are one-bit each. The problem of summing one-bit inputs is often referred to as the enumeration problem.

In general, a parallel prefix circuit comprises a tree as shown in FIG. 3. The tree 30 comprises a plurality of treefix modules 35 at its vertices connected by signal wires 37 as shown. The x.sub.i values are input at the leaves of the tree (at the bottom of the figure). The results y.sub.i are also output at the leaves, adjacent to the corresponding x.sub.i 's. The identity value I is input at the root of the tree (at the top of the figure) and the result y.sub.8 of combining all the y.sub.i values is output at the root of the tree. The signal wires may be several bits wide in order to encode the necessary information. The values along each signal wire have been labeled. We use the notation P.sub.ij to indicate that a particular wire carries the value x.sub.i.times.x.sub.i+1.times. . . . .times.x.sub.j. Thus ##EQU4##

(If j<i then p.sub.ij is the identity value.) The circuit computes y.sub.j =p.sub.0j-1 for 0.ltoreq.j.ltoreq.7. (See [5] for a discussion of how to adapt the circuit of FIG. 3 to compute the special cases of w and z mentioned above.)

A treefix module 35 of FIG. 3 is shown in more detail in FIG. 4. Each treefix module has two function generators 42, 43, three inputs 44, 45, 46, and three outputs, 47, 48, 49 arranged in pairs of an input and an output. One pair connects to the circuit above, one to the lower-left and one to the lower-right. There are some integers j, k, and m, with j<k<m, such that the data coming from above will be P.sub.0j-1, the data coming from the lower-left will be p.sub.j,k-1 and the data coming from the lower-right will be P.sub.k,m-1. The treefix module then produces p.sub.j,m-1 which is output to above, p.sub.0j-1 which is output to lower-left and p.sub.0,k-1 which is output to lower-right. The reader can check that these are in fact the values carried on the wires of FIG. 3. The circuit to compute these values is very easy to design since

and

Although the tree in FIG. 3 has a branching factor of two (that is, it is a binary tree), all the parallel-prefix circuits described in this patent can be implemented with an arbitrary branching factor. The choice of an appropriate branching factor depends on the parameters of a particular technology. For illustration, we will show all our circuits with a branching factor of two.

1.1.2 Segmented Parallel Prefix

A segmented parallel-prefix circuit is also known. The segmented-prefix circuit is similar to the prefix circuit. A segmented-prefix operation comprises a collection of separate prefix operations over adjacent non-overlapping segments of the input x.sub.0, x.sub.1, . . . , x.sub.n-1. The way this works is that in addition to providing inputs x.sub.p to the prefix circuit we provide additional 1-bit inputs s.sub.i called the segment bits. The segment bits indicate where a new segment is about to begin. The output is ##EQU5##

where k.sub.i =max{0, max{k<i:s.sub.k =1}}.

Thus if we have ##EQU6##

A linear-time segmented parallel-prefix circuit 50 is shown in FIG. 5. Circuit 50, comprises a plurality of two-input function generators 55. One input to each function generator is the output of a two-input multiplexer (MUX) 58, which output is selected by a segment bit. One input to each MUX is the identity function. The other input is the output of the preceding function generator. The other input to each function generator is an x value. This is similar to the circuit of FIG. 1 except that MUXes have been added to handle the segment bits.

The segmented parallel-prefix tree circuit has the same structure as the ordinary parallel-prefix tree, except that we modify the treefix module to compute an additional segmentation signal s.sub.j,k-1 , which is passed up the tree. The value s.sub.j,k-1 indicates if any of the segments bits are equal to one. FIG. 6 shows a segmented parallel-prefix circuit 60 with eight leaf nodes (n=8). The circuit comprises a plurality of treefix modules 65 at the vertices connected by signal wires 67 as shown. The tree uses the slightly modified treefix module 65 shown in FIG. 7. Circuit 65 comprises two function generators 72, 73, two multiplexers 75, 76 and an OR gate 78. The circuit also comprises two inputs, 81, 82 and one output 83 for the segment bits and the same three inputs 84, 85, 86 and the three outputs 87, 88, 89 as in the treefix module of FIG. 4. An OR-gate 78 computes the segmentation signal that will be passed up. Multiplexer (MUX) 75 operates so that no value will be added from above to the value from the left subtree if there is a segment bit in the left subtree. MUX 76 operates so that no value will be added from the left subtree if there is a segment bit in the right subtree.

A segmented parallel-prefix circuit can be viewed as a prefix computation on values which are pairs: <P,S> where P is the value being operated on by the original operator, and S is a segmentation bit. We have the following operator: ##EQU7##

We can show that this operator is associative. To do this we show that

Proof: If S.sub.c =1 then ##EQU8##

Thus, a segmented parallel-prefix is associative, and our tree switch circuit can be viewed as an ordinary parallel-prefix circuit with a certain associative operator. (See [5, Exercise 30-1].)

1.1.3 Variations on Prefix Circuits

Often, the prefix is modified slightly from the formulae given above. For example, an inclusive prefix has ##EQU9##

instead of ##EQU10##

An inclusive segmented prefix has ##EQU11##

instead of ##EQU12##

Sometimes it is useful to have a backwards prefix operation. An exclusive backwards prefix operation is ##EQU13##

Similarly, an inclusive version can be made with and without segmentation.

When implementing these circuits, one must be careful to get the "fencepost" conditions right. That is, the lower and upper bounds to the summation index must be thought through carefully to avoid designed-in errors in the circuits.

Note: Sometimes prefix circuits are called "scan chains" (See, e.g., [9,27].)

1.2 Superscalar Processors

The second background area for this invention is superscalar processors. FIGS. 8-10 show a six-stage pipeline processor 100 illustrative of how today's superscalar processors are organized. The stages, are Fetch, Rename, Analyze, Schedule, Execute, and Broadcast. An example set of three instructions is shown as they propagate through each of the stages.

The Fetch stage comprises an arithmetic logic unit (ALU) 105, a program counter 110, a multiplexer 115, and a pipeline register 120. Program counter 110 keeps track of the next instruction to be executed. The program counter's value is used as a memory address to fetch several words from an instruction cache ("I-cache" (not shown).) In this example, four instructions are fetched from the I-cache. The new program counter is computed by adding a value to the old program count. The value added depends on prediction logic (labeled "PREDICT".) If the prediction logic predicts that there are no taken branches in the next four instructions, then the new program count is the old program count plus 4. If the prediction logic predicts that one of the instructions is a taken branch, then the predictor selects the branching instruction, and the "offset" field of the instruction is added to the old program count to get the new program count. The logic for handling absolute branches and mispredictions is not shown here. The four instructions are latched in pipeline register 120 between the Fetch and Rename stages.

The Rename stage comprises renaming logic 125 including a renaming table 130 and a pipeline register 135. The rename stage takes instructions and rewrites them to make it easier to execute them in parallel. In the example shown the instructions are 0: R0:=R1+R2 1: R3:=R0/R1 2: R0:=R1/R5

Note that Instruction 2 is logically independent of Instruction 1, and they may run in parallel. Instruction 2 writes to Register R0, however, and if that write happens before Instruction 1 reads from Register R0, then the wrong value will be used by Instruction 1. This problem is referred to as a "write-after-read" hazard. To allow parallel execution of Instructions 1 and 2 without suffering from the write-after-read hazard, the renaming stage rewrites the instruction to use a different set of registers, called tags. The program is transformed into 0: T42:=T35+T40 1: T43:=T42/T35 2: T44:=T35/T41

In this case we assumed that the most recent writer of Register R1 had been renamed to write to Tag T35, the most recent R2 to T40, and the most recent R5 to T41. Three new tags are allocated: T42, T43, and T44, and the registers are renamed appropriately. To facilitate this renaming a Renaming Table is used to provice a mapping from register names to tag names.

FIG. 11 shows the contents of the renaming table before and after renaming each of the three instructions mentioned above. Note that the table renames all the registers, not just the ones mentioned earlier. Thus, initially R0 is renamed to T30, and then after renaming Instruction 0, R0 is renamed to T42, and then after renaming Instruction 2, R0 is renamed to T44. Register R3 was initially renamed to T25 and was renamed to T43 after Instruction 1. The renaming for R2 was not affected by these three instructions, since none of the instructions mentioned R2 as a destination register.

The circuit for identifying free tags is not shown in FIG. 11. Such a circuit would, in this case, identify up to four unused tags. In our example we assume that the four unused tags are T42 through T45. Some systems allocate and deallocate the tags so that contiguous tags are always allocated, whereas some systems can allocate an arbitrary set of four tags.

The exact implementation of the Rename stage varies from processor to processor. Although we have shown the Rename stage renaming registers to successive tags (T42, T43, and T44), in general, there is no requirement that sequentially allocated tags have sequential numbers. Some superscalar processor implementations do have such a requirement. Others do not. Those that do not require circuitry to identify on every clock cycle up to four unused tags out of the list of unused tags. The renaming table also need not be directly addressed by the logical register number. The Digital Alpha 21264 compresses entries out of the renaming table when they are no longer needed. This means that the table requires an associative lookup, instead of a direct lookup [8].

After renaming, instructions are sent via pipeline register 135 to the Analyze Dependencies stage. This stage includes a reordering buffer 140 and a pipeline register 145. In this stage, instructions whose inputs are all available are identified as ready to run. Reordering buffer 140 keeps track of this dependency information. FIG. 9 illustrates some of the instructions stored in the reordering buffer but does not depict additional information that is ordinarily kept there. Instructions are stored in the reordering buffer in sequential order. The "Old" and "New" pointers point at the oldest and the newest instruction in the sequence, respectively. Buffer entries to the left of "Old" and to the right of "New" are not currently in use. A signal is produced for each instruction in the buffer indicating whether it is ready to run, and if so, what execution resource it needs. in FIG. 9 an arrow 143 from buffer 140 to register 145 identifies those instructions that are ready to run, and the absence of an arrow identifies those instructions not ready to run. On the arrow, the labels "Mem", "Add", or "Div" indicatie whether that instruction needs to access memory, an ALU capable of adding, or an ALU capable of dividing. In our example, the instruction at buffer entry 38 is ready to add, the one at entry 41 is ready to access memory, the one at entry 42 is ready to add, and the one at entry 44 is ready to divide. This information is stored in pipeline register 145
between the Analyze Dependencies stage and the Schedule stage.

The Schedule stage comprises a scheduler 150 and a pipeline register 155. This stage assigns specific execution resources to the collection of instructions that are ready to run. In our example, the instructions at entries 38, 41, and 44 are assigned to particular functional units, and the instruction at entry 42 is not scheduled. Scheduler 150 obtains the actual operands from reordering buffer 140, and feeds them via pipeline register 155 to the appropriate functional units in the Execute stage.

The Execute stage comprises an ALU 160 for adding, an ALU 165 for dividing, an interface to memory (168), and a pipeline register 170. This stage executes each of the instructions provided by scheduler 150 so as to actually perform the arithmetic or memory operation. When the result is computed, it notifies the Broadcast stage 175.

The Broadcast stage takes computed results from the Execute stage and broadcasts them back to the Analyze Dependencies stage in FIG. 9. All instructions in the reordering buffer associatively listen to the broadcast. As a result of the broadcast, more instructions may become ready to execute because their dependencies have been satisfied.

Different processors reuse entries in their reordering buffer differently. Those that assign tags serially can use each assigned tag as a direct address into the reordering buffer at which to store the renamed instruction. This is the case in FIG. 9. These processors including the Alpha 21264 write values to a canonical register file by compressing the entries in the reorder buffer so that the instruction in Buffer Entry 0 is always the oldest [8]. When scaled up, the circuitry used in the
21264 for compressing the window requires large area and has long critical-path lengths, however.

When the reorder buffer fills up, some processors exhibit performance anomalies. Some processors empty the reorder buffer, instead of wrapping around, and commit all results to the register file before they start back at the beginning. Some processors wrap around, but start scheduling newer instructions instead of older ones, which can hurt performance. (See [33] for an example of a circuit that works that way.)

In some processors there is an additional decode stage after the fetch stage, but before the rename stage. This stage may interpret a complex instruction stream, translating it to a simpler instruction stream. The Intel Pentium Pro does this.

In most processors there is bypass logic to allow data to move directly from the Broadcast stage to the Execute stage, bypassing the Analyze Dependencies stage and the Schedule stage. We do not show that bypass logic here, but that logic also has long critical-path lengths and large area in today's designs.

1.2.1 Microprocessor Performance

The standard model for modeling the performance of a microprocessor [14] says that the time to run a program is T=N.multidot.CPI.multidot..tau. where N is the number of instructions needed to run the program, CPI is the number of clock periods per instruction, and .tau. is the length of a clock period in seconds, i.e. the cycle time.

The value of .tau. is determined by the critical-path length through any pipeline stage, that is the longest propagation delay through any circuit measured in seconds. Propagation delay consists of delays through both gates and wires, or alternately of delays through transistors driving RC networks. We are not changing N or directly changing CPI, but rather we aim to reduce the clock cycle by redesigning the processor to use circuits with reduced critical-path length.

One way to avoid slowing down the clock is by breaking down the processor into more pipeline stages. Increasing the number of pipeline stages offers diminishing returns, however, as pipeline registers begin to take up a greater fraction of every clock cycle and as more clock cycles are needed to resolve data and control hazards. In contrast, shortening the critical path delay of the slowest pipeline stage translates directly into improved program speed as the clock period decreases and the other two parameters remain unchanged.

The critical-path delays of many of today's processor circuits do not scale well. For example, Palacharla, Jouppi, and Smith [32J find that many of the circuits in today's superscalars have asymptotic complexity .OMEGA.(I.sup.2 +W.sup.2), where I is the issue width (i.e., the maximum number of instructions that are fetched in parallel from the cache) and W is the window size (i.e., the maximum number of instructions within the processor core) of the processor. While delays appear to be practically linear for today's processors, optimized for I equal to four, and W in the range of 40 to 56, the quadratic terms appear to become important for slightly larger values of I and W. (Note that for today's processors with large W the window is typically broken in half with a pipeline delay being paid elsewhere. An HP processor sets W=56 [10]. The DEC 21264 sets W=40 [8]. Those systems employ two windows, each half size, to reduce the critical-path length of the circuits. Communicating between the two halves typically requires an extra clock cycle.) Some of today's circuits have critical-path length that grows at least as fast as .OMEGA.(I.sup.4) and have many components with area and power consumption that grows quadratically .THETA.(I.sup.2). (See [30,32,1].) Increasing issue widths and increasing window sizes are threatening to explode the cycle time of the processor.

2. SUMMARY OF THE INVENTION

We have found that a very useful improvement to the parallel prefix circuit can be made by allowing the prefix operation to "wrap around" from the end back to the beginning. This extension is called Cyclic Segmented Parallel Prefix (CSPP.)

Further, such CSPP circuits are useful to improve the performance of many of the circuits in a superscalar processor. In some situations, it is especially advantageous to use CSPP because it avoids performance penalties in certain situations, for example, when the window fills up or wraps around. This part of the invention also contains a number of other novel circuits that can improve the performance of superscalar processors but that are not prefix circuits.

Further, it is possible to completely reorganize the processor to take advantage of CSPP to do all the scheduling and data movement. We call the resulting processor an Ultrascalar processor.

The circuits used in our invention all grow much more slowly than those of the superscalar processor, with gate delays of O(log I+log W) and wire delays of O(I+W) for memory bandwidth comparable to today's processors. The asymptotic advantage of the Ultrascalar over today's circuits translates to perhaps an order-of-magnitude or more advantage when W is on the order of several hundreds or thousands, a design point advocated by [35].

The Ultrascalar processor, our completely reorganized architecture, breaks the scalability barrier by completely restructuring the microarchitecture of the processor. The Ultascalar turns the processor's datapath into a logarithmic depth network that efficiently passes data from producer instructions to consumer instructions within the reordering window. The network eliminates the need for separate renaming logic, wake-up logic, bypass logic, and multi-ported register files.

3. BRIEF DESCRIPTION OF THE DRAWINGS

These and other objects, features and advantages of our invention will be more readily apparent from the following Detailed Description of our invention in which

FIG. 1 is a schematic representation of a linear-time prefix circuit of the prior art.

FIG. 2 is a schematic representation of a parallel-prefix summation circuit of the prior art.

FIG. 3 is a schematic representation of a parallel-prefix circuit of the prior art.

FIG. 4 is a schematic representation of a treefix module of the prior art.

FIG. 5 is a schematic representation of a linear-time segmented prefix circuit of the prior art.

FIG. 6 is a schematic representation of a segmented parallel prefix circuit of the prior art.

FIG. 7 is a schematic representation of a segmented treefix module of the prior art.

FIGS. 8, 9 and 10 are a schematic representation of a superscalar parallel processor of the prior art.

FIG. 11 is a schematic representation of the buffer of FIG. 9.

FIG. 12 is a schematic representation of an illustrative embodiment of a linear-time cyclic segmented prefix circuit of the present invention.

FIG. 13 is a schematic representation of an illustrative embodiment of a cyclic segmented parallel prefix (CSPP) circuit of the present invention.

FIG. 14 is a schematic representation of an H-tree layout of the CSPP circuit of the present invention.

FIG. 15 is a schematic representation of a first embodiment of a treefix module for use in the invention.

FIG. 16 is a schematic representation of a second embodiment of a treefix module for use in the invention.

FIG. 17 is a schematic representation of a preferred embodiment of a CSPP circuit of the present invention.

FIG. 18 is a schematic representation of a third embodiment of a treefix module for use in the present invention.

FIG. 19 is a schematic representation of another embodiment of a CSPP circuit of the present invention.

FIG. 20 is a schematic representation of an instruction fetch unit.

FIG. 21 is a schematic representation of an instruction fetch unit using trace cache.

FIG. 22 is a schematic representation of an instruction fetch unit using a parallel prefix.

FIG. 23 is a schematic representation of an instruction fetch unit using a parallel prefix and a network.

FIG. 24 is a schematic representation of a pointer jumping trace cache.

FIG. 25 is a schematic representation of a finite state machine.

FIG. 26 is a schematic representation of computing the 9.sup.th state in parallel.

FIG. 27 is a schematic representation of the composition operator.

FIG. 28 is a schematic representation of an implementation of the composition operator.

FIG. 29 is a schematic representation of partial renaming in hardware.

FIG. 30 is a schematic representation of rename logic (part 1).

FIG. 31 is a schematic representation of rename logic (part 2).

FIG. 32 is a schematic representation of quickly find the index of the first "1" bit.

FIG. 33 is a schematic representation of wake-up logic.

FIG. 34 is a schematic representation of a scalable memory system.

FIG. 35 is a schematic representation of a circuit for computing when all previous stores are completed.

FIG. 36 is a schematic representation of a datapath for memory renaming.

FIG. 37 is a schematic representation of an example of invalidating redundant entries.

FIG. 38 is a schematic representation of a circuit that invalidates redundant entries.

FIG. 39 is a schematic representation of a circuit for scheduling several kinds of resources.

FIG. 40 is a schematic representation of the data movement network.

FIG. 41 is a schematic representation of moving data.

FIG. 42 is a schematic representation of a linear-time ring datapath.

FIG. 43 is a schematic representation of how instructions specify a wiring.

FIG. 44 is a schematic representation of an overview of a linear-time datapath.

FIG. 45 is a schematic representation of an execution station for the linear-time datapath.

FIG. 46 is a schematic representation of a switching graph.

FIG. 47 is a schematic representation of a log-depth cyclic datapath.

FIG. 48 is a schematic representation of a log-depth cyclic datapath.

FIG. 49 is a schematic representation of one slice of the logarithmic-depth network.

FIG. 50 is a schematic representation of the circuitry inside each switch slice.

FIG. 51 is a schematic representation of the flip-flop-free datapath converted to an acyclic tree.

FIG. 52 is a schematic representation of the execution station for an acyclic tree datapath.

FIG. 53 is a schematic representation of the cyclic execution station.

FIG. 54 is a schematic representation of the commitment unit.

FIG. 55 is a schematic representation of the overall layout of the ultrascalar.

FIG. 56 is a schematic representation of a VLSI layout.

FIG. 57 is a schematic representation of the optimized datapath.

FIG. 58 is a schematic representation of circuitry that computes the types of values propagating up the tree.

FIG. 59 is a schematic representation of one switch of FIG. 58.

FIG. 60 is a schematic representation of the circuitry that updates the committed register file.

FIG. 61 is a schematic representation of one module of FIG. 60.

FIG. 62 is a schematic representation of the circuitry that propagates register bindings up.

FIG. 63 is a schematic representation of one module of FIG. 62.

FIG. 64 is a schematic representation of the circuitry that propagates register bindings.

FIG. 65 is a schematic representation of one module of FIG. 64.

4. DETAILED DESCRIPTION

4.1 Cyclic Segmented Parallel Prefix

We can define a cyclic segmented prefix circuit by using modulo arithmetic for the indices of x and s. We let ##EQU14##

where k.sub.i =max{k any integer: k<i and s.sub.(k mod n) =1}, and i mod n=min{m.gtoreq.0: (i-m) is a multiple of n.} This allows the values to "wrap around." Thus -4 mod 10=6. Note that k.sub.i may be negative.

Returning to our previous example, if the input to a cyclic parallel prefix is ##EQU15##

A linear-time cyclic segmented prefix circuit 200 is shown in FIG. 12. Circuit 200 comprises a plurality of two-input function generators 205. One input to each function generator is the output of a two-input multiplexer 210, which output is selected by a segment bit. One input to each multiplexer is the identity function. The other input is the output of the preceding function generator. The other input to each function generator is an value. Circuit 200 is similar to the linear-time prefix circuit 50 of FIG. 5 except that circuit 200 is wrapped around to be a ring.

A cyclic segmented parallel-prefix circuit can be implemented by starting with a segmented parallel prefix tree, and modifying it as follows: connect the output at the root of the tree to the input at the root, discarding the segment bit. FIG.
13 shows a cyclic segmented parallel prefix circuit (CSPP) 220 comprising a plurality of treefix modules 225 connected as shown. The individual treefix modules 225 are the same as treefix modules 65 of the segmented acyclic circuit shown in FIGS. 6 and
7. Note that the tree produces some value even if none of the segment bits are equal to one (there is no cycle in the combinational logic), but the value produced is not the indicated by the formula above.

FIG. 14 shows a CSPP circuit 230 laid out in an H-tree layout. The subtree that is drawn on the left in FIG. 13 is at the top of FIG. 14 and the subtree on the right is at the bottom of FIG. 14. FIG. 14 comprises six treefix modules 235, each including two function generator 240 and two multiplexers 245. The leaf modules also include an OR gate 250. The root node of the tree has been optimized to take advantage of the fact that at least one of the segment bits must be set to one.

4.1.1 Variations of CSPP

Just as ordinary prefix circuits can have inclusive and/or backwards prefix operations, so also inclusive and/or backwards prefix operations can be implemented with cyclic segmented parallel-prefix circuits.

It is clear that there are many alternatives for building prefix circuits. One can build a linear-time prefix chain, or one can build a separate tree to compute each output, possibly sharing some common subexpressions, or one can build a single bidirectional tree as described here. One can vary the degree of the tree, so that instead of each node having two children, each node could have three or more children. One could vary the timing methodology. For example, in a self-timed system data traveling along the tree links would carry its own clocking information. Or one could build a globally clocked system. Other timing schemes would work as well. One could vary the technology so that the circuits are implemented in CMOS or Bi-CMOS or other logic families, and the communications links could be implemented with traces on circuit boards, or twisted pair cables, or coaxial cables, or fiber optic cables, or other technologies. Standard circuit optimization techniques can be used to improve the speed of the circuit. For example, in some technologies, NAND gates are faster than any other gate. In such technologies, the circuits using AND and OR gates could be transformed into circuits using only NAND gates. Another example is shown in FIG. 14 which shows a CSPP circuit of the AND function. In that case, the MUX is simplified to an OR gate.

4.1.2 Examples of CSPP

This section describes a number of CSPP circuits that we use in our designs. At block-level, all the CSPP circuits that we are about to describe are identical to each other. The different CSPP circuits only differ in their implementation of the treefix modules. In the following sections, we will specify different instances of these CSPP circuits by defining the circuit's inputs and wire widths.

The first example is the "oldest" CSPP circuit. The "oldest" CSPP circuit passes to each leaf node in the CSPP tree the input value of the oldest node in its segment. This circuit is useful for passing data between producer and consumer instructions, for example. The "oldest" CSPP circuit uses the operator a.times.b=a. FIG. 15 illustrates the circuitry inside each treefix module for an "oldest" CSPP circuit 260. The circuitry comprises first and second multiplexers 270 and an OR gate
275. The inputs to the OR gate are the segment bits from the left and right subtrees. The output of the OR gate is a segment bit that is applied to the next treefix module in the tree. The segment bit from the right subtree also controls the upper multiplexer to determine whether signals from the left subtree or the right subtree are passed to the next module. The segment bit from the left subtree controls the lower multiplexer to detect whether signals from the left subtree or the next module are passed to the right subtree.

Another very useful type of a CSPP circuit is the conjunction CSPP circuit. The conjunction CSPP circuit tells each leaf node in the CSPP tree whether all preceding leaf nodes within its segment have met a certain condition. For example, a conjunction CSPP circuit can tell each instruction within an instruction window whether all preceding instructions have computed. The conjunction CSPP circuit uses the operator a.times.b=a{character pullout}b. FIG. 16 shows the circuitry inside each treefix module for a conjunction CSPP circuit 280. Circuit 280 takes a k-bit input, and performs bitwise logical AND operations on those inputs. This can be viewed as performing k independent AND operations using a single, shared, segment bit. Circuit
280 comprises first and second AND gates 285, first and second multiplexers 290 and an OR gate 295. The inputs to OR gate 295 are the segment bits from the left and right subtrees. The output of the OR gate is a segment bit that is applied to the next treefix module in the tree. The segment bit from the right subtree also controls the upper multiplexer and the segment bit from the left subtree controls the lower multiplexer. Signals from the right subtree are one input to the upper multiplexer and the logical AND of the signals from the left and right subtrees is the other input. Signals from the left subtree are one input to the lower multiplexer and the logical AND of the signals from the next module and from the left subtree is the other input.

Of course, the circuitry within each treefix module can be optimized. For example, FIG. 17 shows an optimized, 8-leaf-node conjunction CSPP 300 with input and output width of 1 (k=1). Circuit 300 comprises six treefix modules 305 each including two-input AND gates 310. One of the two inputs to each AND gate is the output of an OR gate 315. The leaf modules also include an additional OR gate 320 which ORs together the segment bits from the left and right subtrees.

A disjunction CSPP circuit indicates whether any previous node within the segment has met a certain condition. This can be implemented by inverting the x.sub.i inputs and the y.sub.i outputs for the conjunction CSPP circuit, or it can be implemented directly. FIG. 18 shows an unoptimized treefix module 330 for a disjunction CSPP circuit. Of course this module can also be optimized from ##EQU16##

Module 330 comprises first and second OR gates 335, first and second multiplexers 340, and a third OR gate 345. The inputs to OR gate 345 are the segment bits from the left and right subtrees. The output of the OR gate is a segment bit that is applied to the next higher treefix module in the tree. The segment bit from the right subtree also controls the upper multiplexer 340 and the segment bit from the left subtree controls the lower multiplexer 340. Signals from the right subtree are one input to the upper multiplexer and the logical OR of the signals from the left and right subtrees is the other input. Signals from the left subtree are one input to the lower multiplexer and the logical OR of the signals from the next module and from the left subtree is the other input.

Another important type of a CSPP circuit is the summation CSPP circuit. The summation CSPP circuit tells each leaf node the sum of the preceding leaf nodes' inputs within its segment. A summation CSPP can be used, for example, to sum up the number of requests for a resource. A straightforward implementation of a summation CSPP uses a lookahead adder for each .times. operator. A more efficient implementation can be achieved by using carry-save adders. FIG. 19 shows one possible implementation of a treefix module 360 for a summation CSPP circuit. Module 360 comprises first and second pairs of cascaded carry-save adders (CSA) 365, first and second pairs of multiplexers 370 and an OR gate 375. The inputs to OR gate 375 are the segment bits from the left and right subtrees. The output of the OR gate is a segment bit that is applied to the next treefix module in the tree. The segment bit from the right subtree also controls the upper pair of multiplexers and the segment bit from the lower subtree controls the lower pair of multiplexers. Signals from the right subtree are one input to the upper pair of multiplexers and the other input is the output of one of the cascaded pairs of CSA. Signals from the left subtree are one input to the lower pair of multiplexers and the other input is the output of the second pair of cascaded CSAs.

This CSPP circuit represents each sum by two partial sums that add up to the sum. Each pair of CSAs reduces four partial sums to two partial sums. Each CSA takes three n bit inputs and produces two n+1 bit outputs. As partial sums propagate up the tree, they increase in size by one bit at each treefix module. (In some situations, it is known that the output requires at most n bits to represent, in which case the high order bits can be removed from the circuit. An example of such a situation is adding together 32-bit numbers modulo 2.sup.32 to get a 32-bit sum for an address calculation.)

The advantage of using CSA circuitry is that the entire summation of n numbers to produce a k-bit result can be performed with critical-path delay .THETA.(log n+log k) instead of .THETA.(log n.multidot.log k) for adders using a nonredundant representation.

4.2 Parallel Prefix in a Conventional Superscalar Processor

This section describes a number of novel circuits that can improve the performance of traditional superscalar processors. These novel circuits can plug into a traditional superscalar processor, replacing an existing component such as the scheduler. Most of the circuits are instances of CSPP circuits.

4.2.1 Fetch Logic

CSPP can be used to speed up the fetch stage of a conventional superscalar processor. FIG. 20 shows a typical simple fetch stage of the prior art. A fetch unit 400 provides an address to a memory 405. The memory (typically an instruction cache) responds with four instructions, labeled "data from memory." The fetch unit then takes those instructions and provides them to a decode stage (not shown). Meanwhile the fetch unit computes a new program count (PC) based on the branch prediction rules and stores it in a PC register 410. The fetch unit also receives information about how branches actually were taken, and uses that to adjust the program count. The contents of the PC register are then used by the fetch unit to fetch the next block from memory. The problem with this system is that only one taken branch can be handled per clock cycle. As soon as the processor takes a branch, the rest of the data from the cache line is no longer needed.

FIG. 21 shows a prior art instruction fetch unit using a trace cache 420. The unit further comprises a controller 425, a memory 430 and a register 435. In this system register 435 holds the program count (PC) and the branch history (BH). The PC and BH are provided to the trace cache 420, which then fetches eight successive instructions that were executed in the past. Those eight instructions do not necessarily have to have been contiguous in memory. They are simply eight successive instructions that were once executed before when the PC and BH matched the current state. Meanwhile, controller 425 watches how the branches actually are taken, and adjusts the PC and BH. If the controller determines that the trace cache does not have the right data for a particular combination of PC and BH, then the controller fetches four instructions at a time from memory 430 to refill the trace cache. Thus, the trace cache can handle multiple taken branches per clock cycle. However, this trace cache has the problem that only one block can be fetched from memory at a time, when the trace line is being constructed.

FIG. 22 shows an instruction fetch unit that can fetch more than one block at a time. In this model, the controller produces an initial PC. A parallel-prefix summation circuit 450 computes all the subsequent PC's, and fetches the instructions I.sub.0 through I.sub.7 from a multiported memory 455. A trace cache 460 keeps track of whether the instruction is a taken branch or not. If it is an untaken branch, the next instruction will be found at PC.sub.i-1 =PC.sub.i +1. If the branch is taken, an offset o.sub.i is predicted, and the next instruction will be found at PC.sub.i+1 =PC.sub.i +o.sub.i. If parallel-prefix summation circuit 450 is an unsegmented acyclic circuit, it is set up with ##EQU17##

and y.sub.i, is the predicted value of PC.sub.i.

By adding segmentation in circuit 450, we can handle absolute unconditional branches. We use the following: ##EQU18##

FIG. 23 shows an instruction fetch unit similar to the one in FIG. 22 except with a scalable memory system. Summation circuit 450 and cache 460 are the same as the corresponding elements in FIG. 22. In the circuit of FIG. 23 the memory addresses are provided to an interconnection network 470 which routes the requests to various memory modules 475, and the memory modules respond with data which is routed back to the various slots of the fetch units.

There is a remaining problem with the systems of FIGS. 22 and 23. That is that for trace lines of length k, with B bits of branch-prediction history the memory requirements can grow to be .THETA.(k2.sup.B) in the worst case. This is because for each instruction, and each combination of branch-prediction history there may be a unique trace cache entry of k elements. Some recent research indicates that the problem is not bad enough to make the system infeasible, however [46, 38, 34 ], at least for small values of k and B. For larger values of k and B the problem may become worse.

Another issue with this is that we cannot generally compute the address of later instructions until we have fetched previous instructions, since we won't know ahead of time if the instruction is a branch, and we may not know the branch address until even later when the instruction has computed. For conditional branches, we may not know whether the branch is taken until the previous instructions have finished executing. Unless this problem is handled, the processor may serialize often.

Today's processors cope with this problem by using speculation. The idea is to guess whether a particular instruction will be a branch, and if it is, where it branches to. If the guess is wrong, then the work must be undone. If the guess is right, then by the time the guess is verified, the processor needs only to commit its changes, and it needs not recompute them.

A series of not-taken branches can thus be handled in parallel. For not-taken branches, the instructions are stored in sequential memory locations. The instructions can be fetched in a single block, and we use a parallel-prefix circuit to quickly communicate branch-taken information. We would like the circuit to handle correctly predicted taken branches with short circuit delay, however. To solve this we need accurate prediction, we need to communicate the predictions to various components of the processors, and we need a way to use the prediction information to allow us to compute the addresses of the several different non-contiguous parts of the program, in parallel, so that we can fetch them all at once.

The system of FIG. 24 solves this problem. One advantage of this system is the memory requirements grow to be only .THETA.((log k)2.sup.B) a savings of (k/log k). As a matter of practice, it is likely that the savings will be even better than the k log k factor, because fewer branch-prediction bits will be needed to achieve the same success rate, reducing the 2.sup.B component as well. FIG. 24 comprises a trace creator 500, lg k different stages 505 of memory, and a program count register
510. Since in our example k=8, there are three stages, labeled P1, P2, and P3. Each stage is a memory that takes a PC (and a branch-prediction history, which we don't show for simplicity.) Given an address x, we write P1[x] to indicate the value of P1
at index x. We write PC.sub.i to mean the address of the ith instruction in the predicted dynamic instruction sequence after instruction PC. Thus, PC.sub.0 =PC, PC.sub.1 is the address of the first instruction after PC, and PC.sub.4 is the address of the fourth instruction after PC. Note that if the instruction at PC.sub.i is not a branch, then PC.sub.i+1 =PC.sub.i +1. If the instruction at PC.sub.i is a branch, however, then PC.sub.i+1 may be quite different from PC.sub.i. The memory is set up so that P1[x] contains the address of the instruction predicted to follow x, that is x.sub.1 ; and P2[x] contains the predicted address of the 2nd instruction after x, that is x.sub.2 ; and P[4] contains the address of the 4th instruction after x, that is x.sub.4. Now to compute PC.sub.7, we can simply read it out of the memory stages with a logarithmic number of lookups as PC.sub.7 =P1 [P2[P4[PC]]].

Generally to compute PC.sub.i, we write i down as a sum of distinct powers of two: i=2.sup.a +2.sup.b +2.sup.c and then read Pa[Pb[Pc[PC]]].

Note that there are many ways to achieve high bandwidth and minimize contention: Use many banks of memory. Use Ranade's algorithm on a butterfly switch with hashing of memory address [37]. Use the Ultracomputer fetching algorithm [11]. (There is no direct relationship between the Ultracomputer and the Ultrascalar processor.) Build a sorting network which sorts the requests by their address. After the requests are sorted, duplicates can be identified. Only a single request to any given memory is actually sent to the memory. After the request has been filled, then use a segmented parallel prefix operation to distribute the data to the duplicate requesters (see [20,21].)

To fill in a trace cache there are many options: Use static branch prediction (initialize the table once and then change it very seldom.) When a trace is discovered to be wrong, repair only that trace. Other traces that are affected by the incorrectly predicted branches are not necessarily repaired. (For the system of FIG. 24, repairing one trace often causes other traces to be repaired as well, but we do not necessarily need to track down every incorrect trace entry.) For each entry Pi[j] of the table of FIG. 24 one could maintain a list of the entries in level P(i-1) that refer to Pi[j]. When we change Pi[j] we then go and fix up the entries that refer to Pi[j] and then fix the entries that refer to those entries, and so forth. This fixing up can be done by a log-depth pointer jumping as well (see the next section on pointer jumping) with the advantage that only those entries that are changed need to be examined. Perform the pointer-jumping algorithm described below.

Pointer Jumping for Log-Time Conditional Branches

This subsection explains how to handle correctly predicted taken and untaken branches in parallel. Here is a scheme to use pointer jumping to compute the addresses fast. We explain with an example. Suppose we have the following code 0: a=1000
1: c=d+e 2: e=a+1 3: f=a+2 4: a=a-1 5: c=c+f 6: e=e+1 7: if (a>0) goto 3;

We would like to be able to compute the branch prediction very fast for large issue width (let us use I=64 for example.) (One can make this work whether I is a power of two, or not. We chose a power of two for ease of explanation.) To do this we use a technique known as pointer jumping, which has been used in parallel computing [22]. We fill in a table, with one row for each instruction address and a number of columns equal to lg I. (Think of this table as being built into the instruction cache of the processor.) Let us refer to the jth column of the ith instruction entry as T.sub.ij. We fill in one column at a time, from left to right. We initially fill in column 0 with the next instruction prediction using a standard branch-prediction strategy. (In our example, we will predict that every instruction i (i.noteq.7) is followed by instruction i+1. For Instruction 7 let us assume that the branch prediction hardware predicts that the branch is taken to Instruction 3. Our assumption is reasonable since a is initially very large: Most state-of-the art compilers and profilers and branch predictors would be able to tell us that this particular branch is usually taken.) Thus on Step 0 we have ##EQU19##

Now we iterate for lg I steps. On step k we fill in column k by letting T.sub.i,k =T.sub.(Ti,k-1),k-1.

Thus on Step 1 we have ##EQU20##

and on Step 2 we have ##EQU21##

and on Step 3 we have ##EQU22##

and on Step 4 we have ##EQU23##

and after two more steps we have ##EQU24##

It takes lg I steps to compute this table, since there are lg I columns in the table.

Given Table T, we can, in lg I steps, compute the kth successor to any instruction. Suppose we want to know the instruction that is predicted to run k=37 instructions after the first instruction. We first write k down as a sum of powers of two (i.e., write k in binary): 37=2.sup.5 +2.sup.2 +2.sup.0.

Now we take the exponents. 5,2,0 and use them as follows: We find ##EQU25##

That is we first look up Instruction 0's column 5. We find T.sub.0.5 =7. Then we look up Instruction 7's column 2. (7 comes from T.sub.0.5 and 2 is one of the exponents.) We find T.sub.7.2 =6. Next we look up Instructions 6's column 0. We find T.sub.6.0 =7. Now we know that the 37th instruction after Instruction 0 will be Instruction 7 (if the branch prediction is correct.) We could have looked up the exponents in any order. For example, T.sub.0.2 =4, T.sub.4.0 =5, T.sub.5.5 =7, which gives us 7 again.

Computing the Branch Prediction

The above assumes that the processor has a branch prediction for each relevant situation. A typical branch predictor keeps some state bits, and appends them to the current program counter to look up, in a table, whether the instruction is a taken branch. (That is, for each instruction and for each combination of some state bits, there is a prediction for the next instruction.) Most of today's branch-prediction mechanisms run in time linear in the number of branches. This subsection shows how to compute the branch prediction values faster.

Another strategy is to propagate the table from one fetch destination to another through a parallel prefix tree. In the next section, we will show how to propagate such a table when used for memory renaming.

Another way to do this is shown by this example: In one branch predictor used in current processors, there are four bits, which indicate how each of the most recent four branches were resolved in the dynamic instruction sequence. Those four bits are used to index into a table to get the prediction for the next branch. The leftmost bit is shifted out, and the new branch result is shifted in on the right. A pointer-jumping calculation can be used in this case to compute what the branch prediction bits should look like after four branches are correctly predicted. Suppose we have the following 3-bit branch prediction table

index 0 1 2 3 4 5 6 7 prediction 0 0 1 1 1 1 0 0

If we started with branch history 1=001.sub.2, then we predict that the next branch is not taken, so we then get that the next branch history will be 2=010.sub.2 (shifting the zero to the right and throwing away the leftmost bit.) The next prediction is taken, so we get 5=101.sub.2. The next one is taken, so we get 3=011.sub.2. Thus after four instructions a branch history of 001.sub.2 becomes 011.sub.2. To compute this in log-time we first transform the table to a table with a row indicating the next state.

index 0 = 000.sub.2 1 = 001.sub.2 2 = 010.sub.2 3 = 011.sub.2 4 = 100.sub.2 5 = 101.sub.2 6 = 110.sub.2 6 = 110.sub.2 prediction 0 0 1 1 1 1 0 0 next state 0 = 000.sub.2 2 = 010.sub.2 5 = 101.sub.2 7 = 111.sub.2 1 = 001.sub.2 3 = 011.sub.2
4 = 100.sub.2 6 = 110.sub.2

Now we apply the pointer jumping algorithm to compute what will be the 2nd state after each initial state, and then the 4th state after each initial state.

Index 0 = 000.sub.2 1 = 001.sub.2 2 = 010.sub.2 3 = 011.sub.2 4 = 100.sub.2 5 = 101.sub.2 6 = 110.sub.2 6 = 110.sub.2 prediction 0 0 1 1 1 1 0 0 next state 0 = 000.sub.2 2 = 010.sub.2 5 = 101.sub.2 7 = 111.sub.2 1 = 001.sub.2 3 = 011.sub.2
4 = 100.sub.2 6 = 110.sub.2 2.sup.nd next 0 = 000.sub.2 5 = 101.sub.2 3 = 011.sub.2 6 = 110.sub.2 2 = 010.sub.2 7 = 111.sub.2 1 = 001.sub.2 4 = 100.sub.2 4.sup.th next 0 = 000.sub.2 7 = 111.sub.2 6 = 110.sub.2 1 = 001.sub.2 3 = 011.sub.2 4 =
100.sub.2 5 = 101.sub.2 2 = 010.sub.2

And we can read the state off in logarithmic time.

One of the issues for implementing this pointer-jumping algorithm is that sometimes several fetches are made from the same memory location. There are several approaches to implement those fetches efficiently. Build a multiported memory system. This has the disadvantage of not being scalable, but it can work well for small numbers of concurrent reads. Serialize the fetches, effectively ignoring the problem. It may be the case that the congestion happens infrequently enough that it is not a serious performance problem. Use a combining network, such as the one described for the Ultracomputer [11] or Ranade's algorithm [37]. Use a sorting network, followed by a prefix network, followed by a conflict-free routing network as described above.

4.2.2 Decode Logic

This section describes a novel circuit that can compute the next n states of a finite state machine with log n gate delay. The circuit is a prefix circuit and can be used, among other things, to decode instructions. This is useful for architectures where the instruction length is not known in advance. If you have an array of bytes that encodes a variable-length instruction stream, the problem is to identify those bytes that are the first byte of an instruction. (An example of such an instruction-set would be the Intel IA-32"x86" instruction set.)

If we want to decode in linear time, we can build a finite-state machine (FSM) that can determine the first byte of each instruction by considering the bytes one at a time. Microprocessors today use such a finite state machine. The finite state machine can find the beginnings of the instructions in time that is linear in the number of bytes being analyzed.

For example, suppose that the machine language for the processor looks like this: An instruction can be a basic instruction, which is two bytes long: The first byte of a basic instruction can be any number from 2 to 255. The second byte of a basic instruction can be any number. Or an instruction can have a prefix which is the number 1 followed by a basic instruction. Or an instruction can have a prefix which is the number 0 followed by an optional 1, followed by a basic instruction.

FIG. 25 shows a finite state machine for recognizing the beginnings of these instructions. Part (a) shows the state transition diagram, and Part (b) represents the same information as a state transition table. State zero (notated with a double circle) is the inital state, and indicates the beginning of an instruction.

Suppose we are given the following byte sequence to be interpreted as instructions: 0 1 3 0 1 3 0 1.

The beginnings of the instructions will turn out to be Byte numbers 0 (the first byte), 4, and 7.

The finite state machine would determine this in linear time by stepping through the bytes one at a time:

State before seeing byte Input Byte State after seeing byte 0 0 1 1 1 2 2 3 3 3 0 0 0 1 2 2 3 3 3 0 0 0 1 2

All the lines with 0 in the leftmost column are the beginnings of instructions. We can read, for example, that the next byte will not be the first byte of an instruction, because we will be in State 2.

The rest of this section describes how we can compute the leftmost column, and so determine the beginnings of all instructions, in logarithmic time using a parallel prefix circuit. We start by outlining how to compute the nth entry in the leftmost column. FIG. 26 illustrates. It shows how to compute the 9th entry and so determine whether the 9th byte will be a start byte. The top line of the figure is the input stream. Below each input byte is a Level 0 state vector. Below each pair of Level 0 state vectors is a Level 1 state vector. Below each pair of Level 1 state vectors is a Level 2 state vector. Below each pair of Level 2 state vectors is a Level 3 state vector. The Level 0 state vectors show the final state of the finite state-machine (FSM) for each possible initial state, given the particular input. The Level 1 state vectors show the final state of the FSM for each possible initial state given the pair of input bytes above it. (For example, the entry with a ##EQU26##

around it shows the final state of the FSM if it started in State 3, and saw the bytes 0 and then 1. It would end up in State 2.) The Level 2 state vectors show the final state for each initial state, given the four bytes above. The Level 3
vector shows the final state of the machine, given any initial state, and the 8 bytes of input. We can read that if the machine started in State 0, it ends in State 2 (since 2 is the value of the first line in the Level 3 vector (which corresponds to an initial state of 0).) Thus, the next byte, the 9th byte, will not be a starting byte.

We can implement the algorithm of FIG. 26 using a simple (one-directional) parallel-prefix circuit. We number the states in the finite-state machine. The data being passed up the tree is a vector of state numbers. If there are n states in the finite-state machine, we use a vector of n integers (each integer is [lg n ] bits.) Given such a vector x, the ith element is written x.sub.i. The idea is that vector x will tell us for each state where we will end up after parsing a particular sequence of bytes. A machine starting in State i will end up in State x.sub.i. The identity vector is I.sub.i =i. The identity vector is the vector for parsing no bytes. At the i-th leaf of the tree, we input the vector that tells us for each state where we will end up after parsing the i-th input. At each node of the tree, we compose the two incoming state vectors with this associative composition operator: (xxy).sub.i =Y.sub.x.sub..sub.i . That is, the ith entry in the composition is found by taking the ith entry of x and using that as an index into y. FIG. 27 gives an example of the inputs and the output to a node. FIG. 28 shows one possible implementation of the composition operator. At the root of the tree, an additional multiplexer computes z.sub.s where z is the output vector of the root node and s is the initial state of the FSM before the current sequence of instruction bytes. The resulting output, z.sub.s, indicates whether the 9th byte is not or is a starting byte.

To design a circuit that computes the next k states of a finite state machine, given the next k instruction bytes, we simply turn the above one-directional tree into a two-directional parrallel-prefix tree. The inputs to the tree and the associative operator are the same as above. The prefix tree's i-th output vector, at the i-th leaf node, specifies for each initial state where we will end up after parsing the preceding (i-1) bytes.

For the problem of instruction decoding, we can design our decoder so that the initial state at the oldest byte is always the start state. So to determine whether the i-th byte is a start byte, one simply looks at the 0-th element of the prefix tree's i-th output vector.

In the more general case, we may wish to run a finite-state machine continuously, without resetting it to its starting state at the oldest unparsed byte. This can be useful for the general problem of finite-state interpretation at high-speed, and has many applications, including network protocols, parsing regular expressions, and program compilation. For this general problem, we want to compute for each byte what state the machine is in. To do this, we build a parallel prefix circuit. The operator is the composition operator shown in FIGS. 27 and 28. At each leaf of the parallel prefix tree, an additional multiplexer computes z.sub.s where z is the output vector of the parallel prefix circuit at that node and s is the initial state of the FSM at the oldest byte (i.e., before the current sequence of instruction bytes). The resulting output, z.sub.s, at each leaf indicates the byte's state. Thus, we can keep on continually decoding a stream of instruction bytes (or any other FSM input) written into a wrap-around queue by using a cyclic, segmented parallel prefix circuit. The segment bit identifies the oldest byte in the sequence. The initial state of the oldest byte in the sequence, s, xxy=x can be broadcast to all the leaves using an additional CSPP circuit with the operator. The oldest instruction supplies its initial state as data and sets its segment bit high.

4.2.3 Rename Logic

This section describes a novel circuit that merges a wrap-around sequence of partial tables, tables that contain updates to a subset of the table's entries. The circuit can produce all intermediate tables that reflect earlier updates applied in serial order with gate delay logarithmic in the length of the sequence. The circuit can be used as a replacement for the renaming hardware in today's logic. The renaming hardware's job is take logical register names and change them to physical register names, or tags. The renaming hardware must support the following operations: (1) Find the current tag for a logical register, (2) change the current tag for a logical register, (3) checkpoint the current mapping from logical registers to tags, and (4) restore a mapping from a checkpoint. (The checkpoint and restore are only required for speculative execution, but speculative execution is generally viewed as being indispensable in a modern processor.) Furthermore, it must be possible to perform several of these operations in parallel. The results should be the same as if the operations were performed one after another, serially. In the prior art there are two ways to implement the renaming hardware: direct-addressing, and content-addressing.

Direct Addressing Approach

In direct addressing, there is a RAM table, called T, indexed by logical register number. For a typical RISC processor, the table has 32 entries, one for each register in the instruction-set architecture. Each entry contains a tag. To find the tag currently associated with logical register R.sub.i, one does a fetch from the RAM table, looking up j:=T [i] to get the tag. To change the tag for a logical register, one writes the tag to T[i]:=j. To checkpoint, one copies the entire table T to a "shadow table". To restore, one copies the shadow table back to T.

Note that the number of shadow tables, if small, may limit the amount of speculative execution that can be performed. If the number of shadow tables is large, on the other hand, then it will require much VLSI area to implement. In the prior art, to handle multiple find and change operations at the same time requires a memory with multiple ports. The memory of the prior art is slow if it has too many ports [32].

Content Addressing Approach

In content addressing, there is a table U with one entry for each tag. The jth entry contains the name of the logical register that uses that tag. The find problem for logical register i is to find the youngest entry that names i. The entries may be sorted so that the oldest entries are first (e.g., [36]), in which case the last row that names i is the correct row. Or the entries may be kept unsorted, with an additional "age" associated with each row (e.g., [27].) To find the youngest entry that names the logical register i, the logical register number, i, is broadcast to all the table entries. Each table entry has an associated comparator that compares its entry to i. For the entries that match i, then a reduction operation is performed to determine which is the oldest entry. For either approach [36, 27], no circuit is shown for actually performing the reduction, and no suggestion is made that it could be done in faster than linear time. One reference [27] does show a circuit for finding the oldest row that satisfies a certain condition. To change the tag for a logical register, one writes a new entry into the table. To checkpoint a tag requires noting which entries are currently valid, and which are not. For the sorted system, one just remembers where the boundary is between the unwritten entries and the written entries. For the ageing system, one just remembers the current time. To restore, one deletes the entries that were invalid at the time of the checkpoint. For the sorted system, one simply moves the boundary. For the ageing system, one marks as deleted any entry which has an age indicating that it was created after the time of the checkpoint.

The reuse of entries within the U table introduces additional complexity. For the sorted-by-age systems in the prior art, the system must eventually "retire" renaming entries, writing their register values to a canonical register file. Also, the system must stop processing instructions when it reaches the bottom of the renaming table, and then after retiring all instructions, it can start reusing the top again. Another approach is followed by [8], which keeps the table rows sorted (oldest at the top), but compacts the table entries upward when old entries are no longer needed. Many of the circuits described in [8] are slow as the issue width increases. For example, the compacting circuit must be capable of compacting up to n holes per clock cycle. The compacting circuit's delay is at least O(n).

In addition to those problems, the prior art also has the problem that a special circuit is required to perform dependency checks and renaming between the instructions that are being renamed in the same clock cycle. If two instructions are being renamed in the same clock cycle, then the second one will not see the the first one's entry in the renaming table. Special dependency checking hardware is used to resolve this situation. Such dependency checking hardware is quadratic-time in the prior art [32].

Our Approach

We show a different implementation of renaming logic. Our implementation uses "partial renaming maps". A partial renaming map renames only some registers, not all of them. First we explain the concept using a mathematical notation, then we show circuitry to implement partial renaming maps. Here is the notation: Suppose that we have a sequence of instructions as follows:

Instruction Renaming R.sub.1 = R.sub.2 /R.sub.3 {1.fwdarw.42} R.sub.3 = R.sub.4 + R.sub.1 {3.fwdarw.43} R.sub.1 = R.sub.0 + R.sub.2 {1.fwdarw.44} R.sub.4 = R.sub.1 + R.sub.2 {4.fwdarw.45}

Here we have shown the tag that each instruction's destination register is renamed to. (For example, Register 1 is renamed to Tag 42.) The effect of the entire sequence can be written as {1.fwdarw.44,3.fwdarw.43,4.fwdarw.45}.

Note that we did not include the effect of the first renaming because after renaming the sequence it is no longer important. We can compose two renamings as follows: Suppose we have the renaming from the first pair as {1.fwdarw.42,3.fwdarw.43}, and the renaming of the second pair as {1.fwdarw.44,4.fwdarw.45}.

The composed renaming is {1.fwdarw.42,3.fwdarw.43}.circle-w/ dot.{1.fwdarw.44,4.fwdarw.45}={1.fwdarw.44,3.fwdarw.43,4.fwdarw.45}.

The rule for composing two renamings A.circle-w/dot.B=C is that renaming x.fwdarw.y is in C if x.fwdarw.y is in B or (x.fwdarw.y is in A and there is no renaming of x in B.)

FIG. 29 shows how to implement a partial renaming in hardware. For each partial renaming, one keeps a table T of renamings, with an additional bit vector V indicating which of the table entries are valid. The top of the figure shows the hardware representation of three renamings. Renaming A (left) is combined with Renaming B (center) to get Renaming C (right). Each renaming consists of a bit vector V indicating which entries are valid, and a table T of entries. We show only the valid entries in this example. At the bottom is shown the circuit which computes a row of C given a corresponding row of A and B. Note that the rows are independent. The circuit comprises on OR gate 550 and a multiplexer 555. The inputs to the OR gate are the values of V in a given row in A and B. The inputs to the multiplexer are the values of T in the same row in A and B. The value of V from Renaming B is used to select the output of the multiplexer.

Suppose we wish to rename a large number of instructions. We start with a complete renaming, A, that corresponds to the renaming state before renaming any instruction. (A complete renaming is a partial renaming in which all the V bits are 1.) We then create a partial renaming for each of the instructions (note that each instruction typically renames only one register.) Call those partial renamings B.sub.0, . . . B.sub.n-1. We then would like to compute the effect, for each instruction, of all previous renamings. That is, for instruction i we would like to compute C.sub.i =A.circle-w/dot.B.sub.0.circle-w/dot.B.sub.1.circle-w/dot.. . . .circle-w/dot.B.sub.i-1.

We can do this in linear time simply by attaching N of the partial renaming circuits of FIG. 29.

FIG. 30 and FIG. 31 describe circuitry that implements our renaming approach in logarithmic gate delay as opposed to linear gate delay. Specifically, the figures show renaming logic that renames a sequence of four instructions and uses two logical registers. The sequence of instructions wraps around, with the oldest instruction identified by its signal Oldest being high. The circuitry comprises two components. FIG. 30 shows the circuitry that computes the tags for all four instructions' partial renaming tables. FIG. 31 shows the circuitry that computes the valid bits for all four instructions' partial renaming tables. Each figure shows two CSPP circuits 570 and 580, one for each of the two logical registers. For a machine with 32
registers, 32 CSPP circuits would be needed.

In addition to its Oldest bit, each instruction sends into the circuitry a tag, TagIn, and an array of valid bits, ValidIn.sub.i, one for each logical register. If ValidIn.sub.j is high, then the instruction has renamed register R.sub.j to TagIn. The circuitry returns to each instruction an array of tags, TagOut.sub.i, and an array of valid bits, ValidOut.sub.i, one for each logical register. If ValidOut.sub.j is high, then TagOut.sub.i contains the tag to which logical register R.sub.j is currently renamed. If ValidOut.sub.j is low, then the preceding instructions in the sequence have not renamed register R.sub.j.

The circuitry in both FIG. 30 and FIG. 31 include CSPP circuits. FIG. 30 uses one "oldest" CSPP circuit 570 for each logical register. The "oldest" CSPP circuit is described in FIG. 15 and FIG. 13. FIG. 31 uses one disjunction CSPP circuit 580
for each logical register. The disjunction CSPP circuit is described in FIG. 18 and FIG. 13. The circuitry in FIG. 30 and FIG. 31 can be easily extended to rename more instructions or to use more logical registers. For each new logical register, we add one more CSPP circuit to each figure. For each additional instruction, we add one more leaf node to every CSPP circuit in the two figures. The oldest instruction provides a complete renaming table, providing all ValidIns to be true and all TagIns to be the respective tags from the previous instruction execution.

The advantages of this circuitry are that it runs in logarithmic time, it requires no compacting, and it keeps instructions sorted in chronological order (which can help with certain other aspects of the circuit design.) Checkpointing and restoring are straightforward, and require no additional shadow storage. To restore renamings on a misprediction, we simply label the mispredicted branch instruction to be the newest, effectively moving the pointer to the newest instruction. Furthermore, because we designed this circuit as a cyclic circuit, when we get to the bottom of the renaming table, we can simply start using entries from the top of the table without interruption. Also no special additional dependency check needs to be done, since the correct renaming happens in a single clock cycle. Also, when an instruction commits, its result tag does not need to be written to a canonical renaming table, which saves additional complexity.

Reusing Tags

Here we describe several novel circuits that, given an array of bits, compute the binary (or unary) indexes of the first N entries in the array that are high. One of the applications of these circuits can be the reuse of tags in a superscalar processor.

On every clock cycle, a superscalar processor allocates a tag to the result register of every instruction entering the rename stage. Different superscalar implementations allocate tags differently. For example, some processors assign tags in a forward or wrap-around sequence. That is, the first instruction uses Tag 0, the next uses Tag 1, until all the tags are used up, by which time some of the early tags have retired, and we go back and start using Tags 0 and 1 again. Often, the tag number is the same as the number of the reservation station. The allocation of tags in such a system is the same as the allocation of reservation stations in the reordering buffer. The disadvantage of such a scheme is that the tags and the buffer entries are allocated together, and the tags cannot be reused without clearing out the buffer entries (e.g., by writing them to a canonical register file.) Such a write has typically been performed by a multiported register file which has poor scaling properties.

In other processors, tags are allocated out of a pool of tags. One advantage of this approach is that a separate logical register file may not be necessary. The problem for this system is to allocate tags quickly. On every clock cycle, the processor must identify I tags to reuse, where I is the number of instructions to be renamed per clock cycle. In this problem, there is a collection of tags that are free, and we need to find the names of several of those. Our input is a bit vector x indicating which tags are free. x.sub.i is one if tag i is free, and zero if tag i is not free. The problem then is to determine the indexes of the first N bits that are one in a vector of bits. Finding the indices of the first N true bits has other applications, such as ALU scheduling, as we shall see below.

To find the first bit that is one, we ask "what is the maximum i such that x.sub.i =1?" FIG. 32 shows how to compute that maximum i. The circuit is shown for a bit vector of length 16. The bit vector comes from the top, and out of the bottom comes a signal indicating that some bit is 1 (the "free exists?" signal) along with 4 bits indicating the number of the highest-numbered one. The 16 bits of the input bit vector are applied in pairs to eight two-input OR gates 600. The outputs of the eight OR gates 600 are applied in pairs to four two-input OR gates 605. In turn, the outputs of the OR gates 605 are applied in pairs to two two-input OR gates 610 and the outputs of OR gates 610 are applied to two-input OR gate 615. Every other bit of the input bit vector is also applied to a two-input multiplexer (MUX) 620. The outputs of MUXes 620 are applied in pairs to two input MUXes 625 and the output of MUXes 625 are applied to MUX 630. The output of every other OR gate 600 is also applied in pairs to two-input MUXes 635 and the output of MUXes 635 is applied to MUX 640. The outputs of every other OR gate 605 are also applied to two-input MUX 645. The MUXes are controlled by the output of various OR gates as shown in FIG. 37. In particular, the four MUXes 620 are controlled by the output of every other one of the eight OR gates 600. MUXes 625 and 635 are controlled by the output of every other one of the four OR gates 605. And MUXes 630, 640 and 645 are controlled by one of the two OR gates 610.

The underlying structure of this circuit will be apparent. At a first stage, each four bits of the input bit vector are processed by two OR gates 600, one OR gate 605 and a MUX 620 to produce a set of three outputs. In the next stage, each set of three outputs is processed by OR gate 610 and MUXes 625 and 635 with another set of three outputs to produce a set of four outputs. And in the final state, the two sets of four outputs are processed by OR gate 615 and MUXes 630, 640, 645 to produce the five outputs of the circuit.

To find the one bit of the 8-bit system requires building two circuits to find the bitnum of 4 bits. The two subcircuits are combined with some MUXes and an OR gate. If there is a bit in the left batch of 4, the MUXes select the low order bits of the bit number from the left side, otherwise the low order bits come from the right side. It will be clear to the reader how to extend this system to handle any number N of tags with gate-delays that are only O(log N).

Prior art for this problem includes [9] and [12]. Both perform an operation similar to the circuit of FIG. 32. In both of those circuits the output is encoded in unary, where in our circuit the result is output in binary. The circuits shown here are also faster and smaller.

Another variation works as follows: We use a parallel-prefix circuit with summation (e.g., see FIG. 2). The inputs provided are the free bit vector x indicating which tags are free (encoded as a one) and which are busy (encoded as a zero.) The outputs indicate for each bit, how many bits to the left of that bit are true. This number thus provides each bit with a unique nonnegative number in a contiguous range starting at zero. Let us call the output the enumeration index. Now each bit has an address, which is its index in the set of all bits, and an enumeration index which is the index in the set of all one bits. Each one-bit then sends its own address through a butterfly switch to the destination specified by the enumeration index. It is known that every such routing is conflict free in a butterfly switch [17]. Now destination zero has received the address of the first one bit, and destination one has received the address of the second one bit and so forth. An added innovation is to use a cyclic, segmented parallel-prefix circuit, with the segment bit indicating which bit is the oldest in a circular list. The result is that destination zero gets the address of the first free bit, in the cyclic ordering starting from the oldest, and destination one gets the next free bit, and so on.

Related prior art for the above circuit includes an enumeration algorithm that was described in a software context by Christman [4]. Our system is different from [4] in the following ways: (a) Our circuits are used in hardware in a microprocessor, whereas Christman's was used in software on a parallel computer. (b) We also have a cyclic version of the rendezvous.

Our enumeration algorithm is also superior to the ones of [12, 9] in that we can find an arbitrary number of first bits. [12] can only find the first 4 of 24 stations. [9] can only find the first one or two values. Also our system provides the results in binary or unary form (the address can be sent through the switch using binary or unary formatting), instead of only unary form. Also our system provides a cyclic allocator, whereas the systems in the prior art can only allocate from "left-to-right." The prior art machines have either (a) compacted the instructions in the window (which can be expensive in VLSI area and critical-path delay) [8], or (b) effectively drained the instruction window when it fills up (which hurts performance), or (c) allocated the leftmost bit even when the oldest instruction is in the middle of the window (which, e.g. when used in a scheduler, can cause newer instructions to run before older ones, which delays retiring instructions, which can hurt performance.)

Instead of using a butterfly switch to route the free indices to their respective destinations, we can take advantage of a unary encoding. In this case, after computing the enumeration, each one bit performs a write of a one into a two dimensional array of bits. The row number is the address of the one-bit, the column number is the enumeration index of the bit. Note that no two one bits could try to write into the same row, so no routing circuit is needed. All that is required is an address decoder, one per row, to write a one to the appropriate bit in the row according to the enumeration index. At this point destination j reads column j and retrieves a unary representation of the name of the newly allocated tag. Sometimes this circuit is more efficient than the butterfly circuit mentioned above. For example, if destination j needs to control a MUX to read tag i, having i in unary form can be faster because the MUX selectors can be driven directly instead of having to decode i first.

4.2.4 Register Dependence Analysis

This section describes a novel circuit that passes information about the validity of table entries through a wrap-around sequence of operations, where each operation can invalidate, validate, or leave unchanged each table entry. The circuit can be used to implement register dependence analysis, also known as wake-up logic.

Recall that in the dependence analysis stage, instructions reside in the reordering buffer. The role of the dependence analysis stage is to identify those instructions that are ready to run. An instruction is ready to run when all of its necessary register, memory, and control dependencies have resolved. (Not all instructions must wait for memory and control dependencies to resolve.) In this section, we describe logarithmic, delay circuits that identify instructions whose register dependencies have been met. In the following two sections, we describe circuits that identify instructions whose memory and control dependencies have been met.

The circuit that identifies when register dependencies have been met is the wake-up logic. The wake-up logic informs an instruction whenever one of its argument registers is computed. Typically the wake-up logic is implemented using associative memory. For every instruction in the reordering buffer, the wake-up logic compares the tags of its arguments against the tag of every broadcasted result. An instruction is awakened once all of its arguments have been broadcasted.

Our approach to implementing wake-up logic is based on prefix circuits and eliminates associative memory. The implementation is analogous to our implementation of renaming logic in Section 4.2.3. We use prefix circuits to compute information about each instruction's logical register file. Using prefix circuits, each instruction informs its successors which register it modifies and whether it has already computed the register's value. The prefix circuits in turn present each instruction with a bit vector indicating which of its incoming register values have already been computed. If the registers that an instruction uses (its "argument registers") have been computed already, the instruction is awakened.

FIG. 33 describes the circuitry that implements our wake-up logic in logarithmic gate delay using parallel prefix circuits 660, 665. Specifically, the figure shows wake-up logic that operates on a four-instruction reordering buffer and uses a two-register logical register file (L=2). (We use a small register file for ease of explanation. It is clear how to make the register file larger.) The sequence of instructions in the reordering buffer wraps around, with the oldest instruction identified by its signal Oldest being high. In addition to its Oldest bit, each instruction sends into the circuitry an L-wide array of modified bits, Modified.sub.i, and, in general, an L-wide array of valid bits, ValidIn.sub.i. For each instruction, ##EQU27##

Note that in FIG. 33 we assume that each instruction modifies at most one register R.sub.j and so all the ValidIn signals for a given instruction, ValidIn.sub.i through ValidIn.sub.(L-1), can be set to ValidIn.sub.j. The circuitry returns to each instruction an L-wide array of valid bits, ValidOut.sub.i. If an instruction's ValidOut.sub.i is high, then the instruction's register R.sub.j has been computed.

FIG. 33 uses one "oldest" CSPP circuit for each logical register. The "oldest" CSPP circuit is described in FIG. 15 and FIG. 13. The circuitry in FIG. 33 extends easily to wake up more instructions or to use more logical registers. For each new logical register, we add one more CSPP circuit. For each additional instruction, we add one more leaf node to every CSPP circuit. In mathematical notation, for each logical register R.sub.j, we have a CSPP with

where i is the instruction number.

The circuitry we have described so far tells instructions when their arguments become available, but it does not tell them the argument values. There are a number of ways to pass argument values, once an instruction knows when its arguments become available. For instance, we can completely eliminate broadcasting by storing each result directly, according to its tag. In this implementation, computed results are stored directly. Consumer instructions are notified that the result has been computed via prefix circuits in FIG. 33. Once notified, consumer instructions read the result value directly, according to its tag. An alternate approach to passing argument values retains broadcasting. When an instruction is informed by a prefix circuit that its argument has just become available, it latches the broadcasted value. It there are multiple broadcast buses, the prefix circuits can compute which bus to latch from by passing bus identifiers in addition to valid bits along the prefix circuits in FIG. 33. Finally, a third approach passes argument values along the prefix circuits together with the valid bits. This is the approach taken in our Ultrascalar processor, described in Section 4.3.

4.2.5 Memory Dependence Analysis

In order to exploit instruction level parallelism (ILP), the memory bandwidth must also scale. In particular, the processor must be able to issue more loads and stores per clock cycle (i.e. sustain a higher data bandwidth) and the fetch unit must supply more instructions along a correctly predicted program path per clock cycle (i.e. sustain a higher effective instruction bandwidth.)

To accommodate more loads and stores per clock cycle without scaling up the number of data cache read and write ports, we can resort to the well known mechanism of interleaving. FIG. 34 illustrates an interleaved data memory subsystem. The memory subsystem comprises an on-chip level-one cache 670 and an on-chip butterfly network 675 connecting the cache to the execution stations ES. Much like the main memory in traditional supercomputers [39], the cache is interleaved among a number of banks CB. In this example, the number of cache banks CB is the same as the number of execution stations ES. In particular, the system shown has 16 execution stations ES, 16 cache banks CB and a butterfly network of four tiers of eight two-input, two-output switches 677 connected as shown. The nth cache bank holds every nth cache line. (To illustrate, let us assume that the number of banks and execution stations is sixteen, that the cache is direct mapped with block size of one word and total size of 1 MB, and that the instruction set architecture uses 32-bit byte addresses. A memory access to address A will then be routed to bank A[5-2]. Bank A[5-2] will find entry A[19-6] and compare its tag against address bits A[31-20]. Should the comparison fail, the level-one cache will access off-chip memory.) The cache banks are connected to the execution stations via a butterfly network and to off-chip memory (not shown) directly. The butterfly network allows n load and store requests to proceed in parallel if they form a rotation or other conflict-free routing. For example, a vector fetch loop will run without any conflicts if the address increment equals the cache block size. But in general, multiple memory accesses may compete for the same bank, or they may suffer from contention in the butterfly, thus lowering the memory bandwidth. The same programming, compiler, and hardware techniques used to alleviate bank conflicts in an interleaved main memory will apply to an interleaved data cache. (For example, making array lengths be relatively prime to each other can reduce the conflicts to a banked memory system.)

Note that the circuitry of a scalable memory system can be used to fetch both instructions and data from memory to the processor.

Another promising approach to increasing data memory bandwidth is to duplicate the data cache. We could allocate a copy of the cache to different subtrees of the Ultrascalar datapath or, in the extreme, to every execution station. Duplicate caches must be kept coherent using, for example, a scalable SMP cache-coherence protocol.

Given that there are several ways to provide memory bandwidth, there is still the problem of coping with memory dependencies.

This section describes several novel circuits that enforce or eliminate memory dependencies in a wrap-around sequence of instructions. The simplest circuit detects when all preceding instructions in a wrap-around sequence of instructions have finished writing memory. An instruction in the sequence is free to read memory once all preceding instructions have finished writing memory. The more complex circuits pass memory addresses and values from producer to consumer instructions. They do so by propagating information about all preceding memory writes to every instruction. The circuits vary in complexity based on whether they append information about subsequent writes or sort the information by address, whether they compact the information, and whether they eliminate older writes to the same address from the information.

In addition to resolving register dependencies, memory instructions must also resolve their memory address dependencies before they access memory. For example, if there are several load instructions between two store instructions in the reordering buffer, then the loads can run in parallel. But the loads must often wait for at least some previous store operations to complete to get the right data. (Note that if we speculate on the data then the loads need not necessarily wait for the previous stores. Below, we discuss branch speculation, which has similar issues to data speculation.) We first present the design for a conservative strategy, which is to wait for all previous store instructions to complete, then we show how to relax this constraint to exploit more parallelism.

Conservative Approach

Our conservative approach requires only one cyclic, segmented conjunction prefix. FIG. 16 and FIG. 13 described a conjunction CSPP. The number of leaf nodes in the CSPP tree, n, is equal to the size of the reordering buffer. The CSPP's inputs are: ##EQU28##

The prefix circuit computes for each instruction whether all previous writes have completed.

Memory Renaming

Some of today's processors perform renaming on memory locations. This allows loads and stores to execute out-of-order and in parallel in the case where the memory addresses are known. In this situation, a read must wait until the memory addresses of all the previous writes are known and all writes to the same address have completed. (We have heard that the Pentium Pro memory subsystem uses memory renaming to allow out-of-order memory operations. In the Pentium Pro scheme, some small number of memory operations are queued on-processor. The off-chip memory operations are performed serially. The on-chip queue will allow reads to complete out-of-order, even if unrelated prior writes have not been committed to memory.)

Memory-renaming has three sources of delays: (1) The addresses of loads and stores are generally not known right away. (2) The data is often not ready. (3) The circuitry for moving data to a load from the correct previous store may introduce delays. This section shows how to keep the circuit delay down to logarithmic cost, and to minimize the impact of the other delays. The trick is to determine when there are dependencies between instructions as soon as possible after an address becomes known, and to pass the data between the instructions as soon as the data becomes available.

Our novel implementation of memory renaming passes a partial memory-renaming table along a prefix circuit called the memory renaming datapath. The approach is comparable to our partial register-renaming maps. Recall that, to rename registers, we passed information about the current renaming of every logical register throu