United States Patent
5832297
Ramagopal et al.
November 3, 1998
Title
Superscalar microprocessor load/store unit employing a unified buffer and separate pointers for load and store operations
Abstract
A load/store buffer is provided which allows both load memory operations and store memory operations to be stored within it. Because each storage location may contain either a load or a store memory operation, the number of available storage locations for load memory operations is maximally the number of storage locations in the entire buffer. Similarly, the number of available storage locations for store memory operations is maximally the number of storage locations in the entire buffer. This invention improves use of silicon area for load and store buffers by implementing, in a smaller area, a performance-equivalent alternative to the separate load and store buffer approach previously used in many superscalar microprocessors.
Inventors:
Ramagopal; H. S. (Austin, TX), Tran; Thang M. (Austin, TX), Pickett; James K. (Austin, TX)
Assignee:
Advanced Micro Devices, Inc. (Sunnyvale, CA)
Appl. No.:
968308
Filed:
November 12, 1997
Current U.S. Class:
710/5; 710/56; 710/6; 712/217; 712/23
Current International Class:
G06F 9/38 (20060101)
Field of Search:
395/825,826,876,393,800.23
U.S. Patent Documents
4044338  August 1977  Wolf
4453212  June 1984  Gaither et al.
4722049  January 1988  Lahti
4807115  February 1989  Torng
4858105  August 1989  Kuriyama et al.
5226126  July 1993  McFarland et al.
5226130  July 1993  Favor et al.
5251306  October 1993  Tran
5455924  October 1995  Shenoy et al.
5487156  January 1996  Popescu et al.
5524263  June 1996  Griffth et al.
5621896  April 1997  Burgess et al.
5655098  August 1997  Witt et al.
5694553  December 1997  Abramson et al.
5745729  April 1998  Greenley et al.
Foreign Patent Documents
0 391 517 A2  Oct., 1990  EP
0 436 092 A2  Jul., 1991  EP
0259095  Mar., 1988  EP
0381471  Aug., 1990  EP
0459232  Dec., 1991  EP
2263985  Aug., 1993  GB
2263987  Aug., 1993  GB
2281422  Mar., 1995  GB
Other References
Johnson, M., "Superscalar Microprocessor Design," 1991, Prentice Hall, New Jersey, US, XP002027574, pp. 99-170.
Yeh, D. Yun et al., "Dynamic Initial Allocation and Local Reallocation Procedures for Multiple Stacks," Communications of the Association for Computing Machinery, vol. 29, No. 2, Feb. 1986, New York, US, XP002027605, pp. 134-141.
Korsh, J.F. et al., "A Multiple-Stack Manipulation Procedure," Communications of the Association for Computing Machinery, vol. 26, No. 11, Nov. 1983, New York, US, XP002027606, pp. 921-923.
International Search Report for PCT/US 96/11843 dated Apr. 4, 1997.
Intel, "Chapter 2: Microprocessor Architecture Overview," pp. 2-1 through 2-4, Sep. 1995.
Michael Slater, "AMD's K5 Designed to Outrun Pentium," Microprocessor Report, vol. 8, No. 14, Oct. 24, 1994, 7 pages.
Sebastian Rupley and John Clyman, "P6: The Next Step?," PC Magazine, Sep. 12, 1995, 16 pages.
Tom R. Halfhill, "AMD K6 Takes On Intel P6," BYTE, Jan. 1996, 4 pages.
Primary Examiner:
Kim; Kenneth S.
Attorney, Agent or Firm:
Conley, Rose & Tayon, PC Kivlin; B. Noel
Parent Case Text
This application is a continuation of application Ser. No. 08/420,747, filed Apr. 12, 1995 now abandoned.
Claims
What is claimed is:
1. A load/store unit for a superscalar microprocessor comprising:
a buffer including a plurality of storage locations configured to store information regarding pending memory operations wherein said buffer further includes an input port configured to receive said information, and wherein said buffer further includes a data cache port configured to communicate data access commands to a data cache, and wherein said plurality of storage locations are configured such that a load memory operation may be stored in one of said plurality of storage locations in a given clock cycle and a store memory operation may be stored in said one of said plurality of storage location in a different clock cycle;
an input control unit coupled to said buffer, wherein said input control unit is configured to direct the transfer of said information from said input port to a particular storage location within said buffer, wherein said input control unit includes a load pointer configured to selectively direct the storage of load memory operations into various ones of said plurality of storage locations and a store pointer configured to selectively direct the storage of store memory operations into additional ones of said plurality of storage locations, wherein said load pointer and said store pointer are configured to point to different storage locations simultaneously; and
an output control unit coupled to said buffer, wherein said output control unit is configured to select a memory operation stored within said plurality of storage locations within said buffer, and wherein said output control unit is further configured to direct a data access command associated with said operation to said data cache.
2. The load/store unit as recited in claim 1 wherein said buffer is configured as a linear array of storage locations for memory operations.
3. The load/store unit as recited in claim 1 wherein said load pointer is changed in response to a storage of a new load memory operation into said buffer, while said store pointer is held constant.
4. The load/store unit as recited in claim 3 wherein said load pointer advances from one end of said buffer and said store pointer advances from the opposite end of said buffer.
5. The load/store unit as recited in claim 1 wherein said output control unit is configured to select said memory operation from said buffer according to a fixed priority scheme.
6. The load/store unit as recited in claim 1, wherein said output control unit is configured to select said memory operation from said buffer according to a scheme wherein:
store memory operations that are not speculative are given a high priority;
memory operations that are not speculative and are known to miss said data cache via previous access to said data cache are given an intermediate priority; and
load memory operations that have not previously accessed said data cache are given a low priority.
7. The load/store unit as recited in claim 6, wherein said output control unit is configured to receive information regarding the speculative state of said memory operation via a reorder buffer pointer provided by a reorder buffer.
8. A method for operating a load/store buffer of a load/store unit including a plurality of storage locations comprising:
maintaining a load pointer value for selectively controlling locations of said buffer to which load memory operations are stored;
maintaining a store pointer value for selectively controlling locations of said buffer to which store memory operations are stored;
storing a store memory operation into one of said plurality of storage locations;
modifying said store pointer in response to storing said store memory operation while holding said load pointer constant;
removing said store memory operation from said one of said plurality of storage locations;
storing a load memory operation into said one of said plurality of storage locations; and
modifying said load pointer in response to storing said load memory operation while holding said store pointer constant.
9. The method as recited in claim 8 wherein said storing a store step is performed, then said removing step is performed, then said storing a load step is performed.
10. The method as recited in claim 8 wherein said storing a load step is performed, then said removing step is performed, then said storing a store step is performed.
11. The method as recited in claim 8 wherein said removing further comprises removing store memory operations from said buffer according to their speculative state as indicated by a reorder buffer pointer provided by a reorder buffer.
12. The method as recited in claim 8 wherein said removing further comprises removing load memory operations from said buffer based on their speculative state as indicated by a reorder buffer pointer provided by a reorder buffer.
13. The method as recited in claim 8 wherein said removing further comprises removing load memory operations from a buffer if said load memory operations are known to hit said data cache.
14. A load/store unit for a superscalar microprocessor comprising:
a buffer including a plurality of storage locations configured to store information regarding pending memory operations wherein said buffer further includes an input port configured to receive said information, and wherein said buffer further includes a data cache port configured to communicate data access commands to a data cache, and wherein said plurality of storage locations are configured such that a load memory operation may be stored in one of said plurality of storage locations in a given clock cycle and a store memory operation may be stored in said one of said plurality of storage location in a different clock cycle;
an input control unit coupled to said buffer, wherein said input control unit is configured to direct the transfer of said information from said input port to a particular storage location within said buffer, wherein said input control unit further comprises a load pointer and a store pointer, and wherein said load pointer is configured to direct a given load memory operation received in a particular clock cycle to a selected one of said plurality of storage locations, and wherein said store pointer is configured to direct a given store memory operation received in said particular clock cycle to another of said plurality of storage locations, and wherein said load pointer advances from one end of said buffer and said store pointer advances from the opposite end of said buffer; and
an output control unit coupled to said buffer, wherein said output control unit is configured to select a memory operation stored within said plurality of storage locations within said buffer, and wherein said output control unit is further configured to direct a data access command associated with said operation to said data cache.
15. The load/store unit as recited in claim 14 wherein said buffer is configured as a linear array of storage locations for memory operations.
16. The load/store unit as recited in claim 14 wherein said output control unit is configured to select said memory operation from said buffer according to a fixed priority scheme.
17. The load/store unit as recited in claim 14, wherein said output control unit is configured to select said memory operation from said buffer according to a scheme wherein:
store memory operations that are not speculative are given a high priority;
memory operations that are not speculative and are known to miss said data cache via previous access to said data cache are given an intermediate priority; and
load memory operations that have not previously accessed said data cache are given a low priority.
18. The load/store unit as recited in claim 17, wherein said output control unit is configured to receive information regarding the speculative state of said memory operation via a reorder buffer pointer provided by a reorder buffer.
19. A load/store unit for a superscalar microprocessor comprising:
a buffer including a plurality of storage locations configured to store information regarding pending memory operations wherein said buffer further includes an input port configured to receive said information, and wherein said buffer further includes a data cache port configured to communicate data access commands to a data cache, and wherein said plurality of storage locations are configured such that a load memory operation may be stored in one of said plurality of storage locations in a given clock cycle and a store memory operation may be stored in said one of said plurality of storage location in a different clock cycle;
an input control unit coupled to said buffer, wherein said input control unit is configured to direct the transfer of said information from said input port to a particular storage location within said buffer; and
an output control unit coupled to said buffer, wherein said output control unit is configured to select a memory operation stored within said plurality of storage locations within said buffer, and wherein said output control unit is further configured to direct a data access command associated with said operation to said data cache, and wherein said output control unit is configured to select said memory operation from said buffer according to a scheme wherein:
store memory operations that are not speculative are given a high priority;
memory operations that are not speculative and are known to miss said data cache via previous access to said data cache are given an intermediate priority; and
load memory operations that have not previously accessed said data cache are given a low priority.
20. The load/store unit as recited in claim 19 wherein said buffer is configured as a linear array of storage locations for memory operations.
21. The load/store unit as recited in claim 19 wherein said input control unit further comprises a load pointer and a store pointer, and wherein said load pointer is configured to direct said load memory operation received in a given clock cycle to one of said plurality of storage locations, and wherein said store pointer is configured to direct said store memory operation received in a given clock cycle to another of said plurality of storage locations.
22. The load/store unit as recited in claim 21 wherein said load pointer advances from one end of said buffer and said store pointer advances from the opposite end of said buffer.
23. The load/store unit as recited in claim 19, wherein said output control unit is configured to receive information regarding the speculative state of said memory operation via a reorder buffer pointer provided by a reorder buffer.
24. A load/store unit for a microprocessor comprising:
a buffer including a plurality of storage locations configured to store pending memory operations, wherein said buffer further includes an input port configured to receive said operations, wherein said plurality of storage locations are configured such that a pending load operation may be stored in a particular storage locations at a first time and a pending store operation may be stored in said particular storage location at a second time, and wherein said buffer is configured to communicate with a data cache;
an input control unit coupled to said buffer, wherein said input control unit is configured to direct the transfer of said operations from said input port to particular storage locations within said buffer, wherein said input control unit includes a load pointer configured to direct the storage of pending load memory operations into particular storage locations and a store pointer configured to direct the storage of pending store memory operations into particular storage locations, wherein said load pointer and said store pointer are configured to point to different storage locations simultaneously; and
an output control unit coupled to said buffer, wherein said output control unit is configured to select a particular pending memory operation stored within said plurality of storage locations, and wherein said output control unit is further configured to direct a data access command associated with said particular pending memory operation to said data cache.
25. The microprocessor as recited in claim 24, wherein said buffer is configured to receive and store multiple pending memory instructions in a particular clock cycle.
26. The microprocessor as recited in claim 25, wherein said load pointer is adjusted in response to the storage of a new load memory operation into said buffer, wherein said store pointer is unchanged in response to the storage of said new load memory operation.
27. The microprocessor as recited in claim 26, wherein said storage locations within said buffer are configured as a one-dimensional array having a first end and a second end, wherein said input control unit is configured to store pending load instructions in the closest available storage location to said first end, wherein said input control unit is configured to store pending store instructions in the closest available storage location to said second end.
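The three-level selection scheme recited in claims 6, 17 and 19 can be illustrated with a minimal sketch. The `MemOp` record, its field names, the numeric priority values, and the tie-break by buffer position are all hypothetical choices made for illustration; the claims state only the relative ordering of the three classes. Operations that fall into none of the three classes are simply not selected here, which is likewise an assumption.

```python
from dataclasses import dataclass


@dataclass
class MemOp:
    """Hypothetical record for one pending memory operation."""
    kind: str                     # "load" or "store"
    speculative: bool             # e.g. derived from a reorder buffer pointer
    accessed_cache: bool = False  # has already attempted a data cache access
    known_miss: bool = False      # that earlier access missed the data cache


def priority(op):
    """Return 0 (high), 1 (intermediate), 2 (low), or None if not eligible."""
    if op.kind == "store" and not op.speculative:
        return 0
    if not op.speculative and op.accessed_cache and op.known_miss:
        return 1
    if op.kind == "load" and not op.accessed_cache:
        return 2
    return None  # assumption: anything else waits in the buffer


def select(ops):
    """Pick the operation with the best priority; ties go to the lowest index."""
    eligible = [(priority(op), i) for i, op in enumerate(ops)
                if priority(op) is not None]
    if not eligible:
        return None
    return ops[min(eligible)[1]]
```

With a speculative load, a non-speculative load known to miss, and a non-speculative store all pending, `select` returns the store; without the store it returns the known miss.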
Description
BACKGROUND OF THE INVENTION
1. Field of the Invention
This invention relates to superscalar microprocessors and, more particularly, to a load/store unit of a superscalar microprocessor.
2. Description of the Relevant Art
Superscalar microprocessors obtain high performance in computer systems by attempting to execute multiple instructions concurrently. One important way in which superscalar microprocessors achieve high performance is through the use of speculative execution of instructions. As used herein, an instruction is speculatively executed if it is executed before the execution is known to be required by the program being executed. An instruction may be executed speculatively, for example, if a branch instruction is ahead of it in program instruction sequence and the processor has not yet calculated which path through the program the branch will select. Many other examples of speculatively executing instructions exist in superscalar microprocessors.
Due to the widespread popularity and acceptance of the x86 architecture, microprocessor designers have made efforts to create superscalar microprocessors that implement this architecture. By supporting this architecture, such designers advantageously maintain backwards compatibility with previous implementations such as the 8086, 80286, 80386, and 80486; and the large amount of software written for these implementations.
Superscalar microprocessors are employed within computer systems. These computer systems typically contain a variety of other devices including fixed disk drives, video display adapters, floppy disk drives, etc. Also needed in computer systems is a relatively large main memory which stores the instructions that the microprocessor will execute and data the microprocessor will manipulate, until such data or instructions are requested by the microprocessor. This memory is typically composed of dynamic random access memory chips, herein referred to as "DRAM". The amount of time necessary from the request of a storage location within the DRAM to the data becoming available at the outputs of the DRAM chips, herein referred to as DRAM access time, has not decreased significantly. Instead, as semiconductor fabrication technology has improved, DRAM manufacturers have chosen to make larger amounts of DRAM memory available on a single monolithic chip. Although a single memory location in a modern DRAM can react much faster than locations in older DRAM, the larger number of locations available loads the outputs of the DRAM, making the DRAM access time substantially the same from generation to generation of DRAM devices. However, superscalar microprocessor designers have used semiconductor manufacturing technology improvements to create microprocessors that run at faster clock rates and that are capable of executing more instructions simultaneously. As used herein "clock cycle" or "clock rate" refers to a unit of time in which a microprocessor performs its various functions, such as instruction execution, memory request, etc. At the end of a clock cycle, the results for that cycle (such as the result an instruction execution produces) are saved so that another part of the microprocessor (i.e. a subsequent pipe stage) will have the results available in the next clock cycle for subsequent manipulation or storage. 
As a result of the aforementioned speed difference between modern microprocessors and DRAM memory, the memory bandwidth requirements of microprocessors have increased but available memory bandwidth has not increased. In other words, more recent microprocessors are running substantially faster than older microprocessors and are coupled to larger DRAM memories (allowing larger applications and data sets) that are running at a speed similar to previous versions of DRAM memories. A large performance problem can be seen with this configuration, in that the microprocessor in many cases will be waiting for instructions and data to be provided by memory, reducing the computer system's overall performance.
Superscalar microprocessor designers have made efforts to solve the problem of accessing a slow memory. Part of this solution involves including caches in the microprocessor designs. Caches are small, fast memories that are either included on the same monolithic chip with the microprocessor core, or are coupled nearby. Data and instructions that have been used recently by the microprocessor are typically stored in these caches, and are written back to memory after the instructions and data have not been accessed by the microprocessor for some time. The amount of time necessary before instructions and data are vacated from the cache and the particular algorithm used therein vary significantly among microprocessor designs, and both are well known. Data and instructions may be stored in a shared cache, variously referred to as a combined cache or a unified cache. Also, data and instructions may be stored in distinctly separated caches, typically referred to as an instruction cache and a data cache.
Caches are typically organized as an array of "lines". The term "line" is used herein to refer to some number of memory locations configured to store contiguous bytes of data or instructions from main memory. When the microprocessor accesses the cache, a portion of the address is used to "index" the cache. Indexing the cache refers to choosing a line or set of lines to access, searching for the contents of the address being requested. If one of the lines so examined contains the data or instructions that reside in main memory at the requested address, then the access is said to be a "hit". If none of the lines selected in accordance with the above indexing contains the data or instructions that reside in main memory at the requested address, then the access is said to be a "miss". When the cache is configured such that more than one line is associated with a given index, then the lines are typically referred to as "ways" of that index.
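The indexing, hit, miss, and way terminology above can be illustrated with a minimal sketch of a set-associative lookup. The line size, the number of indexes and ways, the `Cache` and `Line` names, and the fill policy on a miss (first invalid way, otherwise way 0) are assumptions chosen for illustration and are not taken from this document.

```python
from dataclasses import dataclass

LINE_BYTES = 32  # assumed number of contiguous bytes per line
NUM_SETS = 4     # assumed number of indexes
NUM_WAYS = 2     # assumed number of ways per index


@dataclass
class Line:
    valid: bool = False
    tag: int = 0


def split_address(addr):
    """Split a byte address into (tag, index, offset within the line)."""
    offset = addr % LINE_BYTES
    index = (addr // LINE_BYTES) % NUM_SETS
    tag = addr // (LINE_BYTES * NUM_SETS)
    return tag, index, offset


class Cache:
    def __init__(self):
        self.sets = [[Line() for _ in range(NUM_WAYS)] for _ in range(NUM_SETS)]

    def access(self, addr):
        """Return "hit" or "miss"; on a miss, fill one way of the indexed set."""
        tag, index, _ = split_address(addr)
        ways = self.sets[index]
        for line in ways:
            if line.valid and line.tag == tag:
                return "hit"
        # Assumed fill policy: first invalid way, otherwise overwrite way 0.
        victim = next((line for line in ways if not line.valid), ways[0])
        victim.valid, victim.tag = True, tag
        return "miss"
```

The first access to address 0x100 misses and fills a line; a following access to 0x104 falls in the same line and hits.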
Some caches are capable of handling multiple accesses simultaneously. Caches configured in this way may have "banks" wherein the cache memory cells are configured into separately accessible portions. Therefore, one access can address one bank, and a second access a second, independent bank, and so on.
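As a rough illustration of banking, the sketch below assumes that banks are selected by interleaving low-order address bits, which is one common arrangement but is not stated in this document. The constants and function names are hypothetical.

```python
NUM_BANKS = 4   # assumed number of separately accessible banks
BANK_BYTES = 8  # assumed number of consecutive bytes mapped to one bank


def bank_of(addr):
    """Bank that holds the byte at addr under the assumed interleaving."""
    return (addr // BANK_BYTES) % NUM_BANKS


def can_access_together(addr_a, addr_b):
    """Two accesses can proceed in the same cycle if they use different banks."""
    return bank_of(addr_a) != bank_of(addr_b)
```

Under these assumptions, addresses 0 and 8 map to different banks and can be accessed together, while addresses 0 and 32 collide in bank 0.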
As superscalar microprocessor designers have continued to increase the number of instructions that are executed concurrently, caches have become an insufficient solution to the performance problems associated with large, slow memories. First, the caches are much smaller than the main memory. Therefore, it is always true that some data or instructions requested by the microprocessor will not be currently residing in the cache. The chips and/or silicon area required to build caches are expensive, so making the caches larger increases the overall computer system cost significantly. Second, caches typically hold data and instructions that have been previously requested by the microprocessor. Therefore, whenever the microprocessor begins a new program or accesses a memory location for the first time, a significant number of accesses to the main memory are required. When used in the context of a superscalar microprocessor as described herein, access means either a request for the contents of a memory location or the modification of the contents thereof. Third, in modern day microprocessors the amount of time necessary to access data or instructions in the cache is becoming a performance problem in the same way that DRAM access times have been.
In an attempt to solve some of the problems associated with caches, some microprocessors implement a "prefetching algorithm" wherein the microprocessor attempts to guess which memory locations it will be accessing in the near future and makes main memory requests for these locations. These schemes have had varying degrees of success. However, such schemes can also deleteriously affect the performance of the microprocessor in some situations. Whenever a significant number of wrong guesses are made, the microprocessor will replace data or instructions in the cache with the contents of memory locations that it does not need. This, in turn, will cause memory references to retrieve the data that had been replaced by the prefetched data.
Retrieving data from main memory is typically performed in superscalar microprocessors through the use of a load instruction. This instruction may be explicit, wherein the load instruction is actually coded into the software being executed. This instruction may also be implicit, wherein some other instruction (an add, for example) directly requests the contents of a memory location as part of its input operands.
Storing the results of instructions back to main memory is typically performed in superscalar microprocessors through the use of a store instruction. As with the aforementioned load instruction, the store instruction may be explicit or implicit. As used herein, "memory operations" will be used to refer to load and/or store instructions.
In modern superscalar microprocessors, memory operations are typically executed in one or more load/store units. These units execute the instruction, access the data cache (if one exists) attempting to find the requested data, and handle the result of the access. As described above, data cache access typically has one of two results: a miss or a hit.
A load/store unit typically also handles other, special conditions associated with memory operations. For example, an access may be "unaligned" or "misaligned". A memory operation requests or modifies data of a particular size, typically measured in bytes. The size for a particular memory operation depends on many things, including the architecture that the microprocessor is implemented to and the particular instruction that created the memory operation. A memory operation is said to be unaligned or misaligned if the address calculated by the memory operation does not have a number of zeros in its least significant binary digits (or "bits") equal to or greater than the sum of 2 raised to a power equal to the size of the requested datum and minus one. The formula for calculating the required number of least significant zeros is:
Unaligned accesses sometimes require multiple accesses to the data cache and/or memory.
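A minimal sketch of an alignment check follows. It assumes the conventional definition of natural alignment (for a power-of-two access size, the address is a multiple of that size) rather than any formula specific to this patent. The line size and function names are hypothetical. The second function shows why an unaligned access may need more than one data cache access: it can straddle a line boundary.

```python
LINE_BYTES = 32  # assumed cache line size in bytes


def is_aligned(addr, size):
    """True if addr is a multiple of size; size must be a power of two."""
    if size <= 0 or size & (size - 1):
        raise ValueError("size must be a positive power of two")
    return addr & (size - 1) == 0


def lines_touched(addr, size):
    """Number of cache lines covered by a size-byte access starting at addr."""
    first = addr // LINE_BYTES
    last = (addr + size - 1) // LINE_BYTES
    return last - first + 1
```

For example, a 4-byte access at address 0x1002 is unaligned under this definition, and a 4-byte access starting at byte 30 of a 32-byte line touches two lines.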
Most instructions that a microprocessor executes ultimately receive their operands from main memory or the data cache. The operands a particular instruction receives may have been requested from memory directly, or may be the result of some other instruction whose operands were requested from memory. Therefore, the performance of a superscalar microprocessor when running many programs is dependent in large part on how quickly the load/store unit can execute memory operations. In many superscalar microprocessors, the load/store unit executes one memory operation per clock cycle. Also, if a memory operation is found to miss the data cache, the load/store unit often ceases instruction execution until the missed address has been transferred from main memory. Thus, a memory operation that misses the data cache "blocks" subsequent memory operations from executing, even if they may hit the data cache. Blocking the subsequent memory accesses in many cases deleteriously affects performance of the superscalar microprocessor because instructions that require the data from the memory accesses cannot execute as quickly as might otherwise be possible.
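The cost of blocking can be made concrete with a small cycle-count sketch. It assumes one memory operation per clock cycle and a fixed, hypothetical miss penalty; the function name and the default penalty of 10 cycles are illustrative only.

```python
def completion_cycles(outcomes, miss_penalty=10):
    """Cycle at which each operation finishes when a miss blocks later ones.

    outcomes is a sequence of "hit" or "miss" strings in program order.
    """
    cycle = 0
    done = []
    for outcome in outcomes:
        cycle += 1                 # one memory operation per clock cycle
        if outcome == "miss":
            cycle += miss_penalty  # unit waits for the transfer from memory
        done.append(cycle)
    return done
```

For the sequence hit, miss, hit, the final hit completes at cycle 13 even though it would have hit the data cache, because it waits behind the miss.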
Some superscalar microprocessors attempt to solve the aforementioned blocking problem by placing miss requests into a buffer between the data cache and the main memory interface. The buffer may be configured, for example, as a queue with a certain number of entries. While this buffering mechanism does help solve the blocking problem, more silicon area on the microprocessor chip is necessary to implement the buffers and the associated control functions. Furthermore, complexities are introduced in the form of comparators between accesses to the cache and the accesses that are currently queued. Without these comparators, multiple requests to the same miss line would be allowed into the buffer, causing multiple transfers to and from main memory to occur, thus deleteriously affecting performance. Only one transfer to or from main memory is necessary; as a result, the other memory operations that access the same line may fetch their data from the data cache. If more than one transfer to or from main memory of a given line are queued, these extraneous transfers will delay further requests for main memory, deleteriously affecting performance. Exemplary forms of superscalar microprocessors implementing such a buffering solution include the PowerPC 601 microprocessor produced by IBM Corporation and Motorola, Inc., and the Alpha 21164 microprocessor produced by Digital Equipment Corporation.
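The role of the comparators can be sketched as follows. This is a simplified, hypothetical model of a miss queue, not a description of the PowerPC 601 or Alpha 21164: the class name, the queue depth, the line size, and the returned status strings are assumptions.

```python
from collections import deque

LINE_BYTES = 32  # assumed cache line size in bytes


class MissQueue:
    """Hypothetical queue of line addresses awaiting transfer from memory."""

    def __init__(self, entries=4):
        self.entries = entries
        self.queue = deque()

    def request(self, addr):
        """Queue a miss unless its line is already queued or the queue is full."""
        line = addr // LINE_BYTES
        if line in self.queue:  # models the comparators against queued misses
            return "merged"
        if len(self.queue) == self.entries:
            return "full"
        self.queue.append(line)
        return "queued"

    def complete_oldest(self):
        """Retire the oldest queued transfer; return its line address or None."""
        return self.queue.popleft() if self.queue else None
```

Two misses to addresses 0x100 and 0x110 fall in the same 32-byte line, so the second request is merged rather than causing a second transfer.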
Another component of a load/store unit that may directly affect performance of a superscalar microprocessor is the number of buffer entries that store memory operations awaiting operands or an opportunity to access the data cache. In many implementations, a queue structure is used for the buffer. Typically, a buffer is provided for load memory operations and another, separate buffer is provided for store memory operations. When one of these buffers fills, a subsequent memory operation of that type may stall instruction execution of the entire microprocessor until it is allowed into the buffer, deleteriously affecting performance. Memory operations are placed into these buffers when dispatched to the load/store unit and are removed when data cache access is attempted, or sometime thereafter. When used in the context of operating on a memory operation, the term "remove" refers to the act of invalidating the storage location containing the memory operation. The act of invalidating may be accomplished, for example, by changing the state of a particular bit associated with the storage location or overwriting the storage location with a new memory operation. Much of the design time for a load/store unit is dedicated to choosing the size of these buffers such that the amount of processor stall time due to these buffers being full is minimized. The choice is further complicated by the fact that buffers require silicon area to implement, so an arbitrarily large number of queues cannot be used. The choice is still further complicated by the fact that the mix of instructions in common software programs is constantly changing, such that studying older programs to choose queue sizes may result in a less than optimal design.
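The stall behaviour of separate buffers, and "remove" as invalidation of a storage location, can be sketched as below. The `Slot` and `FixedBuffer` names and the per-slot valid bit are a hypothetical model; the buffer sizes in the usage note are arbitrary.

```python
class Slot:
    def __init__(self):
        self.valid = False  # cleared bit means the location is free
        self.op = None


class FixedBuffer:
    """Hypothetical buffer dedicated to one kind of memory operation."""

    def __init__(self, size):
        self.slots = [Slot() for _ in range(size)]

    def insert(self, op):
        """Place op in a free slot; False means dispatch would have to stall."""
        for slot in self.slots:
            if not slot.valid:
                slot.valid, slot.op = True, op
                return True
        return False

    def remove(self, op):
        """Invalidate the slot holding op; contents stay until overwritten."""
        for slot in self.slots:
            if slot.valid and slot.op == op:
                slot.valid = False
                return True
        return False
```

With a two-entry load buffer and a one-entry store buffer, a second store is refused while the first is pending, even if the load buffer is completely empty.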
SUMMARY OF THE INVENTION
The problems outlined above are in large part solved by a superscalar microprocessor employing a load/store unit with a unified load/store buffer in accordance with the present invention. In one embodiment, a load/store buffer is provided which allows both load memory operations and store memory operations to be stored within it. Because each storage location may contain either a load or a store memory operation, the number of available storage locations for load memory operations is maximally the number of storage locations in the entire buffer. Similarly, the number of available storage locations for store memory operations is maximally the number of storage locations in the entire buffer.
In the case where a program executes a large number of consecutive load memory operations, a device of the present invention will not cause a stall of instruction execution until the entire buffer is filled with load memory operations. Similarly, in the case where a program executes a large number of consecutive store memory operations, a device of the present invention will not cause a stall of instruction execution until the entire buffer is filled with store memory operations. For previous implementations to equal such performance, a load buffer equal in number of storage locations to the number of storage locations in a device of the present invention and a store buffer equal in number of storage locations to the number of storage locations in a device of the present invention would be required. More importantly, some information that is stored in the unified buffer for load and store memory operations is similar, and thus can be stored in the same position within a given storage location. For the case of separate buffers, these storage location positions are duplicated for each buffer. Hence, the silicon area needed to implement a buffer in accordance with the present invention versus the silicon area needed to implement separate buffers with comparable performance characteristics is considerably less than half.
This embodiment further solves the problem of choosing the number of buffer storage locations to allocate for load memory operations and store memory operations. In this embodiment, stall conditions will only occur due to the total number of pending memory operations, instead of the number of pending load memory operations or the number of store operations. Thus, for a given number of storage locations, the single buffer will perform better in most circumstances than a set of separate buffers with total number of storage locations among the plurality of separate buffers equal to the number of storage locations in the unified buffer. More importantly, as instruction mixes in common programs change over time, the single buffer will still perform well, where the separate buffer approach may be deleteriously affected.
For example, a microprocessor designer might determine that the current instruction mix in common programs requires three load buffers and one store buffer to perform well. Over time, as program compilers improve and the programs that are commonly run change, the instruction mixes may change. When the instruction mixes change, the optimal number of buffers might change. As an example, the optimal number of buffers might become two load and two store buffers. When the one store buffer is full and a second store attempts to execute, a stall condition would occur until the first store completes. However, if a buffer according to the present invention were used with, as an example, four buffers, then when the older code is run, it would tend to operate with three load memory operations and one store memory operation in it when full. More importantly, when the newer code is run, the buffer would tend to operate with two load memory operations and two store memory operations in it when full. No new stall conditions would occur, and performance would be better. Even more importantly, the prior art buffers perform well for the average instruction mix over many programs. However, no single program contains exactly that average. The prior art buffers will be insufficient for some of the programs studied. The buffer of the present invention, however, is more flexible in that it dynamically allocates its buffers to load or store memory operations, and therefore is more likely to be sufficient for a wide variety of programs.
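By way of illustration only, the three-load/one-store example above can be sketched in Python. The function names and the representation of pending operations as a list of strings are assumptions made for this sketch and do not appear in the description.

```python
from collections import Counter

def fits_split(pending, load_slots, store_slots):
    """True if separate load and store buffers can hold every pending operation."""
    counts = Counter(pending)
    return counts["load"] <= load_slots and counts["store"] <= store_slots

def fits_unified(pending, total_slots):
    """True if a unified buffer can hold every pending operation."""
    return len(pending) <= total_slots

# Hypothetical instruction mixes taken from the example in the text.
old_mix = ["load", "load", "load", "store"]
new_mix = ["load", "load", "store", "store"]

for name, mix in (("old", old_mix), ("new", new_mix)):
    print(name, "split 3+1:", fits_split(mix, 3, 1), "unified 4:", fits_unified(mix, 4))
```

With the same four storage locations, the split arrangement would stall on the newer mix because the second store has no place to go, while the unified arrangement holds both mixes.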
In another embodiment, a device of the present invention is configured to store memory requests that miss the data cache until such time as they are allowed to make a main memory request. In this way, other memory operations that may be waiting for an opportunity to access the data cache may make such accesses, while the memory operations that have missed await an opportunity to make a main memory request. Therefore, the device of the present invention solves the aforementioned "blocking" problem.
One miss is permitted to make a request to main memory, and when the line associated with the request is stored into the data cache, misses are allowed to reaccess the data cache. Those whose addresses are contained in the newly received line will then be completed as data cache hits. This implementation advantageously removes the buffers used in previous implementations to store data cache misses, along with some of the control necessary to operate those buffers. In particular, the comparators that were required to restrict accesses to one per missed line are removed. Instead, the misses remain in the unified buffer until one miss is transferred into the cache from main memory, then misses are attempted to the data cache again. If a memory operation remains a miss after this access, it will continue to reside in the buffer, and another request for main memory transfer will be initiated.
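The retry behavior described above may be sketched as follows. This is a simplified model under stated assumptions: the line size of 32 bytes, the function names, and the use of a Python set to stand in for the data cache are all hypothetical and chosen only for illustration.

```python
LINE_BYTES = 32  # assumed line size, not taken from the description

def line_of(address):
    """Return the index of the cache line that contains the address."""
    return address // LINE_BYTES

def service_one_miss(missed, cache_lines):
    """Fetch the line for one miss, then let every miss reaccess the cache.

    `missed` lists the addresses of operations still held in the buffer as
    misses; `cache_lines` is the set of line indexes present in the cache.
    Returns the addresses that remain misses after the retry.
    """
    if not missed:
        return []
    cache_lines.add(line_of(missed[0]))  # one main memory request
    return [a for a in missed if line_of(a) not in cache_lines]

cache = set()
remaining = service_one_miss([0x100, 0x104, 0x200], cache)
print([hex(a) for a in remaining])
```

In this model the two operations that fall in the same line complete together after a single transfer, without any comparator restricting requests to one per missed line.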
In another embodiment, the load/store unit executes unaligned memory operations. Unaligned load memory operations are executed in consecutive clock cycles with consecutive accesses to the data cache. Unaligned store memory accesses are executed as simultaneous accesses on separate ports of the data cache. Thus, the device of the present invention is configured to correctly execute unaligned memory operations.
In yet another embodiment, the load/store unit executes multiple memory operations simultaneously as long as the memory operations do not access the same bank. This embodiment can therefore be connected to a data cache that is configured to accept simultaneous requests only in so far as they do not access the same bank.
In still a further embodiment, the load/store unit contains a buffer whose storage locations are allocated for memory operations according to a pointer. Load and store memory operations can then be intermixed in the buffer.
In another embodiment, the load/store unit contains a buffer whose storage locations are allocated for store memory operations according to one pointer (herein called a store pointer) and whose storage locations are allocated for load memory operations according to another pointer (herein called a load pointer). The store pointer advances from one end of the buffer and the load pointer advances from the other end. Therefore, load instructions are placed into the buffer starting at one end and store instructions are placed into the buffer from the other end. This embodiment maintains the separation of load and store memory operations that two separate buffer solutions have, while allowing any storage location to be used for either a load or a store.
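A minimal sketch of the two-pointer allocation described above follows. The class name, the choice of which end the loads start from, and the omission of entry removal are assumptions made for brevity and are not taken from the description.

```python
class TwoPointerBuffer:
    """Unified buffer with loads allocated from one end and stores from the other."""

    def __init__(self, size):
        self.slots = [None] * size
        self.load_ptr = 0           # next free slot for loads, advances upward
        self.store_ptr = size - 1   # next free slot for stores, advances downward

    def allocate(self, kind, tag):
        """Place an operation and return its slot index, or None if the buffer is full."""
        if kind not in ("load", "store"):
            raise ValueError("kind must be 'load' or 'store'")
        if self.load_ptr > self.store_ptr:
            return None  # the pointers have met: a stall would occur here
        if kind == "load":
            index = self.load_ptr
            self.load_ptr += 1
        else:
            index = self.store_ptr
            self.store_ptr -= 1
        self.slots[index] = (kind, tag)
        return index

buf = TwoPointerBuffer(4)
print([buf.allocate(k, t) for t, k in enumerate(["load", "store", "load", "load", "store"])])
```

Any slot can hold either kind of operation, yet loads and stores stay in separate contiguous regions, and the buffer is full only when the two pointers cross.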
In another embodiment, load memory operations are removed from the buffer when the memory operation is determined to hit the data cache, or when they are cancelled by the reorder buffer. Store memory operations are removed from the buffer when they are determined to hit and are indicated by the reorder buffer to be non-speculative, or when they are cancelled by the reorder buffer.
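The removal conditions above can be expressed as a single predicate. The function name and the boolean parameters are hypothetical and serve only to restate the conditions in executable form.

```python
def can_remove(kind, hit, non_speculative, cancelled):
    """Return True if an operation may leave the load/store buffer."""
    if kind not in ("load", "store"):
        raise ValueError("kind must be 'load' or 'store'")
    if cancelled:
        return True                     # cancelled by the reorder buffer
    if kind == "load":
        return hit                      # loads leave as soon as they hit
    return hit and non_speculative      # stores must also be non-speculative

print(can_remove("load", hit=True, non_speculative=False, cancelled=False))
print(can_remove("store", hit=True, non_speculative=False, cancelled=False))
```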
Broadly speaking, the invention contemplates a load/store unit comprising a buffer, an input control unit, and an output control unit. The buffer includes a plurality of storage locations configured to store information regarding pending memory operations. The buffer further includes an input port configured to receive the memory operation information. The buffer also includes a data cache port configured to communicate data access commands to a data cache.
The input control unit of the invention is coupled to the buffer, and is configured to direct the transfer of memory operation information from the input port to a particular storage location within the buffer.
The output control unit of the invention is similarly coupled to the buffer, and is configured to select a memory operation stored within one of the plurality of storage locations within the buffer to access the data cache. The output control unit is further configured to direct data cache access commands associated with the operation to the data cache. Also, the output control unit is configured to remove a load or store memory operation from the buffer under certain conditions.
The invention further contemplates a method for operating a load/store unit comprising several steps. First, memory operations are accepted into the buffer from the input port. Second, information associated with the memory operations is accepted into the buffer from the input port. Third, memory operations are selected to access the data cache in a given clock cycle. Fourth, memory operations that have hit the data cache are removed from the buffer. Fifth, memory operations that have been cancelled are removed from the buffer.
BRIEF DESCRIPTION OF THE DRAWINGS
Other objects and advantages of the invention will become apparent upon reading the following detailed description and upon reference to the accompanying drawings in which:
FIG. 1 is a block diagram of a superscalar microprocessor which includes a load/store unit coupled to a data cache, 6 functional units and 6 decode units.
FIG. 2 is a block diagram of a load/store unit in accordance with the present invention coupled to a data cache.
FIG. 3 is a block diagram of a load/store buffer in accordance with the present invention.
FIG. 4A is a diagram of a storage location from the load/store buffer shown in FIG. 3.
FIG. 4B is a diagram of several clock cycles indicating when certain information arrives at the load/store buffer shown in FIG. 3 and certain other functions associated with operating the load/store buffer.
FIG. 4C is a block diagram showing store data forwarding for loads that access memory locations that are currently represented by stores in the load/store buffer.
FIG. 4D is a block diagram showing the layout of various sections of the load/store unit of the present invention.
FIG. 5 is a block diagram of a superscalar microprocessor.
FIG. 6 is a block diagram of a pipeline for calculating addresses within processor 500.
FIG. 7 shows a programmer's view of the x86 register file.
FIG. 8 is a block diagram which shows the speculative hardware for the stack relative cache 520.
FIG. 9 is a block diagram which illustrates a portion of an exemplary embodiment of processor 500.
FIG. 10 is a block diagram of the alignment and decode structure of processor 500.
FIGS. 11, 12, 13 and 14 show the cycle during which each instruction would be decoded and issued, and to which issue positions each instruction would be dispatched.
FIG. 15 illustrates processor 500 pipeline execution cycle with a branch misprediction detected.
FIG. 16 illustrates processor 500 pipeline execution cycle with a successful branch prediction.
FIGS. 17, 18, 19 and 20 are block diagrams of instruction cache 502.
FIG. 21 is a block diagram of a global branch predictor.
FIG. 22 is a block diagram of the ICNXTBLK block.
FIG. 23 is a block diagram of the ICPREFIX block.
FIGS. 24 and 25 are block diagrams of ICALIGN block.
FIG. 26 shows an embodiment of the ICCNTL state machine.
FIG. 27 is a block diagram of the Icache and fetching mechanism.
FIG. 28 shows the conditions necessary to validate the instruction and each byte.
FIG. 29 is a block diagram of hardware within processor 500 which is used to calculate linear addresses and identify register operands.
FIG. 30 is a block diagram showing how operands are identified and provided to the reservation stations and functional units.
FIG. 31 is a block diagram of the return stack mechanism.
FIG. 32 is a block diagram of the MROM Interface Unit (MIU).
FIG. 33 is a block diagram showing how processor 500 extends the register set for MROM instructions.
FIG. 34 is a block diagram of how two-cycle fast path instructions are handled.
FIG. 35 is a block diagram of the layout of the processor 500 instruction decode unit.
FIG. 36 is a block diagram showing how the LOROB interfaces with other processor 500 units.
FIG. 37 shows the layout of the result data of the LOROB, the stack cache, and the register file.
FIG. 38 is a block diagram of the matrix for dependency checking in the LOROB.
FIG. 39 is a block diagram showing the dependency checking required for store operations.
FIG. 40 is a block diagram showing the dependency checking required for load operations.
FIG. 41 is a block diagram of a layout of the LOROB.
FIG. 42 is a block diagram of the stack cache.
FIG. 43 is a block diagram of the look-ahead ESP and EBP register models.
FIG. 44 is a block diagram of the current within line dependency checking unit.
FIG. 45 is a block diagram illustrating how the last in line bits are set.
FIG. 46 is a block diagram illustrating the previous lines dependency checking operation performed in the LOROB.
FIG. 47 is a block diagram showing portions of processor 500 which interface with the register file and special register block.
FIG. 48 is a block diagram of a reservation station.
FIG. 49 is a block diagram of the bus structure for the reservation stations.
FIG. 50 is a reservation station timing diagram.
FIG. 51 is a block diagram of a functional unit.
FIG. 52 is a code sequence showing how the same instructions could receive tags/operands from different sources.
FIG. 53 is a block diagram of the load/store section.
FIG. 54 is a block diagram of the unified load-store buffer.
FIG. 55 is a block diagram of a load-store buffer entry.
FIG. 56 is a timing diagram showing when the different fields in each entry of the buffer are updated.
FIG. 57 is a block diagram which illustrates store data forwarding for loads.
FIG. 58 shows a layout configuration of the LSSEC.
FIG. 59 shows the relative position of the LSSEC with respect to other units.
FIG. 60 is a block diagram of the data cache.
FIG. 61 is a block diagram of a tag array entry.
FIG. 62 is a block diagram of a way prediction entry.
FIG. 63 is a timing diagram for dcache load accesses.
FIG. 64 is a block diagram showing way prediction array entry usage for loads.
FIG. 65 is a timing diagram for dcache store accesses.
FIG. 66 is a timing diagram for unaligned load accesses.
FIG. 67 is a timing diagram for unaligned store accesses.
FIG. 68 is a timing diagram for DC/SC line transfers.
While the invention is susceptible to various modifications and alternative forms, specific embodiments thereof are shown by way of example in the drawings and will herein be described in detail. It should be understood, however, that the drawings and detailed description thereto are not intended to limit the invention to the particular form disclosed, but on the contrary, the intention is to cover all modifications, equivalents and alternatives falling within the spirit and scope of the present invention as defined by the appended claims.
DETAILED DESCRIPTION OF THE INVENTION
Turning now to the drawings, FIG. 1 shows a block diagram of a superscalar microprocessor 200 including a load/store unit 222 in accordance with the present invention. As illustrated in the embodiment of FIG. 1, superscalar microprocessor 200 includes a prefetch/predecode unit 202 and a branch prediction unit 220 coupled to an instruction cache 204. Instruction alignment unit 206 is coupled between instruction cache 204 and a plurality of decode units 208A-208F (referred to collectively as decode units 208). Each decode unit 208A-208F is coupled to a respective reservation station unit 210A-210F (referred to collectively as reservation stations 210), and each reservation station 210A-210F is coupled to a respective functional unit 212A-212F (referred to collectively as functional units 212). Decode units 208, reservation stations 210, and functional units 212 are further coupled to a reorder buffer 216, a register file 218 and a load/store unit 222. A data cache 224 is finally shown coupled to load/store unit 222, and an MROM unit 209 is shown coupled to instruction alignment unit 206.
Generally speaking, instruction cache 204 is a high speed cache memory provided to temporarily store instructions prior to their dispatch to decode units 208. In one embodiment, instruction cache 204 is configured to cache up to 32 kilobytes of instruction code organized in lines of 16 bytes each (where each byte consists of 8 bits). During operation, instruction code is provided to instruction cache 204 by prefetching code from a main memory (not shown) through prefetch/predecode unit 202. It is noted that instruction cache 204 could be implemented in a set-associative, a fully-associative, or a direct-mapped configuration.
Prefetch/predecode unit 202 is provided to prefetch instruction code from the main memory for storage within instruction cache 204. In one embodiment, prefetch/predecode unit 202 is configured to burst 64-bit wide code from the main memory into instruction cache 204. It is understood that a variety of specific code prefetching techniques and algorithms may be employed by prefetch/predecode unit 202.
As prefetch/predecode unit 202 fetches instructions from the main memory, it generates three predecode bits associated with each byte of instruction code: a start bit, an end bit, and a "functional" bit. The predecode bits form tags indicative of the boundaries of each instruction. The predecode tags may also convey additional information such as whether a given instruction can be decoded directly by decode units 208 or whether the instruction must be executed by invoking a microcode procedure controlled by MROM unit 209, as will be described in greater detail below.
Table 1 indicates one encoding of the predecode tags. As indicated within the table, if a given byte is the first byte of an instruction, the start bit for that byte is set. If the byte is the last byte of an instruction, the end bit for that byte is set. If a particular instruction cannot be directly decoded by the decode units 208, the functional bit associated with the first byte of the instruction is set. On the other hand, if the instruction can be directly decoded by the decode units 208, the functional bit associated with the first byte of the instruction is cleared. The functional bit for the second byte of a particular instruction is cleared if the opcode is the first byte, and is set if the opcode is the second byte. It is noted that in situations where the opcode is the second byte, the first byte is a prefix byte. The functional bit values for instruction byte numbers 3-8 indicate whether the byte is a MODRM or an SIB byte, as well as whether the byte contains displacement or immediate data.
TABLE 1
Encoding of Start, End and Functional Bits
______________________________________________________________________
Instr.    Start    End      Functional
Byte      Bit      Bit      Bit
Number    Value    Value    Value        Meaning
______________________________________________________________________
1         1        X        0            Fast decode
1         1        X        1            MROM instr.
2         0        X        0            Opcode is first byte
2         0        X        1            Opcode is this byte, first byte is prefix
3-8       0        X        0            Mod R/M or SIB byte
3-8       0        X        1            Displacement or immediate data; the second
                                         functional bit set in bytes 3-8 indicates
                                         immediate data
1-8       X        0        X            Not last byte of instruction
1-8       X        1        X            Last byte of instruction
______________________________________________________________________
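As an illustrative sketch only, the following Python function shows how the start, end and functional bits of Table 1 could be used to recover instruction boundaries and to classify each instruction as fast path or MROM. The function name and the list-of-bits representation are assumptions; the meaning of the functional bit on bytes after the first is ignored here.

```python
def find_instructions(start_bits, end_bits, functional_bits):
    """Return (first_byte, last_byte, kind) for each complete instruction.

    Each argument holds one predecode bit per instruction byte. The kind is
    taken from the functional bit of the first byte, as in Table 1.
    """
    instructions = []
    first = None
    kind = None
    for i, (s, e, f) in enumerate(zip(start_bits, end_bits, functional_bits)):
        if s:
            first = i
            kind = "MROM" if f else "fast path"
        if e and first is not None:
            instructions.append((first, i, kind))
            first = None
    return instructions

# Hypothetical six-byte sequence: a 2-byte fast path instruction, a 1-byte
# MROM instruction, and a 3-byte fast path instruction with a prefix byte.
start = [1, 0, 1, 1, 0, 0]
end = [0, 1, 1, 0, 0, 1]
functional = [0, 0, 1, 0, 1, 0]
print(find_instructions(start, end, functional))
```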
As stated previously, in one embodiment certain instructions within the x86 instruction set may be directly decoded by decode unit 208. These instructions are referred to as "fast path" instructions. The remaining instructions of the x86 instruction set are referred to as "MROM instructions". MROM instructions are executed by invoking MROM unit 209. When an MROM instruction is encountered, MROM unit 209 parses and serializes the instruction into a subset of defined fast path instructions to effectuate a desired operation. A listing of exemplary x86 instructions categorized as fast path instructions as well as a description of the manner of handling both fast path and MROM instructions will be provided further below.
Instruction alignment unit 206 is provided to channel or "funnel" variable byte length instructions from instruction cache 204 to fixed issue positions formed by decode units 208A-208F. Instruction alignment unit 206 is configured to channel instruction code to designated decode units 208A-208F depending upon the locations of the start bytes of instructions within a line as delineated by instruction cache 204. In one embodiment, the particular decode unit 208A-208F to which a given instruction may be dispatched is dependent upon both the location of the start byte of that instruction as well as the location of the previous instruction's start byte, if any. Instructions starting at certain byte locations may further be restricted for issue to only one predetermined issue position. Specific details follow.
Before proceeding with a detailed description of the load/store unit 222, general aspects regarding other subsystems employed within the exemplary superscalar microprocessor 200 of FIG. 1 will be described. For the embodiment of FIG. 1, each of the decode units 208 includes decoding circuitry for decoding the predetermined fast path instructions referred to above. In addition, each decode unit 208A-208F routes displacement and immediate data to a corresponding reservation station unit 210A-210F. Output signals from the decode units 208 include bit-encoded execution instructions for the functional units 212 as well as operand address information, immediate data and/or displacement data.
The superscalar microprocessor of FIG. 1 supports out of order execution, and thus employs reorder buffer 216 to keep track of the original program sequence for register read and write operations, to implement register renaming, to allow for speculative instruction execution and branch misprediction recovery, and to facilitate precise exceptions. As will be appreciated by those of skill in the art, a temporary storage location within reorder buffer 216 is reserved upon decode of an instruction that involves the update of a register to thereby store speculative register states. Reorder buffer 216 may be implemented in a first-in-first-out configuration wherein speculative results move to the "bottom" of the buffer as they are validated and written to the register file, thus making room for new entries at the "top" of the buffer. Other specific configurations of reorder buffer 216 are also possible, as will be described further below. If a branch prediction is incorrect, the results of speculatively-executed instructions along the mispredicted path can be invalidated in the buffer before they are written to register file 218.
The bit-encoded execution instructions and immediate data provided at the outputs of decode units 208A-208F are routed directly to respective reservation station units 210A-210F. In one embodiment, each reservation station unit 210A-210F is capable of holding instruction information (i.e., bit encoded execution bits as well as operand values, operand tags and/or immediate data) for up to three pending instructions awaiting issue to the corresponding functional unit. It is noted that for the embodiment of FIG. 1, each decode unit 208A-208F is associated with a dedicated reservation station unit 210A-210F, and that each reservation station unit 210A-210F is similarly associated with a dedicated functional unit 212A-212F. Accordingly, six dedicated "issue positions" are formed by decode units 208, reservation station units 210 and functional units 212. Instructions aligned and dispatched to issue position 0 through decode unit 208A are passed to reservation station unit 210A and subsequently to functional unit 212A for execution. Similarly, instructions aligned and dispatched to decode unit 208B are passed to reservation station unit 210B and into functional unit 212B, and so on.
Upon decode of a particular instruction, if a required operand is a register location, register address information is routed to reorder buffer 216 and register file 218 simultaneously. Those of skill in the art will appreciate that the x86 register file includes eight 32 bit real registers (i.e., typically referred to as EAX, EBX, ECX, EDX, EBP, ESI, EDI and ESP), as will be described further below. Reorder buffer 216 contains temporary storage locations for results which change the contents of these registers to thereby allow out of order execution. A temporary storage location of reorder buffer 216 is reserved for each instruction which, upon decode, modifies the contents of one of the real registers. Therefore, at various points during execution of a particular program, reorder buffer 216 may have one or more locations which contain the speculatively executed contents of a given register. If following decode of a given instruction it is determined that reorder buffer 216 has previous location(s) assigned to a register used as an operand in the given instruction, the reorder buffer 216 forwards to the corresponding reservation station either: 1) the value in the most recently assigned location, or 2) a tag for the most recently assigned location if the value has not yet been produced by the functional unit that will eventually execute the previous instruction. If the reorder buffer has a location reserved for a given register, the operand value (or tag) is provided from reorder buffer 216 rather than from register file 218. If there is no location reserved for a required register in reorder buffer 216, the value is taken directly from register file 218. If the operand corresponds to a memory location, the operand value is provided to the reservation station unit through load/store unit 222.
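The operand lookup described above may be sketched as follows. The data layout (a list of dictionaries ordered oldest to newest, with a value of None meaning "not yet produced") and the function name are assumptions made only for this illustration.

```python
def lookup_operand(register, rob_entries, register_file):
    """Return ("value", v) or ("tag", t) for a register operand.

    `rob_entries` is ordered oldest to newest; each entry has the keys
    'tag', 'dest' and 'value', where 'value' is None until the result
    has been produced by a functional unit.
    """
    for entry in reversed(rob_entries):          # most recently assigned first
        if entry["dest"] == register:
            if entry["value"] is not None:
                return ("value", entry["value"])
            return ("tag", entry["tag"])
    return ("value", register_file[register])    # no location reserved

register_file = {"EAX": 1, "EBX": 2}
rob = [
    {"tag": 0, "dest": "EAX", "value": 10},
    {"tag": 1, "dest": "EAX", "value": None},
]
print(lookup_operand("EAX", rob, register_file))  # most recent EAX result is pending
print(lookup_operand("EBX", rob, register_file))  # falls through to the register file
```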
Details regarding suitable reorder buffer implementations may be found within the publication "Superscalar Microprocessor Design" by Mike Johnson, Prentice-Hall, Englewood Cliffs, N.J., 1991, and within the co-pending, commonly assigned patent application entitled "High Performance Superscalar Microprocessor", Ser. No. 08/146,382, filed Oct. 29, 1993 by Witt, et al. These documents are incorporated herein by reference in their entirety.
Reservation station units 210A-210F are provided to temporarily store instruction information to be speculatively executed by the corresponding functional units 212A-212F. As stated previously, each reservation station unit 210A-210F may store instruction information for up to three pending instructions. Each of the six reservation stations 210A-210F contains locations to store bit-encoded execution instructions to be speculatively executed by the corresponding functional unit and the values of operands. If a particular operand is not available, a tag for that operand is provided from reorder buffer 216 and is stored within the corresponding reservation station until the result has been generated (i.e., by completion of the execution of a previous instruction). It is noted that when an instruction is executed by one of the functional units 212A-212F, the result of that instruction is passed directly to any reservation station units 210A-210F that are waiting for that result at the same time the result is passed to update reorder buffer 216 (this technique is commonly referred to as "result forwarding"). Instructions are issued to functional units for execution after the values of any required operand(s) are made available. That is, if an operand associated with a pending instruction within one of the reservation station units 210A-210F has been tagged with a location of a previous result value within reorder buffer 216 which corresponds to an instruction which modifies the required operand, the instruction is not issued to the corresponding functional unit 212 until the operand result for the previous instruction has been obtained. Accordingly, the order in which instructions are executed may not be the same as the order of the original program instruction sequence. Reorder buffer 216 ensures that data coherency is maintained in situations where read-after-write dependencies occur.
In one embodiment, each of the functional units 212 is configured to perform integer arithmetic operations of addition and subtraction, as well as shifts, rotates, logical operations, and branch operations. It is noted that a floating point unit (not shown) may also be employed to accommodate floating point operations.
Each of the functional units 212 also provides information regarding the execution of conditional branch instructions to the branch prediction unit 220. If a branch prediction was incorrect, branch prediction unit 220 flushes instructions subsequent to the mispredicted branch instruction that have entered the instruction processing pipeline, and causes prefetch/predecode unit 202 to fetch the required instructions from instruction cache 204 or main memory. It is noted that in such situations, results of instructions in the original program sequence which occur after the mispredicted branch instruction are discarded, including those which were speculatively executed and temporarily stored in load/store unit 222 and reorder buffer 216. Exemplary configurations of suitable branch prediction mechanisms are well known.
Results produced by functional units 212 are sent to the reorder buffer 216 if a register value is being updated, and to the load/store unit 222 if the contents of a memory location are changed. If the result is to be stored in a register, the reorder buffer 216 stores the result in the location reserved for the value of the register when the instruction was decoded. As stated previously, results are also broadcast to reservation station units 210A-210F where pending instructions may be waiting for the results of previous instruction executions to obtain the required operand values.
Data cache 224 is a high speed cache memory provided to temporarily store data being transferred between load/store unit 222 and the main memory subsystem. In one embodiment, data cache 224 has a capacity of storing up to eight kilobytes of data. It is understood that data cache 224 may be implemented in a variety of specific memory configurations, including a set associative configuration.
Generally speaking, load/store unit 222 provides an interface between functional units 212A-212F and data cache 224. In one embodiment, load/store unit 222 is configured with a load/store buffer with sixteen storage locations for data and address information for pending load or store memory operations, wherein the storage locations are configured as a linear array of storage locations. However, it is understood that the number of storage locations may vary in further embodiments of the invention. Functional units 212 arbitrate for access to the load/store unit 222. When the buffer is full, a functional unit must wait until the load/store unit 222 has room for the pending load or store request information. The load/store unit 222 also performs dependency checking for load memory operations against pending store memory operations to ensure that data coherency is maintained. Load memory operations may be executed by the load/store unit 222 in a different order than they are provided to the load/store unit 222. Store memory operations are always executed in the order that they were provided.
In one embodiment, decode units 208 indicate to the load/store unit 222 what kind of memory operation each decode unit is decoding in a given cycle. The decode units 208 will indicate one of four possible conditions: no load/store operation has been decoded, a load operation has been decoded, a store operation has been decoded, or a load-op-store operation has been decoded. Load-op-store operations occupy two storage locations in the load/store buffer, one for the load operation and one for the store operation. These operations are then treated as independent operations in the load/store buffer. At least one clock cycle later, the address and the data (for stores) are provided by the functional units 212 to the load/store unit 222. This information is transferred into the storage location that holds the memory operation with which the address and data are associated. This association is determined by comparing reorder buffer tags provided by functional units 212 to reorder buffer tags previously stored in the load/store buffer.
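The two-step fill described above (allocation at decode, then delivery of address and data by tag comparison) can be sketched as follows. The entry layout and function names are hypothetical, and load-op-store operations are left out of this sketch because the description above does not say how their two entries are distinguished during tag comparison.

```python
def decode(buffer, kind, rob_tag):
    """Allocate an entry at decode time; address and data arrive later."""
    if kind not in ("load", "store"):
        raise ValueError("kind must be 'load' or 'store'")
    buffer.append({"kind": kind, "tag": rob_tag, "address": None, "data": None})

def deliver(buffer, rob_tag, address, data=None):
    """Write the address (and data, for stores) into the entry whose tag matches.

    Returns True if a matching entry was found.
    """
    for entry in buffer:
        if entry["tag"] == rob_tag:
            entry["address"] = address
            if entry["kind"] == "store":
                entry["data"] = data
            return True
    return False

buf = []
decode(buf, "load", rob_tag=5)
decode(buf, "store", rob_tag=6)
deliver(buf, 6, address=0x40, data=0xAB)   # arrives at least one cycle after decode
print(buf)
```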
In one embodiment, load and store memory operations that are stored in the load/store buffer are indicated to be no longer speculative by at least one pointer from reorder buffer 216. The pointer is a tag value which can be compared by the load/store unit 222 to the tags stored in the plurality of storage locations within the load/store buffer to update the speculative status of the memory operations stored therein. In another embodiment, the number of pointers provided by the reorder buffer 216 is two.
In one embodiment, the load/store unit selects up to two memory operations per clock cycle to access the data cache. The load/store unit uses a fixed priority scheme for making the selection. The scheme is as follows: stores that are no longer speculative are highest priority, loads that are misses and are no longer speculative are second highest priority, and loads that are speculative and have not yet accessed the cache are last in priority. Stores are higher priority than loads because they are the oldest instructions in the reorder buffer when they are no longer speculative, and it is desirable to retire them as quickly as possible. Load misses also are not processed until they are non-speculative due to the long latency of main memory transfers. If the load is cancelled, the data will not be useful but the long latency transfer will continue, possibly blocking other transfers needing access to main memory.
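A minimal sketch of the fixed priority scheme follows. The entry fields and function names are assumptions for illustration; the sketch follows the three classes literally as stated above and does not select entries outside them, which is a choice made here and not something the description specifies. Ties within a class are broken by buffer position.

```python
def priority(entry):
    """Return 0 (highest) to 2 (lowest), or None if the entry is not eligible."""
    if entry["kind"] == "store" and not entry["speculative"]:
        return 0
    if entry["kind"] == "load" and entry["missed"] and not entry["speculative"]:
        return 1
    if entry["kind"] == "load" and entry["speculative"] and not entry["accessed"]:
        return 2
    return None

def select(buffer, max_accesses=2):
    """Return the buffer indexes of up to `max_accesses` operations to issue."""
    candidates = []
    for index, entry in enumerate(buffer):
        rank = priority(entry)
        if rank is not None:
            candidates.append((rank, index))
    candidates.sort()  # by priority class, then by buffer position
    return [index for _, index in candidates[:max_accesses]]

buffer = [
    {"kind": "load", "speculative": True, "missed": False, "accessed": False},
    {"kind": "load", "speculative": False, "missed": True, "accessed": True},
    {"kind": "store", "speculative": True, "missed": False, "accessed": False},
    {"kind": "store", "speculative": False, "missed": False, "accessed": False},
]
print(select(buffer))
```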
Other considerations that also affect which memory operations are selected to access the data cache are: the alignment of the operation and the bank of the data cache that an operation is going to access. If a load memory operation is selected for the first access of a given cycle and is unaligned, then the second access selected will be either an aligned memory operation or the second access will not be made in the current cycle. In the next cycle, the second half of the unaligned load memory operation is selected as the first access. If a store memory operation is selected for the first access of a given cycle and is unaligned, then the second access made in that cycle is the second half of the store memory operation. If either store access misses the data cache, both halves are aborted and the line that contains the miss is transferred to the data cache from main memory. If an aligned memory operation is selected as the first access and an unaligned load memory operation is selected as the second access, then in the next clock cycle the second access selected will be the second half of the unaligned load memory operation. If an aligned memory operation is selected as the first access and an unaligned store memory operation is selected as the second access, then the second access will not be made in this clock cycle.
Bank conflicts are also considered by the load/store unit in selecting memory operations to access the data cache in a given cycle. If two operations have been selected to access the data cache in a given cycle, and bits 2, 3, and 4 of their respective addresses are equal, then the second access will not be made in this cycle.
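The bank conflict test can be sketched as a comparison of address bits 2, 3, and 4. The helper names `bank_of` and `bank_conflict` are hypothetical, and treating those three bits as a bank index from 0 to 7 is an interpretation made for this illustration.

```python
BANK_MASK = 0b11100  # selects address bits 2, 3, and 4


def bank_of(address: int) -> int:
    """Return the value of address bits 4:2 (0 through 7)."""
    return (address & BANK_MASK) >> 2


def bank_conflict(first_address: int, second_address: int) -> bool:
    """True when both accesses have equal bits 2, 3, and 4.

    In that case the second access is not made in the current cycle.
    """
    return bank_of(first_address) == bank_of(second_address)
```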
In another embodiment, a load memory operation is selected to access data cache 224 in a given cycle if load memory operations prior to the load memory operation in program order have accessed data cache 224 and been found to miss. The prior memory operations remain within the buffer and therefore require no extra buffers to store them, saving silicon area.
As will be shown in FIG. 4A, each entry in the load/store buffer of load/store unit 222 contains a miss/hit bit. The miss/hit bit is used in the selection of memory operations to access the data cache, in order to implement the non-blocking function. The miss/hit bit disqualifies load memory operations that are speculative from selection for access to the data cache. In this way, a speculative load memory operation that is subsequent to a speculative load memory operation that misses the data cache may be selected to access the data cache. Therefore, load/store unit 222 implements a non-blocking scheme in which load memory operations are allowed to access the data cache in clock cycles in which speculative load memory operations that have missed the data cache exist in the load/store buffer. In one embodiment, 8 locations (starting from the bottom of the load/store buffer) are scanned for such load memory operations, allowing up to 7 speculative load misses to be stored in the load/store buffer before blocking occurs.
Another important factor in the non-blocking scheme of load/store unit 222 is that the comparators required by previous non-blocking schemes to ensure that only one request per cache line is made to the main memory system are not required. As noted above, these comparators are necessary in prior non-blocking schemes to keep a second miss to the same line as a miss already queued for access to the main memory system from accessing the memory system. Typically in these previous schemes, when a second request is made for the line currently being fetched from main memory, blocking occurs. Load/store unit 222 holds the misses in the load/store buffer. When one miss becomes non-speculative, it accesses main memory while other misses remain in the buffer. When the data associated with the address that missed is transferred into data cache 224, the miss/hit bits in the load/store buffer are reset such that the associated memory operations are no longer considered to be misses. Therefore, the associated memory operations will be selected to access data cache 224 in a subsequent clock cycle. If the memory operation is now a hit, it completes in the same manner as other speculative load memory operations that hit the data cache. If the memory operation is still a miss, the miss/hit bit is set to indicate miss, and the memory operation waits to become non-speculative. Therefore, the comparators are not necessary and multiple misses to the same cache line do not cause blocking.
In one embodiment, load memory operations are selected for removal from the load/store buffer if the operation is a data cache hit. Load memory operations are further selected for removal if the load operation has missed the data cache, is no longer speculative (as indicated by the aforementioned reorder buffer pointers), and the line containing the miss is selected to be transferred from main memory (not shown) to the data cache. Store memory operations are selected for removal from the load/store buffer if the store memory operation is non-speculative (as indicated by the aforementioned reorder buffer pointers), and the store memory operation is a data cache hit. Store memory operations are further selected for removal from the load/store buffer if the store memory operation is non-speculative (as indicated by the aforementioned reorder buffer pointers), the store memory operation is a data cache miss, and the line containing the miss is selected to be transferred from main memory to the data cache. In another embodiment, memory operations are selected for removal from the load/store buffer if a cancel signal is received from reorder buffer 216, along with a reorder buffer tag that matches the memory operation.
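The removal conditions of the first embodiment above reduce to two small predicates, sketched here. The function and parameter names are hypothetical, and the cancel-signal embodiment is not modeled.

```python
def load_removable(hit: bool, speculative: bool, line_fill_selected: bool) -> bool:
    """A load leaves the buffer on a hit, or on a non-speculative miss
    whose cache line has been selected for transfer from main memory."""
    if hit:
        return True
    return (not speculative) and line_fill_selected


def store_removable(hit: bool, speculative: bool, line_fill_selected: bool) -> bool:
    """A store leaves the buffer only when non-speculative, and then
    either on a hit or on a miss whose line fill has been selected."""
    if speculative:
        return False
    return hit or line_fill_selected
```

Note the asymmetry: a load that hits may be removed while still speculative, whereas a store is never removed until it is non-speculative.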
Turning now to FIG. 2, a block diagram of a load/store unit in accordance with the present invention is shown. Load/store unit 222 is shown to include an input port 1000 for receiving memory operation commands and information associated with those operations. In one embodiment, up to six operations may be provided in a given clock cycle. The information comprises the linear address associated with the instruction and also data, if the memory operation is a store. This information is provided at least one clock cycle after the associated memory operation command is provided. As FIG. 2 shows, load/store unit 222 comprises input control unit 1001, store pointer 1002, load pointer 1003, load/store buffer 1004, output control unit 1005, input reorder buffer pointers 1006 and 1007, and data cache ports 1008. In one embodiment, load/store buffer 1004 is configured as a linear array of storage locations.
Input control unit 1001 directs memory operations 1000 to particular storage locations within load/store buffer 1004. In one embodiment, this direction is accomplished through the use of two pointers: store pointer 1002 and load pointer 1003. Each store memory operation that is received in a given clock cycle is transferred into a storage location within load/store buffer 1004 beginning at the storage location pointed to by store pointer 1002, and increasing in storage location numbers for each subsequent store memory operation received. Store pointer 1002 is then incremented by the number of store operations received in the clock cycle. Similarly, each load memory operation that is received in a given clock cycle is transferred into a storage location within load/store buffer 1004 beginning at the storage location pointed to by load pointer 1003, and decreasing in storage location numbers for each subsequent load memory operation received. Load pointer 1003 is then decremented by the number of load operations received in the clock cycle. It is the responsibility of the decode units 208 to dispatch only as many load and store memory operations as can be stored between store pointer 1002 and load pointer 1003. The load/store unit provides communication to the decode units 208 in the form of the difference between load pointer 1003 and store pointer 1002 to aid the decode units in this function.
In one embodiment, when the load/store buffer is empty, store pointer 1002 points to the first storage location and load pointer 1003 to the last storage location in the load/store buffer 1004. The store pointer 1002 is incremented for each store memory operation received into the load/store buffer, and the load pointer 1003 is decremented for each load memory operation received into the load/store buffer. As load memory operations are removed from the load/store buffer 1004, the storage locations between load pointer 1003 and the end of the load/store buffer 1004 are copied into the storage locations below which are vacated by the removed load memory operations. The copying occurs in such a way that the remaining memory operations occupy contiguous positions at the end of load/store buffer 1004 and the remaining memory operations are still in program order. The removed load memory operations need not be contiguous in the buffer. The load pointer 1003 is then incremented by the number of load instructions removed. Similarly, as store memory operations are removed from the load/store buffer 1004, the storage locations between store pointer 1002 and the beginning of the load/store buffer 1004 are copied into the storage locations above which are vacated by the removed store memory operations. The copying occurs in such a way that the remaining memory operations occupy contiguous positions at the beginning of load/store buffer 1004 and the remaining memory operations are still in program order. The removed store memory operations need not be contiguous in the buffer. The store pointer 1002 is then decremented by the number of store memory operations removed.
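The pointer movement and compaction described above can be modeled in software as follows. This is a behavioral sketch only, not a description of the hardware. The class and method names are hypothetical, and the free-slot count of `load_ptr - store_ptr + 1` is a convention assumed here so that an empty buffer reports every location as free; the text states only that the difference between the pointers is communicated.

```python
class LoadStoreBuffer:
    """Stores fill from index 0 upward; loads fill from the last index downward."""

    def __init__(self, size=16):
        self.size = size
        self.slots = [None] * size
        self.store_ptr = 0          # next free location for a store
        self.load_ptr = size - 1    # next free location for a load

    def free_slots(self):
        # Assumed convention: locations from store_ptr through load_ptr are free.
        return self.load_ptr - self.store_ptr + 1

    def push_store(self, op):
        if self.free_slots() == 0:
            raise OverflowError("load/store buffer is full")
        self.slots[self.store_ptr] = op
        self.store_ptr += 1

    def push_load(self, op):
        if self.free_slots() == 0:
            raise OverflowError("load/store buffer is full")
        self.slots[self.load_ptr] = op
        self.load_ptr -= 1

    def remove_stores(self, indices):
        """Drop the stores at the given indices; pack the rest toward index 0."""
        kept = [self.slots[i] for i in range(self.store_ptr) if i not in indices]
        for i in range(self.store_ptr):
            self.slots[i] = kept[i] if i < len(kept) else None
        self.store_ptr = len(kept)

    def remove_loads(self, indices):
        """Drop the loads at the given indices; pack the rest toward the end."""
        first = self.load_ptr + 1
        kept = [self.slots[i] for i in range(first, self.size) if i not in indices]
        new_first = self.size - len(kept)
        for i in range(first, self.size):
            self.slots[i] = kept[i - new_first] if i >= new_first else None
        self.load_ptr = new_first - 1
```

Because the surviving entries are copied in their original relative order, each region stays contiguous and in program order after a removal, and the removed entries need not be adjacent.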
Output control unit 1005 selects memory operations stored in load/store buffer 1004 for access to the data cache 224. In one embodiment, output control unit 1005 selects up to two memory operations for the aforementioned access. The output control unit 1005 implements the priority scheme described above for selecting the memory operations. Reorder buffer pointers 1006 and 1007 are used to indicate which memory operations are no longer speculative, as described above.
Turning next to FIG. 3, an embodiment of load/store buffer 1004 is shown in more detail. Shaded area 1010 depicts storage locations that are holding store memory operations. Shaded area 1011 depicts storage locations that are holding load memory operations. In this embodiment, the storage locations are configured as a linear array of locations. A linear array of locations is an organization of locations wherein each location can be located within the array utilizing a single number. Store memory operations are transferred into the buffer from one end, while load memory operations are transferred into the buffer from the opposite end. In this way, the properties of storing load memory operations and store memory operations in separate queuing structures are maintained. However, this embodiment advantageously makes use of a single set of storage locations to provide both load and store queuing locations. Hardware, and hence silicon area, are saved as compared to a performance-equivalent number of separate load and store buffers. For example, this embodiment contains 16 storage locations. At any given time up to 16 store memory operations, or alternatively 16 load memory operations, could be stored in the load/store buffer 1004. A performance-equivalent number of separate load and store buffers would therefore require 16 load buffers and 16 store buffers. Each of these buffers would be required to contain the same information that the load/store buffer 1004 contains. Therefore, the separate load and store buffer solution commonly used in superscalar microprocessors consumes considerably more silicon area than load/store buffer 1004.
Also shown in FIG. 3 are load pointer 1003 and store pointer 1002. Because this embodiment contains 16 storage locations, load pointer 1003 and store pointer 1002 are depicted as four bit pointers. In other embodiments, the number of storage locations may vary and therefore the number of bits that load pointer 1003 and store pointer 1002 require may vary as well. Other embodiments may also be configured with load/store buffer 1004 as some other organization than a linear array. For example, a two dimensional array might be used, in which a storage location is identified by a pointer consisting of two numbers: a row and a column number. It is understood that there are other possible ways to configure load/store buffer 1004. In one embodiment, store pointer 1002 is not allowed to become equal to or greater than load pointer 1003. In this way, load memory operations and store memory operations are stored in storage locations distinct from each other in any given clock cycle.
Also shown in FIG. 3 is an output LSCNT[2:0] 1012. This output is the difference between load pointer 1003 and store pointer 1002, and in one embodiment indicates how many memory operations may be transferred to the load/store unit 222. Units that transfer memory operations to the load/store unit 222 use this information in their algorithms to limit the number of memory operations transferred in a given clock cycle.
Turning now to FIG. 4A, a diagram of the storage locations within load/store buffer 1004 is shown. The storage locations are divided into three fields. In one embodiment, the first field consists of 6 bits. One bit is a valid bit, indicating when set that the storage location contains a memory operation and indicating when not set that the storage location does not contain a memory operation. The remaining five bits of the first field comprise a tag which indicates which entry in the reorder buffer 216 the memory operation is associated with.
The second field 1021 consists of 66 bits. The first 32 bits of the field are the address that the memory operation is to manipulate. The next bit is an address valid bit, indicating when set that the aforementioned address has been provided and indicating when not set that the aforementioned address has not been provided. The next 32 bits in field 1021 are the data associated with the memory operation. For stores, these bits contain the data that is to be stored at the aforementioned address. For data that is less than 32 bits wide, the data is stored in field 1021 in a right-justified manner. The final bit in field 1021 is a data valid bit, indicating when set that the aforementioned data has been provided and indicating when not set that the aforementioned data field has not been provided.
The third field 1022 of the storage locations contains other important information for each memory operation. In one embodiment, the following information is saved:
the size of the data to be manipulated measured in bytes;
the miss/hit state of the memory operation in data cache 224, wherein this bit being set indicates a miss and this bit not being set indicates that the operation has not accessed the data cache;
the dependent bit, wherein this bit being set indicates that a load memory operation is dependent on a store memory operation stored in another storage location of the load/store buffer and this bit not being set indicates that no such dependency exists;
the entry number of the storage location containing the aforementioned dependency, wherein this field contains random information if the aforementioned dependent bit is not set.
Other embodiments store additional miscellaneous information in field 1022.
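The three fields of a storage location can be summarized as a record. This is a minimal sketch with hypothetical field names. The widths of the first two fields come from the description above; the widths of the third field's members are not given in the text and are not modeled. Masking a wider value down to the operand size in `set_store_data` is an assumption used to illustrate right-justified storage.

```python
from dataclasses import dataclass

TAG_BITS = 5
ADDR_BITS = 32
DATA_BITS = 32

FIELD1_BITS = 1 + TAG_BITS                    # valid bit + reorder buffer tag
FIELD2_BITS = ADDR_BITS + 1 + DATA_BITS + 1   # address, address valid, data, data valid


@dataclass
class BufferEntry:
    # First field (1020)
    valid: bool = False
    rob_tag: int = 0
    # Second field (1021)
    address: int = 0
    address_valid: bool = False
    data: int = 0
    data_valid: bool = False
    # Third field (1022)
    size_bytes: int = 4
    miss: bool = False
    dependent: bool = False
    dep_entry: int = 0        # meaningful only when dependent is True

    def set_store_data(self, value: int) -> None:
        """Store data right-justified: it occupies the low-order bits."""
        mask = (1 << (8 * self.size_bytes)) - 1
        self.data = value & mask
        self.data_valid = True
```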
Turning now to FIG. 4B, a timing diagram showing typical operation of one embodiment of the load/store unit is shown. Three complete clock cycles are shown, labeled ICLK4, ICLK5, and ICLK6. In ICLK4, load and/or store memory operations are received as indicated by arrow 1030. The load pointer is decremented by the number of load memory operations received in clock cycle ICLK4 at arrow 1031. The number of load memory operations received in a given clock cycle can be zero or more. The store pointer is also incremented by the number of store memory operations received in ICLK4 at arrow 1031. As with the load operations above, the number of store operations received in a given clock cycle can be zero or more. At arrow 1032, the load/store unit has calculated a new value for LSCNT 1012, which is the difference between the decremented value of load pointer 1003 and the incremented value of store pointer 1002.
At the beginning of ICLK5, as indicated by arrow 1033, the load/store unit 222 examines the tags of memory operations currently residing in the load/store buffer 1004, and begins the process of selecting operations to access the data cache for this cycle. The fixed priority scheme as described above is used as the selection criterion. At arrow 1034, tags for memory operations that are being provided with addresses and/or data are transferred to the load/store unit from functional units 212. This information is used in the selection process at arrow 1035. At arrow 1036, the selection process is complete and up to two accesses for the data cache have been selected. At arrow 1037, the address and data that were indicated as being transferred in this clock cycle (at arrow 1034) are provided by the functional units 212. The address and data are transferred into the storage locations within the load/store buffer at arrow 1038.
In clock cycle ICLK6, the data cache 224 is accessed. Also in this clock cycle, if one or both of the memory operations accessing the cache is a load memory operation, the tags of the load memory operations (first field 1020 of FIG. 4A) are compared to the tags of any stores that are currently stored in the load/store buffer. Simultaneously, the addresses of the load memory operation and any stores that are currently stored in the load/store buffer are compared. If the load memory operation is found to be after the store operation in program order via the aforementioned tag compare and the address of the load is found to completely overlap the address of the store via the aforementioned address compare, then the data that the load memory operation is attempting to retrieve is actually the data in the store memory operation's storage location. This data is provided from the data portion of the store memory operation's storage location. In this context, "completely overlap" means that all of the bytes that the load memory operation is retrieving are contained within the bytes that the store memory operation is updating. Also, "partially overlap" means that some, but not all, of the bytes that the load memory operation is retrieving are contained within the bytes that the store memory operation is updating. If the aforementioned data has not been provided to the load/store unit, or the address of the load memory operation partially overlaps the store memory operation, then the load memory operation does not retrieve its data in this cycle. Instead, it remains in the load/store buffer until the store memory operation is performed. If a store memory operation in the buffer is before a load memory operation in program order but the store memory operation does not yet contain a valid address for comparison, the load memory operation is treated as if the store address partially overlaps the load memory operation.
If the load memory operation is found to be before any store memory operations that might be in the load/store buffer, or if the load memory operation's address does not match any of the store memory operations' addresses, then the data for the load memory operation is provided from the data cache. If the load memory operation is a data cache miss, and the conditions mentioned in the previous sentence are met, no data is provided for the load memory operation in this clock cycle.
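The dependency check described above can be sketched as a byte-range comparison. The names `PendingStore`, `classify_overlap`, and `resolve_load` are hypothetical. The sketch assumes the caller has already used the tag comparison to pick out the stores that precede the load in program order and passes them oldest first; examining the youngest such store first is also an assumption, since the text does not say how several matching stores are handled. Extraction of the forwarded bytes assumes little-endian byte order, as on the x86.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class PendingStore:
    addr: Optional[int]   # None until the address has been provided
    size: int             # bytes
    data: Optional[int]   # None until the data has been provided


def classify_overlap(load_addr: int, load_size: int,
                     store_addr: int, store_size: int) -> str:
    """Return 'complete', 'partial', or 'none' for the load versus the store."""
    load_bytes = set(range(load_addr, load_addr + load_size))
    store_bytes = set(range(store_addr, store_addr + store_size))
    common = load_bytes & store_bytes
    if not common:
        return "none"
    return "complete" if common == load_bytes else "partial"


def resolve_load(load_addr: int, load_size: int,
                 older_stores: List[PendingStore]) -> Tuple[str, Optional[int]]:
    """Decide whether a load forwards from a store, waits, or uses the cache."""
    for store in reversed(older_stores):       # youngest older store first
        if store.addr is None:
            return ("wait", None)              # treated as a partial overlap
        kind = classify_overlap(load_addr, load_size, store.addr, store.size)
        if kind == "none":
            continue
        if kind == "partial" or store.data is None:
            return ("wait", None)              # load stays in the buffer
        shift = 8 * (load_addr - store.addr)
        mask = (1 << (8 * load_size)) - 1
        return ("forward", (store.data >> shift) & mask)
    return ("cache", None)                     # data comes from the data cache
```

A result of "wait" corresponds to the load remaining in the load/store buffer, and "cache" to the case in which the data cache supplies the data, subject to the hit or miss outcome of the access.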
At arrow 1039, the result of the operation is driven to the reorder buffer 216. At arrow 1040, the miss bit and the dependent bit in field 1022 (as shown in FIG. 4A) of the memory operations accessing the data cache in this clock cycle are updated with the miss/hit state of the access and any dependency on stores in the load/store buffer that was detected.
Turning now to FIG. 4C, exemplary hardware implementing the aforementioned memory operation dependency checking is shown. The arrows 1050 and 1051 indicate the addresses of the two memory operations selected to access the data cache in this clock cycle. The addresses are conveyed on a pair of signal lines labeled LSLINAD0[31:2] and LSLINAD1[31:2] for the first and second accesses, respectively. These addresses are compared to the addresses stored in each of the storage locations within load/store buffer 1004 using comparators 1052. Whether or not the addresses overlap is indicated at the output of the comparators. This information is input to control units 1053 and 1054, which also perform the tag comparisons mentioned above. If a tag comparison shows that the memory operation is after the operation residing in the load/store buffer and the address comparison shows complete overlap, then the store data is forwarded as outputs 1055 and 1056, respectively. This data is then used as the result of the load memory operation. If the load memory operation depends on a store memory operation but that store memory operation's data has not been provided, then the load memory operation remains in the load/store buffer 1004 until the store memory operation's data is provided.
Turning now to FIG. 4D, a diagram of the load/store unit 222 is shown. The load/store unit 222 is divided into several partitions. LSCTL 1060 is the control block. This block contains the logic gates necessary to control the load/store buffer 1004, as well as other portions of the load/store unit. LDSTSTAT 1061 contains the status information for each of the storage locations in the load/store buffer. That is, LDSTSTAT 1061 contains the information of field 1022 of FIG. 4A. LDSTTAGS 1062 contains the information of field 1020 of FIG. 4A for each storage location of load/store buffer 1004. LDSTADR 1063 contains the address portion of field 1021 of FIG. 4A for each storage location of load/store buffer 1004. LDSTDAT 1064 contains the data portion of field 1021 of FIG. 4A for each storage location of load/store buffer 1004. Finally, LSSPREG 1065 contains segment registers, which are further described below.
FIG. 4D also shows inputs 1000 of FIG. 2, herein shown as the signals used in one embodiment. RTAGnB 1066 is a set of signals providing the tag that identifies the position of the memory operation within the reorder buffer 216. ITYPEnB 1067 identifies the memory operation as either a load, a store, or a load-op-store operation. RESLAnB 1072 provides the address for memory operations, and RESnB 1073 provides the data for store memory operations.
FIG. 4D also shows outputs of the load/store unit 222. LSRES0/XLSRES0 1068 is the data output for the first access to the data cache 224. The two sets of signals are provided as differential inputs to the reorder buffer. Similarly, LSRES1/XLSRES1 1069 is the data output for the second access to the data cache 224. Also, LSLINAD0 1070 and LSLINAD1 1071 are the addresses for the first and second data cache accesses, respectively.
Turning next to FIGS. 5-68, details regarding various aspects of another embodiment of a superscalar microprocessor are considered. FIG. 5 is a block diagram of a processor 500 including an instruction cache 502 coupled to a prefetch/predecode unit 504, to a branch prediction unit 506, and to an instruction alignment unit 508. A set 510 of decode units is further coupled to instruction alignment unit 508, and a set 512 of reservation station/functional units is coupled to a load/store unit 514 and to a reorder buffer 516. A register file unit 518 and a stack cache 520 are finally shown coupled to reorder buffer 516, and a data cache 522 is shown coupled to load/store unit 514.
Processor 500 limits the addressing mechanism used in the x86 to achieve both a regular, simple form of addressing and high clock frequency execution. It also targets 32-bit O/S and applications. Specifically, 32-bit flat addressing is employed where all the segment registers are mapped to all 4GB of physical memory, their starting address being 0000-0000 hex and their limit address being FFFF hex. The setting of this condition will be detected within processor 500 as one of the conditions to allow the collection of accelerated datapaths and instructions to be enabled. The absence of this condition of 32-bit flat addressing will cause a serialization condition on instruction issue and a trapping to MROM space.
Another method to ensure that a relatively high clock frequency may be accommodated is to limit the number of memory address calculation schemes to those that are simple to decode and can be decoded within a few bytes. We are also interested in supporting addressing that fits into our other goals, i.e., stack relative addressing and regular instruction decoding.
As a result, the x86 instruction types that are supported for load/store operations are:
______________________________________
push   [implied ESP - 4]
pop    [implied ESP + 4]
call   [implied ESP + 8]
ret    [implied ESP - 8]
load   [base + 8-bit displacement]
store  [base + 8-bit displacement]
oper.  [EBP + 8-bit displacement]
oper.  [EAX + 8-bit displacement]
______________________________________
The block diagram of FIG. 6 shows the pipeline for calculating addressing within processor 500. It is noted that base +8/32 bit displacement takes 1 cycle, where using an index register takes 1 more cycle of delay in calculating the address. More complicated addressing than these requires invoking an MROM routine to execute.
A complete listing of the instruction sub-set supported by processor 500 as fast path instructions is provided below. All other x86 instructions will be executed as micro-ROM sequences of fast path instructions or extensions to fast path instructions.
The standard x86 instruction set is very limited in the number of registers it provides. Most RISC processors have 32 or greater general purpose registers, and many important variables can be held during and across procedures or processes during normal execution of routines. Because there are so few registers in the x86 architecture and most are not general purpose, a large percentage of operations are moves to and from memory. RISC architectures also incorporate 3 operand addressing to prevent moves from occurring of register values that are desired to be saved instead of overwritten.
The x86 instruction set uses a set of registers that can trace its history back to the 8080. Consequently there are few registers, many side effects, and sub-registers within registers. This is because when moving to 16-bit, or 32-bit operands, mode bits were added and the lengths of the registers were extended instead of expanding the size of the register file. Modern compiler technology can make use of large register sets and have a much smaller percentage of loads and stores. The effect of these same compilers is to have a much larger percentage of loads and stores when compiling to the x86. The actual x86 registers are often relegated to temporary registers for a few clock cycles while the real operation destinations are in memory.
FIG. 7 shows a programmer's view of the x86 register file. One notes from this organization that there are only 8 registers, and few are general purpose. The first four registers, EAX, EDX, ECX, and EBX, have operand sizes of 8, 16, or 32-bits depending on the mode of the processor or instruction. The final 4 registers were added with the 8086 and extended with the 386. Because there are so few real registers, they tend to act as holding positions for the passing of variables to and from memory.
The important thing to note is that when executing x86 instructions, one must be able to efficiently handle 8, 16, and 32-bit operands. If one is trying to execute multiple x86 instructions in parallel, it is not enough to simply multi-port the register file. This is because there are too few registers and all important program variables must be held in memory on the stack or in a fixed location.
Processor 500 achieves the effect of a large register file by multi-porting stack relative operations on the x86. Specifically, ESP or EBP relative accesses are detected, and upon a load or store to these regions a 32 byte data cache line is moved into an on-chip multi-port structure.
This structure is called a stack relative cache or stack cache (see FIG. 5). It contains a number of 32 byte cache lines that are multi-ported such that every issue position can simultaneously process a load or store. The accesses allowed are 8/16/32 bit accesses. 16 and 32-bit accesses are assumed to be aligned to natural boundaries. If this is not true, the access will take 2 consecutive cycles. The final optimization is that this structure for reads is contained in an early decode stage, the same stage that normal register file access is contained. Memory locations are also renamed so that speculative writes to the stack can be forwarded directly to subsequent operations.
The stack cache has two ports for each issue position. One port is for a load, and one port is for a store. Up to 8 cache lines, or 64 32-bit registers can be cached. Each 32-bit register can have 6 concurrent accesses. These cache lines are not contiguous, and the replacement algorithm for each cache line is LRU based. Unaligned accesses are handled as consecutive sequences of 2 reads and/or 2 writes, stalling issue from that position until completion. The resulting two read accesses or write accesses are merged to form the final 16 or 32-bit access.
Thus an operation such as ADD EAX, [EBP+d8]=[EBP+d8] is encoded as one issue position. The load and store operations occur to the stack relative cache and not to the data cache. Up to 6 of these operations can issue in one clock cycle, and up to 6 operations can retire in one cycle. Also, operations such as push that imply a store operation and an ESP relative decrement are directly executed, and several of these operations are allowed to occur in parallel.
FIG. 8 is a block diagram which shows the speculative hardware for the stack relative cache 520. Part of the first two pipeline stages decodes the accelerated subset and calculates the base pointer or stack pointer relative calculations to form the linear address before reaching the pipeline stage that accesses the stack relative register file and the line oriented reorder buffer. This will be discussed in greater detail below.
RISC designs employ regular instruction decoding along natural boundaries to achieve very high clock frequencies and also with a small number of pipeline stages even for very wide issue processors. This is possible because finding a large number of instructions and their opcodes is relatively straightforward, since they are always at fixed boundaries.
As stated previously, this is much more difficult in an x86 processor where there are variable byte instruction formats, as well as prefix bytes and SIB bytes that can affect the length and addressing/data types of the original opcode.
Processor 500 employs hardware to detect and send simple instructions to fixed issue positions, where the range of bytes that a particular issue position can use is limited. This may be compensated for by adding many issue positions that each instruction cache line can assume in parallel.
Once the instructions are aligned to a particular issue position, the net amount of hardware required to decode common instructions is not significantly greater than that of a RISC processor, allowing equivalent clock frequencies to be achieved. Processor 500 achieves high frequency, wide issue, and limited pipeline depth by limiting the instructions executed at high frequency to a sub-set of the x86 instructions under the conditions of 32-bit flat addressing.
Supporting a load/store memory architecture is possible within the constraints of the x86 instruction set if one redefines the meaning of register and memory. The reason for this redefinition is that the x86 needs more than 8 registers for optimal performance. High performance RISC architectures use their large multi-ported register files to hold commonly referenced variables or constants. Thus, the inherently slower memory accesses can be limited to load and store operations, and the RISC can concentrate on building very wide issue hardware that executes directly on register/register operations.
As previously noted, many of the advantages of a large RISC register file can be achieved by multi-porting stack relative memory references, and keeping these structures in a multi-ported RAM array that can be read and written in the same pipeline stages as a register file on a RISC. There is also an advantage if these accesses are aligned to natural 16/32-bit boundaries, which is similarly a benefit to all existing x86 processors.
All operations that use this stack addressing subset can be treated as register like instructions that can be speculatively executed identical to the normal x86 registers. The remaining memory accesses may then be treated as being load/store operations by supporting these through access to a conventional data cache, but where the data cache is pipelined and performs accesses at accelerated clock frequencies.
Hardware detects and forwards memory calculations that hit in the current entries in the stack relative cache since it is possible for addressing modes outside of stack relative accesses to indirectly point to this same region of memory, and the stack cache is treated as modified memory. Because memory operations are a part of most x86 instructions, load/op/store operations may be converted to single issue operations. Processor 500 does this by allowing a single issue to contain as many as three distinct operations. If memory load and store operations outside of the stack relative cache are detected in decode, the pending operation is held in a reservation station, and the load access and addressing calculation are sent to the multi-ported data cache. Upon completion of the load operation the reservation station is allowed to issue to the functional unit. Upon completion of execution, the result is either an x86 register or a pending store.
In either case the result is returned as completed to the entry in the reorder buffer. If the result is a store, the store is held in speculative state in front of the data cache in a store buffer, from which point it can be speculatively forwarded. The reorder buffer then can either cancel this store or allow it to write back to the data cache when the line is retired.
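By way of illustration only, the store buffer behavior described above (hold speculatively, forward to later loads, then either cancel or write back at retirement) may be sketched as follows. The class names, the `rob_line` tag, and the dictionary standing in for the data cache are hypothetical and are not taken from the disclosed hardware.

```python
from dataclasses import dataclass, field


@dataclass
class PendingStore:
    rob_line: int   # hypothetical tag: reorder buffer line that owns the store
    address: int
    value: int


@dataclass
class StoreBuffer:
    # address -> value; a plain dict stands in for the data cache (assumption)
    data_cache: dict = field(default_factory=dict)
    # speculative stores held in front of the data cache, oldest first
    pending: list = field(default_factory=list)

    def store(self, rob_line, address, value):
        """Hold a completed store in speculative state."""
        self.pending.append(PendingStore(rob_line, address, value))

    def load(self, address):
        """Forward from the youngest matching speculative store, else read the cache."""
        for entry in reversed(self.pending):
            if entry.address == address:
                return entry.value
        return self.data_cache.get(address)

    def retire(self, rob_line):
        """Write back the stores of a retired line, in program order."""
        remaining = []
        for entry in self.pending:
            if entry.rob_line == rob_line:
                self.data_cache[entry.address] = entry.value
            else:
                remaining.append(entry)
        self.pending = remaining

    def cancel(self, rob_line):
        """Discard the stores of a cancelled line without touching the cache."""
        self.pending = [e for e in self.pending if e.rob_line != rob_line]
```

In this sketch a load issued after a speculative store to the same address sees the stored value immediately, while the data cache itself changes only when `retire` is called for the owning line.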
All accesses to the stack relative cache can be renamed and forwarded to subsequent operations, identical to registers. This also includes references that are made as indirect non-stack relative accesses that store to the stack relative cache.
FIG. 9 is a block diagram which illustrates portions of an exemplary embodiment of processor 500 in greater detail. This structure is assumed to be capable of reading two data elements and writing two data elements per clock cycle at the accelerated clock frequency. Note that a mechanism must be maintained to allow the load and store operations to execute and forward speculatively while maintaining true program order.
The following set of instructions probably comprise 90% of the dynamically executed code for 32-bit applications:
8/32-bit operations
move reg/reg reg/mem
arithmetic operations reg/mem reg/reg
logical operations reg/reg reg/mem
push
pop
call/return
load effective address
jump cc
jump unconditional
16-bit operations
prefix/move reg/reg
prefix/move reg/mem
prefix/arithmetic operations reg/reg, reg/mem
prefix/logical operations reg/reg reg/mem
prefix/push
prefix/pop
When executing 32-bit code under flat addressing, these instructions almost always fall within 1-8 bytes in length, which is in the same rough range of the aligned, accelerated fast path instructions.
FIG. 10 is a block representation of the alignment and decode structure of processor 500. This structure uses the instruction pre-decode information contained within each cache line to determine where the start and end positions are, as well as if a given instruction is an accelerated instruction or not.
Accelerated instructions are defined as fast-path instructions between 1 and 8 bytes in length. It is noted that the predecoded start/end positions may reflect multiple x86 instructions; for instance, 2 or 3 pushes that are predecoded in a row may be treated as one accelerated instruction that consumes 3 bytes.
When a cache line is fetched from the instruction cache, it moves into an instruction alignment unit which looks for start bytes within narrow ranges. The instruction alignment unit uses the positions of the start bytes of the instructions to dispatch the instructions to six issue positions. Instructions are dispatched such that each issue position accepts the first valid start byte within its range along with the next three bytes.
Four bytes is the maximum number of bytes which can include the prefix and opcode bytes of an instruction. A multiplexer in each decoder looks for the end byte associated with each start byte, where an end byte can be no more than seven bytes away from a start byte. The mechanism to scan for a constant value in an instruction over four bytes in length is given an extra pipeline stage due to the amount of time potentially required.
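As a non-limiting sketch, the pairing of each predecoded start byte with its end byte, subject to the seven-byte limit stated above, may be modeled as shown below. Representing the predecode information as two bit lists per cache line, and the function name `pair_start_end`, are assumptions made for illustration; the actual predecode encoding is not modeled.

```python
MAX_END_DISTANCE = 7  # an end byte is at most seven bytes past its start byte


def pair_start_end(start_bits, end_bits):
    """Pair each start byte in a line with the first end byte within range.

    Returns a list of (start, end, in_range) tuples. `end` is None when no
    end byte is found within seven bytes inside this line, meaning the
    instruction is too long to be accelerated or continues into the next line.
    """
    pairs = []
    last_byte = len(end_bits) - 1
    for start, is_start in enumerate(start_bits):
        if not is_start:
            continue
        end = None
        limit = min(start + MAX_END_DISTANCE, last_byte)
        for pos in range(start, limit + 1):
            if end_bits[pos]:
                end = pos
                break
        pairs.append((start, end, end is not None))
    return pairs


# A 16-byte line holding a 2-byte, a 3-byte, and a 10-byte instruction.
starts = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
ends   = [0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0]
print(pair_start_end(starts, ends))
# [(0, 1, True), (2, 4, True), (5, None, False)]
```

The third instruction in the usage example fails the range check because its end byte lies nine bytes past its start byte, so under this model it would not qualify as an accelerated instruction.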
Note that instructions included in the subset of accelerated instructions, and which are over four bytes in length, always have a constant as the last 1/2/4 bytes. This constant is usually not needed until the instruction is issued to a functional unit, and therefore the determination of the constant value can be delayed in the pipeline. The exception is an instruction requiring an eight-bit displacement for an address calculation. The eight-bit displacement for stack-relative operations is always the third byte after the start byte, so this field will always be located within the same decoder as the rest of the instruction.
It is possible that a given cache line can have more instructions to issue than can be accommodated by the six entry positions contained in each line of the line-oriented reorder buffer. If this occurs, the line-oriented reorder buffer allocates a second line in the buffer as the remaining instructions are dispatched. Typically, in 32-bit application and O/S code, the average instruction length is about three bytes. The opcode is almost always the first two bytes, with the third byte being a SIB byte specifying a memory address (if included), and the fourth byte being a 16-bit data prefix.
The assumption in the processor 500 alignment hardware is that if the average instruction length is three, then six dedicated issue positions and decoders assigned limited byte ranges should accommodate most instructions found within 16-byte instruction cache lines. If very dense decoding occurs (i.e., lots of one and two byte instructions), several lines are allocated in the line-oriented reorder buffer for the results of instructions contained in a few lines of the instruction cache. The fact that these more compact instructions are still issued in parallel and at a high clock frequency more than compensates for having some decoder positions potentially idle.
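The sizing argument above can be checked with simple arithmetic. The following sketch uses the 16-byte line size and six issue positions given in the text; the function names are hypothetical, and `min_rob_lines` gives only a lower bound, since the byte range limit of each issue position is not modeled here.

```python
import math

LINE_BYTES = 16       # instruction cache line size, from the text
ISSUE_POSITIONS = 6   # entry positions per line of the line-oriented reorder buffer


def instructions_per_line(avg_length_bytes):
    """Expected number of instructions in one cache line for a given average length."""
    return LINE_BYTES / avg_length_bytes


def min_rob_lines(instruction_count):
    """Lower bound on reorder buffer lines needed for one cache line's instructions."""
    return math.ceil(instruction_count / ISSUE_POSITIONS)


print(round(instructions_per_line(3), 2))  # 5.33, which fits in six issue positions
print(min_rob_lines(8))                    # 2, e.g. eight two-byte instructions
print(min_rob_lines(16))                   # 3, e.g. sixteen one-byte instructions
```

With an average length of three bytes, a line holds between five and six instructions on average, which is the basis for choosing six issue positions; denser code spills into additional reorder buffer lines.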
As an example, take the case of 8 two-byte instructions continually encoded within a cache line. This instruction sequence would have start bytes at positions:
0
2
4
6
8
10
12
14
FIG. 11 shows the cycle during which each instruction would be decoded and issued, and to which issue positions each instruction would be dispatched. Note that the instruction alignment unit uses no other advanced knowledge except the locations of the start bytes of each instruction. Entry positions in the line-oriented reorder buffer which correspond to issue positions which are not used during a given cycle are invalidated, and a new line is allocated in the line-oriented reorder buffer each cycle. This allows us to decode and align instructions at high speed without specifically knowing whether a given issue position is allocated an instruction in a given cycle.
A worst-case scenario might be a sequence of one-byte instructions (e.g., inc, push, inc, push, etc.). FIG. 12 shows the cycle during which each instruction would be decoded and issued, and to which issue positions each instruction would be dispatched. While the performance isn't spectacular, sequences of one-byte instructions are probably rarely encountered in code. The important point is that the mechanism does not break. Code typically contains two-byte, three-byte, and four-byte instructions mixed with one-byte instructions. With this mix, the majority of issue positions are allocated instructions. Long six-byte instructions are also rare, but if encountered, they are also directly executed.
FIG. 13 shows an example instruction sequence based on exemplary 32-bit application code. FIG. 14 shows the cycle during which each instruction would be decoded and issued, and to which issue positions each instruction would be dispatched. In this example, all branches are assumed not taken. Focusing on cycles 1-6 of FIG. 14, 26 x86 instructions are decoded/issued in six clock cycles. This reduces to 4.33 raw x86 instructions per clock cycle with this alignment technique.
FIG. 15 illustrates processor 500 pipeline execution cycles with a branch misprediction detected during cycle 6 and the resulting recovery operation. FIG. 16 similarly illustrates the processor 500 pipeline execution cycles for the equivalent seven stages assuming successful branch prediction and all required instructions and data present in the respective caches.
Description of Instruction Cache and Fetching Mechanism
Next the instruction cache organization, fetching mechanism, and pre-decode information will be discussed. As shown in FIGS. 17-20, the instruction cache (Icache) 502 of processor 500 includes blocks ICSTORE, ICTAGV, ICNXTBLK, ICCNTL, ICALIGN, ICFPC, and ICPRED. The instruction cache contains 32K bytes of storage, is 8-way set associative, and is linearly addressed. The Icache is allowed more than one clock cycle to read and align the instructions to the decode units. The address is calculated in the first half of ICLK, and the data, tag, pre-decode, and prediction information are read in by the end of ICLK. In the next cycle, the data are multiplexed based on the tag comparison, and the instructions are aligned and sent to the decode units. The alignment multiplexing is accomplished as the tags are compared. The decode units can start decoding in the second half of this clock. The Icache includes a way-prediction which can be done in a single clock using the ICNXTBLK target. The branch prediction includes bimodal and global branch prediction, which takes two clock cycles.
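For illustration, the geometry implied by a 32K byte, 8-way set associative cache may be worked out as follows. The 16-byte line size is an assumption carried over from the alignment discussion rather than stated in this paragraph, and the function names are hypothetical.

```python
CACHE_BYTES = 32 * 1024  # 32K bytes of storage, from the text
WAYS = 8                 # 8-way set associative, from the text
LINE_BYTES = 16          # ASSUMPTION: 16-byte lines, per the alignment discussion


def cache_geometry(cache_bytes=CACHE_BYTES, ways=WAYS, line_bytes=LINE_BYTES):
    """Return (sets, offset_bits, index_bits); sizes are assumed to be powers of two."""
    sets = cache_bytes // (ways * line_bytes)
    offset_bits = line_bytes.bit_length() - 1
    index_bits = sets.bit_length() - 1
    return sets, offset_bits, index_bits


def split_address(linear_address):
    """Split a linear address into (tag, set index, byte offset)."""
    sets, offset_bits, index_bits = cache_geometry()
    offset = linear_address & (LINE_BYTES - 1)
    index = (linear_address >> offset_bits) & (sets - 1)
    tag = linear_address >> (offset_bits + index_bits)
    return tag, index, offset


print(cache_geometry())                             # (256, 4, 8)
print([hex(f) for f in split_address(0x12345678)])  # ['0x12345', '0x67', '0x8']
```

Under these assumptions the set index and byte offset together occupy 12 bits, which would be consistent with a 12-bit index and byte-pointer field such as BPC(11:0), although the text does not state this derivation.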
TABLE 6
______________________________________
Signal list.
______________________________________
IRESET - Global signal used to reset ICACHE block. Clears all state machines to Idle/Reset.
IDECJAMIC - Global signal from the LOROB. Used to indicate that an interrupt or trap is being taken. Effect on Icache is to clear all pre-fetch or access in progress, and set all state machines to Idle/Reset.
SUPERV - Input from LSSEC indicates the supervisor mode or user mode of the current accessed instruction.
TR12DIC - Input from SRB indicates that all un-cached instructions must be fetched from the external memory.
SRBINVILV - Input from SRB to invalidate the Icache by clearing all valid bits.
INSRDY - Input from BIU indicates the valid external fetched instruction is on the INSB(63:0) bus.
INSFLT - Input from BIU indicates the valid but faulted external fetched instruction is on the INSB(63:0) bus.
INSB(63:0) - Input from external buses for fetched instruction to the Icache.
REMAP - Input from L2 indicates the instruction is in the Icache with different mapping. The L2 provides the way associative and new supervisor bit. The LV will be set in this case.
PFREPLCOL(2:0) - Input from L2 indicates the way associative for writing of the ICTAGV.
UPDFPC - Input from LOROB indicates that a new Fetch PC has been detected. This signal accompanies the FPC for the Icache to begin accessing the cache arrays.
TARGET(31:0) - Input from LOROB as the new PC for branch correction path.
BRNMISP - Input from the Branch execution of the FU indicates a branch mis-prediction. The Icache changes its state machine to access a new PC and clears all pending instructions.
BRNTAKEN - Input from the LOROB indicates the status of the mis-prediction. This signal must be gated with UPDFPC.
BRNFIRST - Input from the LOROB indicates the first or second target in the ICNXTBLK for updating the branch prediction.
BRNCOL(3:0) - Input from the LOROB indicates the instruction byte for updating the branch prediction in the ICNXTBLK.
FPCTYP - Input for the LOROB indicates the type of address that is being passed to the Icache.
BPC(11:0) - Input from the LOROB indicates the PC index and byte-pointer of the branch instruction which has been mis-predicted for updating the ICNXTBLK.
MVTOSRIAD - Input from SRB, indicates a move to IAD special register, Icache needs to check its pointer against the pointer driven on IAD.
MVFRSRIAD - Input from SRB, indicates a move from IAD special register, Icache needs to check its pointer against the pointer driven on IAD.
MVTOARIAD - Input from SRB, indicates a move to IAD special register array, Icache needs to check its pointer against the pointer driven on IAD.
MVFRARIAD - Input from SRB, indicates a move from IAD special register array, Icache needs to check its pointer against the pointer driven on IAD.
RTQPPTR(2:0) - Input from decode indicates the current top-of-the-stack pointer for the return stack. This information should be kept in the global shift register in case of mis-predicted branch.
RETPC(31:0) - Input from decode indicates the PC address from the top of the return stack for fast way prediction.
INVBYTE(3:0) - Input from Idecode to ICPRED indicates the starting byte position of the confused instruction for pre-decoding.
INVPRED - Input from Idecode to ICPRED indicates pre-decoding for the confused instruction.
INVPOLD - Input from Idecode indicates pre-decoding for the previous line of instruction. The ICFPC should start with the previous line.
REFRESH2 - Input from Idecode indicates current line of instructions will be refreshed and not accept new instructions from Icache.
MROMEN - Input from MROM indicates the micro-instructions are sent to Idecode instead of the Icache.
RETPTR(2:0) - Output indicates the old pointer of the return stack from the mis-predicted branch instruction. The return stack should use this pointer to restore the top-of-the-stack pointer.
ICPC(31:0) - Output from Idecode indicates the current line PC to pass along with the instruction to the LOROB.
ICPOS0(3:0) - ICLK7 Output to decode unit 0 indicates the PC's byte position of the instruction.
ICPOS1(3:0) - ICLK7 Output to decode unit 1 indicates the P