United States Patent
5887146
Baxter, et al.
March 23, 1999
Title
Symmetric multiprocessing computer with non-uniform memory access architecture
Abstract
A very fast, memory efficient, highly expandable, highly efficient CCNUMA processing system based on a hardware architecture that minimizes system bus contention, maximizes processing forward progress by maintaining strong ordering and avoiding retries, and implements a full-map directory structure cache coherency protocol. A Cache Coherent Non-Uniform Memory Access (CCNUMA) architecture is implemented in a system comprising a plurality of integrated modules each consisting of a motherboard and two daughterboards. The daughterboards, which plug into the motherboard, each contain two Job Processors (JPs), cache memory, and input/output (I/O) capabilities. Located directly on the motherboard are additional integrated I/O capabilities in the form of two Small Computer System Interfaces (SCSI) and one Local Area Network (LAN) interface. The motherboard includes main memory, a memory controller (MC) and directory DRAMs for cache coherency. The motherboard also includes GTL backpanel interface logic, system clock generation and distribution logic, and local resources including a micro-controller for system initialization. A crossbar switch connects the various logic blocks together. A fully loaded motherboard contains 2 JP daughterboards, two PCI expansion boards, and up to 512 MB of main memory. Each daughterboard contains two 50 MHz Motorola 88110 JP complexes, having an associated 88410 cache controller and 1 MB Level 2 Cache. A single 16 MB third level write-through cache is also provided and is controlled by a third level cache controller.
Inventors:
Baxter; William F. (Holliston, MA)
Gelinas; Robert G. (Westboro, MA)
Guyer; James M. (Northboro, MA)
Huck; Dan R. (Shrewsbury, MA)
Hunt; Michael F. (Ashland, MA)
Keating; David L. (Holliston, MA)
Kimmell; Jeff S. (Chapel Hill, NC)
Roux; Phil J. (Holliston, MA)
Truebenbach; Liz M. (Sudbury, MA)
Valentine; Rob P. (Auburn, MA)
Weiler; Pat J. (Northboro, MA)
Cox; Joseph (Middleboro, MA)
Gillott; Barry E. (Fairport, NY)
Heyda; Andrea (Acton, MA)
Pike; Rob J. (Northboro, MA)
Radogna; Tom V. (Westboro, MA)
Sherman; Art A. (Maynard, MA)
Sporer; Michael (Wellesley, MA)
Tucker; Doug J. (Northboro, MA)
Yeung; Simon N. (Waltham, MA)
Assignee:
Data General Corporation (Westboro, MA)
Appl. No.:
695556
Filed:
August 12, 1996
Current U.S. Class:
710/104
710/317
711/3
713/375
Field of Search:
395/284,306,312,403,445,473,553
U.S. Patent Documents
5269013, December 1993, Abramson et al.
5434993, July 1995, Liencres et al.
5577204, November 1996, Brewer et al.
5603005, February 1997, Bauman et al.
5613153, March 1997, Arimilli et al.
5644753, July 1997, Ebrahim et al.
Other References
Oswell, John, Computing Canada, Looking ahead to ccNUMA, May 9, 1996, vol. 22, No. 10, pp. 42 (1).
Lenoski, D. et al., The Directory-Based Cache Coherence Protocol for the Dash Multiprocessor, Chap. 2887, pp. 148-159, Aug. 1990.
Kontothanassis, L., et al., University of Rochester, Software Cache Coherence for Large Scale Multiprocessors, Mar. 1994.
Stenstrom, P., et al., Computer Systems Laboratory, Comparative Performance Evaluation of Cache Numa and Coma Architectures, vol. 20, No. 2, May 1992.
Singh, J., et al., Computer Systems Laboratory, Stanford University, An Empirical Comparison of the Kendall Square Research KSR-1 and Stanford Dash Multiprocessors, ACM, pp. 214-225, 1993.
Chapin, J., et al., Computer Systems Laboratory, Memory System Performance of UNIX on CC-NUMA Multiprocessors, vol. 23, No. 1, May 1995.
Bolosky, W., et al., Numa Policies and Their Relation to Memory Architecture, ACM, pp. 212-221, Sep. 1991.
Lovett, T., et al., Sequent Computer Systems, Inc., Sting. A CC-NUMA Computer System for the Commercial Marketplace, ISCA, pp. 308-317, Mar. 1996.
Lenoski, D., et al., Computer Systems Laboratory, The Stanford Dash Multiprocessor, pp. 63-79, Mar. 1992.
Lenoski, D., et al., IEEE Transactions on Parallel and Distributed Systems, The Dash Prototype: Logic Overhead and Performance, vol. 4, No. 1, Jan. 1993.
Lenoski, D., et al., Computer Systems Laboratory, The Directory-Based Cache Coherence Protocol for the Dash Multiprocessor, Chap. 2887, pp. 148-159, Aug. 1990.
Senthil, K., Journal of Parallel and Distributed Computing, A Scalable Distributed Shared Memory Architecture, vol. 23, pp. 547-554, 1994.
Kontothanassis, L., Journal of Parallel and Distributed Computing, High Performance Software Coherence for Current and Future Architectures, vol. 29, pp. 179-195, 1995.
Hitoshi, O., Transactions of Information Processing Society of Japan, Performance Analysis of a Data Diffusion Machine with High Fanout and Split Directories, vol. 36, No. 7, pp. 1662-1668, Jul. 1995.
Nowatzyk, A., et al., Parallel Computing: Trends and Applications, Exploiting Parallelism in Cache Coherency Protocol Engines, Grenoble France, pp. 269-286, Sep. 1993.
Haridi, S., et al., Euro-Par '95 Parallel Processing, Experimental Performance Evaluation on Network-based Shared-memory Architectures, pp. 461-468, 1994.
Sevcik, et al., Computer Systems Research Institute, Performance benefits and limitations of large Numa multiprocessors, pp. 185-205, 1994.
Dewan, et al., Southern Methodist University, A Case for Uniform Memory Access Multiprocessors, pp. 20-26.
Li, et al., Cornell University, Access Normalization: Loop Restructuring for Numa Computers, vol. 11, No. 4, pp. 353-375, Nov. 1993.
Agarwal, et al., Massachusetts Institute of Technology, The MIT Alewife Machine: Architecture and Performance, pp. 2-13, 1995.
Chan, Tony, Ninth Annual International Conference, Application of the Scalable Coherent Interface in Multistage Networks, pp. 370-377, 1994.
Cukic, et al., University of Houston, The Performance Impact of False Subpage Sharing in KSR1, pp. 64-71, 1995.
Al-Mouhamed, Transactions on Parallel and Distributed Systems, Analysis of Macro-Dataflow Dynamic Scheduling on Nonuniform Memory Access Architectures, vol. 4, No. 8, pp. 875-888, Aug. 1993.
Wolski, et al., Journal of Parallel and Distributed Computing, Program Partition for Numa Multiprocessor Computer Systems, vol. 19, pp. 203-218, 1993.
Choe, et al., Seoul National University, Delayed Consistency and Its Effects on the Interconnection Network of Shared Memory Multiprocessors, pp. 436-439.
Sivasubramaniam, et al., Abstracting Network Characteristics and Locality Properties of Parallel Systems, pp. 54-63, 1995.
Abdelrahman, et al., University of Toronto, Distributed Array Data Management on Numa Multiprocessors, pp. 551-559, 1994.
LaRowe, et al., Transactions on Parallel and Distributed Systems, Evaluation of Numa Memory Management Through Modeling and Measurements, vol. 3, No. 6, Nov. 1992.
LaRowe, et al., ACM, The Robustness of Numa Memory Management, pp. 137-151, 1991.
Wilson, A., Jr., ACM, Encore Computer Corporation, Hierarchical Cache/Bus Architecture for Shared Memory Multiprocessors, pp. 244-252, 1987.
Kuskin, et al., Computer Systems Laboratory, The Stanford Flash Multiprocessor, pp. 302-313, 1994.
Chandra, R., et al., Computer Systems Laboratory, Scheduling and Page Migration for Multiprocessor Compute Servers, pp. 12-24, 1994.
Chaiken, D., et al., Massachusetts Institute of Technology, Limitless Directories: A Scalable Cache Coherence Scheme, pp. 224-234, 1991.
Brown, D., Convex Delivers Beta Appetizers, pp. 1-15, 1994.
Shreekant, et al., New Directions, Scalable Shared-Memory Multiprocessor Architectures, pp. 71-74, Jun. 1990.
Singh, et al., Computer, Scaling Parallel Programs for Multiprocessors: Methodology and Examples, pp. 42-50, 1993.
Singh, et al., Computer Systems Laboratory Stanford University, Load Balancing and Data Locality in Hierarchical N-body Methods, pp. 1-21.
Brown, D.H., KSR: Addressing The MPP Software Hurdle, pp. 1-18, Dec. 1993.
Primary Examiner:
An; Meng-Ai T.
Attorney, Agent or Firm:
Michaelis; Brian L. Lowry; David D. Bronstein; Sewell P.
Claims
What is claimed is:
1. A scalable multiprocessor computer system, comprising:
a backplane, including at least one backplane communication bus;
a plurality of motherboards, detachably connected to said backplane; each motherboard interfacing to said at least one backplane communication bus, each of said plurality of motherboards including:
at least one backplane communication bus interface mechanism interfacing at least one of said plurality of motherboards to said at least one backplane communication bus;
a motherboard communication bus comprising a first segment that is selectably interfaceable to said at least one backplane communication bus and at least one second segment, said motherboard communication bus including a crossbar register switch selectably interconnecting said at least one second segment of said motherboard communication bus to said first segment;
a motherboard communication bus request arbitration mechanism arbitrating requests from said plurality of motherboards for access to said first segment and said at least one second segment of said motherboard communication bus by selected ones of said plurality of motherboards;
a memory system including main memory distributed among said plurality of motherboards, directory memory for maintaining main memory coherency with caches on other motherboards, and a memory controller module for accessing said main memory and directory memory and interfacing to said motherboard communication bus; and
at least one daughterboard, detachably connected to said motherboard and interfacing to said motherboard communication bus, said at least one daughterboard further including:
a motherboard communication bus interface module, for interfacing said at least one daughterboard to said motherboard communication bus and a local bus on said daughterboard; and
at least one cache memory system including cache memory and a cache controller module maintaining said cache memory for a processor of said scalable multiprocessor computer system.
2. The scalable multiprocessor computer system of claim 1 wherein said main memory is contiguously addressable across said plurality of motherboards.
3. The scalable multiprocessor computer system of claim 1 wherein said backplane includes four backplane communication busses, and each of said plurality of motherboards includes four backplane communication bus interface modules.
4. The scalable multiprocessor computer system of claim 1 wherein one of said plurality of motherboards is selected to provide a system clock signal for all motherboards.
5. The scalable multiprocessor computer system of claim 4 wherein a second one of said plurality of motherboards is selected to provide a backup system clock signal for all motherboards.
6. The scalable multiprocessor computer system of claim 1 wherein each of said plurality of motherboards includes a peripheral interface module, said peripheral interface module interfacing to said motherboard communication bus, and to at least one peripheral device.
7. The scalable multiprocessor computer system of claim 6 wherein said at least one peripheral device is one of a local area network (LAN) device, a small computer system interface (SCSI) device or an expansion card device.
8. The scalable multiprocessor computer system of claim 1 wherein said at least one daughterboard includes two processor modules.
9. The scalable multiprocessor computer system of claim 8 wherein said at least one cache memory system on said at least one daughterboard provides a cache for said two processor modules.
10. A scalable distributed memory multiprocessor computer system including a backplane comprising a plurality of identical independent backplane buses, said backplane providing communication paths for a plurality of motherboards, each of said motherboards including at least one processor with a local cache memory, a motherboard communications bus, a motherboard communications bus to backplane interface module, and a memory system, wherein:
said motherboard communications bus to backplane interface module includes three input queues of high, medium, and low priority, and packets sent on said backplane to a motherboard are placed in one of said input queues depending on the priority of each packet;
all packets to the same cache line use the same one of said plurality of identical independent backplane buses;
high priority packets are always accepted into said memory system on a motherboard without needing to retry;
medium priority packets received from said plurality of identical independent backplane buses are granted onto a motherboard communication bus in the order in which said medium priority packages are received from said plurality of identical independent backplane buses;
packets for cache-inhibited reads, cache-inhibited writes, and cache-inhibited write unlocks are ordered with previous invalidate command and read invalidate reply packets previously inserted in said medium priority input queue;
packets to be received by more than one motherboard will arrive to each motherboard simultaneously;
all copyback invalidate commands and copyback commands are sent out on said backplane to a receiving motherboard;
all invalidate copybacks are sent out on said backplane to a receiving motherboard; and
a motherboard will retry any local resource requests while said medium input queue of said motherboard communications bus to backplane interface module contains any read invalidate reply, invalidate command, or copyback invalidate command packets.
11. The scalable distributed memory multiprocessor computer system of claim 10, wherein:
packets of low priority include cache-inhibited reads, cache inhibited writes, cache-inhibited write unlock and write-through;
packets of medium priority include invalidate commands, read invalidate replies, and copyback invalidate commands; and
packets of high priority include copyback replies, copyback invalidate replies, and writebacks.
12. A scalable multiprocessor computer system, comprising:
a backplane, including at least one backplane communication bus;
a plurality of motherboards, detachably connected to said backplane; each motherboard interfacing to said at least one backplane communication bus, each of said plurality of motherboards including:
at least one backplane communication bus interface mechanism interfacing at least one of said plurality of motherboards to said at least one backplane communication bus;
a motherboard communication bus comprising a first segment that is selectably interfaceable to said at least one backplane communication bus and a plurality of second segments, said motherboard communication bus including means for selectably interconnecting one of said plurality of second segments of said motherboard communication bus to said first segment;
an arbitration means for arbitrating requests from said plurality of motherboards for access to said first segment and said one of said plurality of second segments of said motherboard communication bus by selected ones of said plurality of motherboards;
a memory system including main memory distributed among said plurality of motherboards, directory means for maintaining main memory coherency with caches on other motherboards, and a memory controller means for accessing said main memory and directory memory and interfacing to said motherboard communication bus; and
at least one daughterboard, detachably connected to said motherboard and interfacing to said motherboard communication bus, said at least one daughterboard further including:
a motherboard communication bus interface means for interfacing said at least one daughterboard to said motherboard communication bus and a local bus on said daughterboard; and
at least one cache memory system including cache memory and a cache controller means for maintaining said cache memory for a processor of said scalable multiprocessor computer system.
13. A crossbar register switch for use in a scaleable multiprocessor computer system including at least one backplane communication bus, and a plurality of motherboards interfaced to said at least one backplane communication bus, each of said plurality of motherboards including a motherboard communication bus comprising a first segment that is selectably interfaceable to said at least one backplane communication bus and a plurality of second segments; said crossbar register switch comprising:
an interface to said first segment of said motherboard communication bus;
a plurality of bidirectional ports, said bidirectional ports interfacing to said plurality of second segments of said motherboard communication bus;
a motherboard communication bus request arbitration mechanism arbitrating requests from said plurality of motherboards for access to said first segment and to one of said plurality of second segments of said motherboard communication bus by selected ones of said plurality of motherboards;
wherein said crossbar register switch selectably interconnects said first segment of said motherboard communication bus to one of said second segments of said motherboard communication bus through a bidirectional port based arbitration by said motherboard communication bus request arbitration mechanism.
14. The crossbar register switch of claim 13 wherein said scaleable multiprocessor computer system includes a memory system including main memory distributed among said plurality of motherboards, directory memory for maintaining main memory coherency with caches on other motherboards, and a memory controller module for accessing said main memory and directory memory and interfacing to said motherboard communication bus.
15. The crossbar register switch of claim 13 wherein said crossbar register switch is implemented in a plurality of identical integrated circuit modules.
16. In a scalable distributed memory multiprocessor computer system including a backplane comprising a plurality of identical independent backplane buses, said backplane providing communication paths for a plurality of motherboards, each of said motherboards including at least one processor with a local cache memory, a motherboard communications bus, a motherboard communications bus to backplane interface module, and a memory system; a method for communicating between said motherboard communication bus and said backplane busses comprising:
providing three input queues of high, medium, and low priority;
designating which input queue to place a packet sent on said backplane to a motherboard;
guaranteeing that high priority packets will be received at the destination;
allowing packets to be received by more than one motherboard to arrive to each motherboard simultaneously;
ordering packets in said medium priority queue so that any medium priority packets which involve snoops or invalidations are ordered with previous invalidate command and read invalidate reply packets previously inserted in said medium priority queue across all backplane busses; and
sending any local memory requests out onto one of said backplane busses and into said while said medium priority queue of said motherboard communications bus to backplane interface module contains any read invalidate reply, invalidate command, or copyback invalidate command packets.
17. The method of claim 16, further including the steps of:
designating packets of low priority to include requests;
designating packets of medium priority to include replies and snoops; and
designating packets of high priority to include copyback replies and writebacks.
18. The method of claim 16, further including the steps of:
designating packets of low priority to include cache-inhibited reads, cache inhibited writes, cache-inhibited write unlock and write-through;
designating packets of medium priority to include invalidate commands, read invalidate replies, and copyback invalidate commands; and
designating packets of high priority to include copyback replies, copyback invalidate replies, and writebacks.
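The priority assignment recited in claims 11 and 18 can be read as a lookup from packet type to input queue. The following Python sketch is illustrative only and is not part of the claimed apparatus: the names `Priority`, `PACKET_PRIORITY`, and `InputQueues` are hypothetical, and the drain order shown (high, then medium, then low) is an assumption that the claims do not state.

```python
from collections import deque
from enum import IntEnum


class Priority(IntEnum):
    LOW = 0
    MEDIUM = 1
    HIGH = 2


# Packet-type keys paraphrase claims 11 and 18; the identifiers are hypothetical.
PACKET_PRIORITY = {
    "cache_inhibited_read": Priority.LOW,
    "cache_inhibited_write": Priority.LOW,
    "cache_inhibited_write_unlock": Priority.LOW,
    "write_through": Priority.LOW,
    "invalidate_command": Priority.MEDIUM,
    "read_invalidate_reply": Priority.MEDIUM,
    "copyback_invalidate_command": Priority.MEDIUM,
    "copyback_reply": Priority.HIGH,
    "copyback_invalidate_reply": Priority.HIGH,
    "writeback": Priority.HIGH,
}


class InputQueues:
    """Three FIFO input queues, one per priority level."""

    def __init__(self):
        self.queues = {p: deque() for p in Priority}

    def enqueue(self, packet_type, payload=None):
        # Place the packet in the queue that matches its priority.
        prio = PACKET_PRIORITY[packet_type]
        self.queues[prio].append((packet_type, payload))
        return prio

    def dequeue(self):
        # Assumed policy: drain the highest-priority non-empty queue first.
        for prio in sorted(Priority, reverse=True):
            if self.queues[prio]:
                return self.queues[prio].popleft()
        return None
```

Within each queue the sketch preserves arrival order, which mirrors the in-order granting of medium priority packets recited in claim 10.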
Description
RELATED APPLICATION
The present application claims the benefit of U.S. Provisional Application No. 60/002,320, filed Aug. 14, 1995, which is hereby incorporated herein by reference.
FIELD OF THE INVENTION
The present invention relates to multiprocessing computer systems, and more particularly to a flexible, highly scalable multiprocessing computer system incorporating a non-uniform memory access architecture.
BACKGROUND OF THE INVENTION
Symmetric multiprocessing (SMP) computer architectures are known in the art as overcoming the limitations of single or uniprocessors in terms of processing speed and transaction throughput, among other things. Typical, commercially available SMP systems are generally "shared memory" systems, characterized in that multiple processors on a bus, or a plurality of busses, share a single global memory. In shared memory multiprocessors, all memory is uniformly accessible to each processor, which simplifies the task of dynamic load distribution. Processing of complex tasks can be distributed among various processors in the multiprocessor system while data used in the processing is substantially equally available to each of the processors undertaking any portion of the complex task. Similarly, programmers writing code for typical shared memory SMP systems do not need to be concerned with issues of data partitioning, as each of the processors has access to and shares the same, consistent global memory.
However, SMP systems suffer disadvantages in that system bandwidth and scalability are limited. Although multiprocessor systems may be capable of executing many millions of instructions per second, the shared memory resources and the system bus connecting the multiprocessors to the memory present a bottleneck as complex processing loads are spread among more processors, each needing access to the global memory. As the complexity of software running on SMPs increases, resulting in a need for more processors in a system to perform complex tasks or portions thereof, the demand for memory access increases accordingly. Thus, adding more processors does not necessarily translate into faster processing, i.e. typical SMP systems are not scalable. That is, processing performance actually decreases at some point as more processors are added to the system to process more complex tasks. The decrease in performance is due to the bottleneck created by the increased number of processors needing access to the memory and the transport mechanism, e.g. bus, to and from memory.
Alternative architectures are known which seek to relieve the bandwidth bottleneck. Computer architectures based on Cache Coherent Non-Uniform Memory Access (CCNUMA) are known in the art as an extension of SMP that supplants SMP's "shared memory architecture." CCNUMA architectures are typically characterized as having distributed global memory. Generally, CCNUMA machines consist of a number of processing nodes connected through a high bandwidth, low latency interconnection network. The processing nodes are each comprised of one or more high-performance processors, associated cache, and a portion of a global shared memory. Each node or group of processors has near and far memory, near memory being resident on the same physical circuit board, directly accessible to the node's processors through a local bus, and far memory being resident on other nodes and being accessible over a main system interconnect or backbone. Cache coherence, i.e. the consistency and integrity of shared data stored in multiple caches, is typically maintained by a directory-based, write-invalidate cache coherency protocol, as known in the art. To determine the status of caches, each processing node typically has a directory memory corresponding to its respective portion of the shared physical memory. For each line or discrete addressable block of memory, the directory memory stores an indication of remote nodes that are caching that same line.
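The distinction between near and far memory can be illustrated with a short Python sketch. It assumes, purely for illustration, that every node holds an equal, contiguous slice of the global physical address space; the slice size `NODE_MEMORY_BYTES` and the function names are hypothetical and are not taken from any particular CCNUMA machine.

```python
# Assumed slice size: 512 MB of the global address space per node.
NODE_MEMORY_BYTES = 512 * 1024 * 1024


def home_node(phys_addr):
    """Return the node whose memory holds phys_addr (assumed contiguous layout)."""
    return phys_addr // NODE_MEMORY_BYTES


def is_near(phys_addr, local_node):
    """True if the address is served over the local bus, False if it is far memory."""
    return home_node(phys_addr) == local_node
```

Under these assumptions, an access for which `is_near` is false would travel over the system interconnect to the home node, whose directory memory tracks which remote nodes cache that line.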
One known implementation of the CCNUMA architecture is in a scalable, shared memory multiprocessor system known as "DASH" (Directory Architecture for SHared memory), developed at the Computer Systems Laboratory at Stanford University. The DASH architecture, described in The Directory-Based Cache Coherence Protocol for the DASH Multiprocessor, Lenoski et al., Proceedings of the 14th Int'l Symp. Computer Architecture, IEEE CS Press, 1990, pp 148-159, which is incorporated herein by reference, consists of a number of processing nodes connected through a high-bandwidth, low-latency interconnection network. As is typical in CCNUMA machines, the physical memory is distributed among the nodes of the multiprocessor, with all memory accessible to each node. Each processing node consists of: a small number of high-performance processors; their respective individual caches; a portion of the shared-memory; a common cache for pending remote accesses; and a directory controller interfacing the node to the network.
A weakly ordered memory consistency model is implemented in DASH, which puts a significant burden relating to memory consistency on software developed for the DASH system. In effecting memory consistency in the DASH implementation of CCNUMA architecture, a "release consistency" model is implemented, which is characterized in that memory operations issued by a given processor are allowed to be observed and completed out of order with respect to other processors. Ordering of memory operations is only effected under limited circumstances. Protection of variables in memory is left to the programmer developing software for the DASH multiprocessor, as under the DASH release consistency model the hardware only ensures that memory operations are completed prior to releasing a lock on the pertinent memory. Accordingly, the release consistency model for memory consistency in DASH is a weakly ordered model. It is generally accepted that the DASH model for implementing memory correctness significantly complicates programming and cache coherency.
A bus-based snoopy scheme, as known in the art, is used to keep caches coherent within a node on the DASH system, while inter-node cache consistency is maintained using directory memories to effect a distributed directory-based coherence protocol. In DASH, each processing node has a directory memory corresponding to its portion of the shared physical memory. For each memory block, the directory memory stores the identities of all remote nodes caching that block. Using the directory memory, a node writing a location can send point-to-point invalidation or update messages to those processors that are actually caching that block. This is in contrast to the invalidating broadcast required by the snoopy protocol. The scalability of DASH depends on this ability to avoid broadcasts on an inter-node basis.
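The directory lookup described above can be sketched as a per-block bit vector of sharing nodes. This Python sketch is a simplified illustration, not the DASH implementation: the class `Directory`, its method names, and the dictionary-backed storage are hypothetical.

```python
class Directory:
    """Per-block sharer tracking using one bit per node (illustrative only)."""

    def __init__(self, num_nodes):
        self.num_nodes = num_nodes
        self.sharers = {}  # block address -> bit vector of caching nodes

    def record_read(self, block, node):
        # Mark the node as holding a cached copy of the block.
        self.sharers[block] = self.sharers.get(block, 0) | (1 << node)

    def invalidation_targets(self, block, writer):
        # Only nodes whose bit is set receive a point-to-point invalidation;
        # nodes that never cached the block are not disturbed.
        vec = self.sharers.get(block, 0)
        targets = [n for n in range(self.num_nodes)
                   if (vec >> n) & 1 and n != writer]
        # After the write, the writer is the sole holder of the block.
        self.sharers[block] = 1 << writer
        return targets
```

With eight nodes and a block cached by nodes 1, 2, and 5, a write from node 2 yields two invalidation messages (to nodes 1 and 5) rather than a broadcast to all seven other nodes.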
The DASH architecture relies on the point-to-point invalidation or update mechanism to send messages to processors that are caching data that needs to be updated. All coherence operations, e.g. invalidates and updates, are issued point-to-point, sequentially, and must be positively acknowledged in a sequential manner by each of the remote processors before the issuing processor can proceed with an operation. This DASH implementation significantly negatively affects performance and commercial applicability. As acknowledged in the above-referenced publication describing DASH, serialization in the invalidate mechanism negatively affects performance by increasing queuing delays and thus the latency of memory requests.
DASH provides "fences" which can be placed by software to stall processors until pending memory operations have been completed, or which can be implemented to delay write operations until the completion of a pending write. The DASH CCNUMA architecture generally presents an environment wherein a significant burden is placed on software developers to ensure the protection and consistency of data available to the multiple processors in the system.
The DASH architecture, and more specifically its memory consistency and cache coherency mechanisms, also disadvantageously introduces opportunities for livelock and deadlock situations which may, respectively, significantly delay or terminally lock processor computational progress. The multiple processors in DASH are interconnected at the hardware level by two mesh networks, one to handle incoming messages, and the other to handle outgoing communications. However, the consumption of an incoming message may require the generation of an outgoing message, which can result in circular dependencies between limited buffers in two or more nodes, which can cause deadlock.
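A circular dependency of this kind can be modeled as a cycle in a wait-for graph, in which an edge from one node to another means the first cannot drain its buffer until the second accepts a message. The Python sketch below is a generic depth-first cycle search offered for illustration; the function name `find_cycle` and the graph representation are hypothetical and do not describe any mechanism in DASH.

```python
def find_cycle(wait_for):
    """Return a list of nodes forming a cycle in the wait-for graph, or None.

    wait_for maps each node to the nodes it is waiting on.
    """
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on current path, finished
    color = {n: WHITE for n in wait_for}
    path = []

    def visit(n):
        color[n] = GRAY
        path.append(n)
        for m in wait_for.get(n, ()):
            state = color.get(m, WHITE)
            if state == GRAY:
                # m is already on the current path: a circular dependency.
                return path[path.index(m):]
            if state == WHITE:
                cycle = visit(m)
                if cycle:
                    return cycle
        path.pop()
        color[n] = BLACK
        return None

    for n in list(wait_for):
        if color[n] == WHITE:
            cycle = visit(n)
            if cycle:
                return cycle
    return None
```

For example, `find_cycle({"A": ["B"], "B": ["A"]})` reports the two-node cycle that arises when each node's buffer is full and each waits for the other to accept a message.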
DASH further dedicates the meshes for particular service: the first mesh to handle communications classified as request messages, e.g. read and read-exclusive requests and invalidation requests, and the second mesh to handle reply messages, e.g. read and read-exclusive replies and invalidation acknowledges, in an effort to eliminate request-reply circular dependencies. However, request-request circular dependencies still present a potential problem, which is provided for in the DASH implementation by increasing the size of input and output FIFOs, which does not necessarily solve the problem but may make it occur less frequently. The DASH architecture also includes a time-out mechanism that does not work to avoid deadlocks, but merely accommodates deadlocks by breaking them after a selected time period. Although the DASH implementation includes some hardware and protocol features aimed at eliminating processor deadlocks, heavy reliance on software for memory consistency, and hardware implementations that require express acknowledgements and incorporate various retry mechanisms, presents an environment wherein circular dependencies can easily develop. Accordingly, forward progress is not optimized for in the DASH CCNUMA architecture.
The CCNUMA architecture is implemented in a commercial multiprocessor in a Sequent Computer Systems, Inc. machine referred to as "Sting" which is described in STING: A CCNUMA Computer System for the Commercial Marketplace, T. Lovett and R. Clapp, ISCA '96, May 1996, incorporated herein by reference. The Sting architecture is based on a collection of nodes consisting of complete Standardized High Volume (SHV), four processor SMP machines, each containing processors, caches, memories and I/O busses. Intra-processor cache coherency is maintained by a standard snoopy cache protocol, as known in the art. The SHVs are configured with a "bridge board" that interconnects the local busses of plural nodes and provides a remote cache which maintains copies of blocks fetched from remote memories. The bridge board interfaces the caches and memories on the local node with caches and memories on remote nodes. Inter-node cache coherency is managed via a directory based cache protocol, based on the Scalable Coherent Interface (SCI) specification, IEEE 1596. The SCI protocol, as known in the art, is implemented via a commercially available device that provides a linked list and packet level protocol for an SCI network. The chip includes FIFO buffers and Send and Receive queues. Incoming packets are routed onto appropriate Receive queues, while the Send queues hold request and response packets waiting to be inserted on an output link. Packets remain on the Send queues awaiting a positive acknowledgement or "positive" echo from the destination as an indication that the destination has room to accept the packet. If the destination does not have queue space to accept a packet, a negative echo is returned and subsequent attempts are made to send the packet using an SCI retry protocol.
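The echo-and-retry behavior described above can be sketched with two small queue models. This Python sketch is a deliberate simplification and does not reproduce SCI packet formats or the actual SCI retry protocol; the classes `Receiver` and `Sender`, their methods, and the retry counter are hypothetical.

```python
from collections import deque


class Receiver:
    """Destination with a bounded Receive queue."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.queue = deque()

    def offer(self, packet):
        # Positive echo (True) if there is room, negative echo (False) if not.
        if len(self.queue) < self.capacity:
            self.queue.append(packet)
            return True
        return False

    def consume(self):
        return self.queue.popleft() if self.queue else None


class Sender:
    """Source whose packets stay on the Send queue until positively echoed."""

    def __init__(self):
        self.send_queue = deque()
        self.retries = 0

    def submit(self, packet):
        self.send_queue.append(packet)

    def try_send(self, receiver):
        if not self.send_queue:
            return False
        if receiver.offer(self.send_queue[0]):
            self.send_queue.popleft()  # positive echo: release the packet
            return True
        self.retries += 1              # negative echo: keep it and retry later
        return False
```

In this model a sender facing a full receiver makes no progress until the receiver consumes a packet, which is the kind of stall that retry-based schemes have to tolerate.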
The linked list implementation of the SCI based coherency mechanism presents a disadvantage in that the links must be traversed in a sequential or serial manner, which negatively impacts the speed at which packets are sent and received. The retry mechanism has the potential to create circular dependencies that can result in livelock or deadlock situations. The linked list implementation also disadvantageously requires significant amounts of memory, in this case remote cache memory, to store the forward and back pointers necessary to effect the list.
Machines based on CCNUMA architecture presently known in the art do not take into consideration to any great extent respective workloads of each of the multiple processors as the machines are scaled up, i.e. as more processors or nodes are added. Disadvantageously, as more processors are added in known CCNUMA multiprocessors, limited, if any, efforts are made to ensure that processing is balanced among the job processors sharing processing tasks. Moreover, in such systems, when related tasks are distributed across multiple nodes for processing, related data needed for processing tends to be spread across the system as well, resulting in an undesirably high level of data swapping in and out of system caches.
Methods and operating systems are known for improving efficiency of operation in multiprocessor systems by improving affinity of related tasks and data with a group of processors for processing with reduced overhead, such as described in commonly assigned U.S. patent application Ser. No. 08/187,665, filed Jan. 26, 1994, which is hereby incorporated herein by reference. Further, as described in commonly assigned U.S. patent application Ser. No. 08/494,357, filed Jun. 23, 1995, which is incorporated herein by reference, mechanisms are known for supporting memory migration and seamless integration of various memory resources of a NUMA multiprocessing system. However, known CCNUMA machines generally do not incorporate mechanisms in their architectures for such improvements in load balancing and scheduling.
SUMMARY OF THE INVENTION
The present invention provides a highly expandable, highly efficient CCNUMA processing system based on a hardware architecture that minimizes system bus contention, maximizes processing forward progress by maintaining strong ordering and avoiding retries, and implements a full-map directory structure cache coherency protocol.
According to the invention, a Cache Coherent Non-Uniform Memory Access (CCNUMA) architecture is implemented in a system comprising a plurality of integrated modules each consisting of a motherboard and two daughterboards. The daughterboards, which plug into the motherboard, each contain two Job Processors (JPs), cache memory, and input/output (I/O) capabilities. Located directly on the motherboard are additional integrated I/O capabilities in the form of two Small Computer System Interfaces (SCSI) and one Local Area Network (LAN) interface. The motherboard (sometimes referred to as the "Madre" or "Sierra Madre") includes thereon main memory, a memory controller (MC) and directory Dynamic Random Access Memories (DRAMs) for cache coherency. The motherboard also includes GTL backpanel interface logic, system clock generation and distribution logic, and local resources including a micro-controller for system initialization. A crossbar switch (BAXBAR) is implemented on the motherboard to connect the various logic blocks together. A fully loaded motherboard contains 2 JP daughterboards, two Peripheral Component Interconnect (PCI) expansion boards, and eight 64 MB SIMMs, for a total of 512 MB of main memory.
Each daughterboard contains two 50 MHz Motorola 88110 JP complexes. Each 88110 complex includes an associated 88410 cache controller and 1 MB Level 2 Cache. A single 16 MB third level write-through cache is also provided and is controlled by a third level cache controller (TLCC) in the form of a TLCC application specific integrated circuit (ASIC). The third level cache is shared by both JPs, and is built using DRAMs. The DRAMs are protected by error correction code (ECC) which is generated and checked by two error detection "EDiiAC" ASICs under the control of the TLCC. Static Random Access Memories (SRAMs) are used to store cache tags for the third level cache. A Cache Interface (CI) ASIC is used as an interface to translate between a packet-switched local (PIX) bus protocol on the motherboard and the 88410 cache controller bus protocol on the JP Daughter Board.
The architecture according to the invention minimizes system bus contention by implementing four backplane or system busses referred to as "PIBus". Each of the four PIBus interconnects is a 64 bit wide, multiplexed control/address/data wire. Multiple system busses may be implemented to provide one, two or four backplane or system busses, depending upon the particular implementation and the related coherency protocol(s). The PIBus, in an illustrative embodiment described hereinafter is used in implementing a directed-broadcast system bus transfer protocol that limits system wide resource overhead to modules or nodes targeted to service a request.
Throughput on the PIBus is maximized, and transfer latencies minimized, by a memory based, full-map directory structure cache coherency protocol, that minimizes snooping. The full-map directory structure is maintained in the memory modules that are accessible over the PIBus. Each directory contains one entry per cache line in the corresponding memory. The directory entries contain coherency information for their respective cache lines. The directory entry fields include: valid bits; modified bit; lock bit; unordered bit and an ordered bit. All memory addresses on the PIBus are routed to the appropriate memory module. Each address is put in a queue for service by the memory. Each address is looked up in the directory and the memory will generate a response based on the directory contents and the type of access requested. The memory will send a response which will be picked up only by those nodes that have a valid copy of the accessed cache line, i.e. a directed broadcast. The responses from memory issued in the directed broadcast transfer protocol include invalidates, copyback and read data. The directed broadcast transfer protocol implementation according to the invention avoids unnecessary processor stalls in processors whose caches do not have a copy of the line being addressed, by forwarding "snoop" traffic in a manner that it will only affect those nodes that have a valid copy of the line being addressed. The memory uses the valid bit field in the directory as an indicator as to which nodes have a copy of an accessed cache line.
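By way of illustration only, the selection of targets for a directed broadcast may be sketched as follows. The field layout, the encoding of the valid bits as a per-node bit mask, the node count of eight, and the exclusion of the requester from the target list are all assumptions of this sketch and are not taken from the described embodiment.

```python
from dataclasses import dataclass

NUM_NODES = 8  # assumption: one node per backpanel slot


@dataclass
class DirectoryEntry:
    """Coherency state for one cache line (field layout is illustrative)."""
    valid: int = 0            # bit n set => node n holds a valid copy
    modified: bool = False
    lock: bool = False
    unordered: bool = False
    ordered: bool = False


def directed_broadcast_targets(entry, requester):
    """Return the nodes that must pick up a response for this line.

    Only nodes whose valid bit is set are targeted. Excluding the
    requester itself is an assumption of this sketch.
    """
    return [n for n in range(NUM_NODES)
            if (entry.valid >> n) & 1 and n != requester]
```

Nodes whose valid bit is clear never appear in the returned list, which is the property that spares their processors the snoop traffic.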
Ordering of events occurring with respect to the backbone or backplane PIBus is effected so as to maximize processing forward progress by maintaining strong ordering and avoiding retries. All of the operations initiated by one requester must appear to complete in the same order to all other requesters, i.e. cache, processor(s), I/O, in the system. Events are ordered by adhering to a three level priority scheme wherein events are ordered low, medium or high. Strict rules are implemented to ensure event ordering and to effect coherent ordering on the PIBus between packets of different priorities.
The three level priority scheme according to the invention, works in conjunction with arbitration services, provided by an "ORB" ASIC, to effectively guarantee forward progress and substantially avoid livelock/deadlock scenarios. The arbitration mechanism is a function of the type of bus involved, and accordingly there is arbitration associated with the local PIX bus, i.e. local to the motherboard, and arbitration associated with access to the system wide or PIBus.
The motherboard level PIX busses each use a centralized arbitration scheme wherein each bus requester sends the ORB ASIC information about the requested packet type and about the state of its input queues. The ORB ASIC implements a fairness algorithm and grants bus requests based on such information received from requesters, and based on other information sampled from requesters. The ORB samples a mix of windowed and unwindowed requesters every bus clock cycle. Windowed requests have associated therewith particular time periods during which the request signal must be sampled and a grant issued and prioritized in accordance with predetermined parameters. At the same time that PIX bus requesters are being sampled, the ORB samples the busy signals of the potential bus targets. During the cycle after sampling, the ORB chooses one low priority requester, one medium priority requester and one high priority requester as potential bus grant candidates, based on: ordering information from a low and a medium request tracking FIFO; the state of the Busy signals sampled; and a "shuffle code" which ensures fairness of bus grants. Further selection for a single candidate for the PIXbus grant involves a prioritization algorithm in which high priority requests have priority over medium requests which have priority over low, and in which medium level requests are subjected to a "deli-counter-ticket" style prioritization scheme that maintains time ordering of transactions. High and low priority requests are not strictly granted based on time ordering.
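The final selection step described above may be sketched, under stated assumptions, as follows. The class and function names are hypothetical, the requester labels are placeholders, and the sketch models only the high-over-medium-over-low rule and the "deli-counter-ticket" ordering of medium requests; the windowed sampling, Busy signals and shuffle code are omitted.

```python
from collections import deque

PRIORITY_ORDER = ("high", "medium", "low")


class TicketQueue:
    """Deli-counter ordering: requesters are served in ticket order."""

    def __init__(self):
        self._next_ticket = 0
        self._waiting = deque()  # (ticket, requester) in arrival order

    def take_ticket(self, requester):
        ticket = self._next_ticket
        self._next_ticket += 1
        self._waiting.append((ticket, requester))
        return ticket

    def head(self):
        """Requester holding the oldest outstanding ticket, or None."""
        return self._waiting[0][1] if self._waiting else None

    def serve(self):
        """Remove and return the requester with the oldest ticket."""
        return self._waiting.popleft()[1]


def choose_grant(candidates):
    """Pick one grant from at most one candidate per priority level.

    candidates maps "high"/"medium"/"low" to a requester or None.
    Returns (level, requester), or None when nothing is requesting.
    """
    for level in PRIORITY_ORDER:
        requester = candidates.get(level)
        if requester is not None:
            return level, requester
    return None
```

In this sketch the medium candidate passed to `choose_grant` would be `TicketQueue.head()`, so that medium requests are always offered in time order.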
The system wide backpanel, or PIBus, arbitration mechanism is handled separately for each of the four PIBusses. The arbitration/grant logic is distributed across respective "PI" ASICs, which facilitate traffic between the PIX bus and the PIBus in both directions. PIBus arbitration is based on a "windowed-priority" distributed arbitration with fairness, in which there are specific times, i.e. windows, during which request signals are sampled and then grants associated with each request are prioritized. The requests are prioritized based on a shuffle code that ensures fairness. Since the arbitration logic is distributed, each PIBus requester knows the request status of all the other requesters on the bus, and the local requester only needs to know whether a particular grant is for itself or for another requester.
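The property that every requester can compute the grant locally may be illustrated by the following sketch. The actual shuffle code algorithm is not given in this passage; a simple rotating priority is substituted here purely as a stand-in, and the function names and the requester count of eight are assumptions.

```python
def grant_order(requests, shuffle_code, num_requesters=8):
    """Order in which the requests sampled in one window are granted.

    Stand-in fairness rule: priority rotates so that the requester whose
    number equals shuffle_code is considered first.
    """
    start = shuffle_code % num_requesters
    rotation = [(start + i) % num_requesters for i in range(num_requesters)]
    return [r for r in rotation if r in requests]


def is_my_grant(my_id, requests, shuffle_code, num_requesters=8):
    """Local decision made by one requester: is the next grant mine?"""
    order = grant_order(requests, shuffle_code, num_requesters)
    return bool(order) and order[0] == my_id
```

Because every requester samples the same request signals and holds the same shuffle code, each arrives at the same order without any central arbiter, and exactly one of them concludes that the grant is its own.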
The "BAXBAR" crossbar switch is implemented on the motherboard to connect the various logic blocks of the CCNUMA architecture according to the invention together, and to propagate transfers between the busses on the motherboard and the daughterboard. The crossbar switch supports six 19 bit bidirectional ports and two 18 bit bidirectional ports, and is controlled by a three bit port select and an eight bit enable control. The port select bits control selection of eight potential sources for outputs, and also enable selected output ports.
Features of the invention include a highly efficient, high performance multiprocessor distributed memory system implemented with a high speed, high bandwidth, extensible system interconnect that has up to four busses available for multiprocessor communication. The architecture provides a highly scalable open-ended architecture. In contrast to the typical bus-snooping protocols known in the art, in which each cache must look up all addresses on the bus, the directed broadcast protocol according to the invention increases system performance by not interfering with nodes that do not have a copy of an accessed cache line. Accordingly, unnecessary processor stalls are avoided. The CCNUMA system implementation according to the invention maximizes forward progress by avoiding retries and maintaining strong ordering and coherency, avoiding deadly embraces. Strong ordering, i.e. completion of any two consecutive operations initiated by a single requester being observable by any other entity, i.e. cache, processor, I/O, only in their original order, takes much of the burden and complexity relating to memory consistency out of the hands of software implementations and rests it with hardware in a manner that makes for greater consistency and predictability. The system wide or backplane bus distributed arbitration mechanism ensures fairness in bus accesses while maintaining ordering to a high degree. Node-local centralized local bus arbitration effects highly efficient and fair access to local resources.
BRIEF DESCRIPTION OF THE DRAWING
These and other features and advantages of the present invention will become more apparent from the following detailed description taken in conjunction with the accompanying drawing in which:
FIG. 1 is a high level block diagram of a multiprocessor system implementing a CCNUMA architecture according to the invention;
FIG. 2 is a block diagram of a motherboard of the multiprocessor system of FIG. 1;
FIG. 3 is a block diagram of one daughter board for connection to the motherboard of FIG. 2;
FIG. 4 is a memory map of distributed system memory distributed among the motherboards of the multiprocessor system of FIG. 1;
FIG. 5 is a block diagrammatic overview of a PI asic controlling access to and from a system backplane or PIBus;
FIG. 6 is a Table representing PIXBUS Operation Decode and Queue Priority Assignment;
FIG. 7 is a block diagram of a PI Header Buffer and Data Queue;
FIG. 8 is a block diagram of PI arbitration;
FIG. 9 is a high level block diagram of the memory complex of the multiprocessor system of FIG. 1;
FIG. 10 is a block diagram of a memory controller ASIC;
FIGS. 11-24 are state machine diagrams for state machines implementing functionality in the memory controller of FIG. 10;
FIG. 25 is a block diagram of an Error Detection and Control device ("EDiiAC" or "EDAC") ASIC;
FIGS. 26A and 26B are Tables of Cache Request Transitions;
FIG. 27 is a Table of Cache Inhibited Request Transitions;
FIG. 28 is a block diagram of an ORB ASIC;
FIG. 29 is a state machine diagram for a TR_TRACKER state machine implemented in the ORB ASIC of FIG. 28;
FIG. 30 is a block diagram of a BaxBar crossbar switch;
FIGS. 31A, 31B and 31C illustrate crossbar source selection, PORT_OE assignments, and port to bus mapping of the BaxBar crossbar switch, respectively;
FIG. 32 is a block diagram of a GG ASIC;
FIG. 33 is a block diagram of an RI ASIC;
FIGS. 34-36 are state machines for resources operation request, resource bus request and resources looping, respectively, implemented in the RI ASIC of FIG. 33;
FIG. 37 is a block diagram of a CI ASIC; and
FIG. 38 is a block diagram of a TLCC ASIC.
DETAILED DESCRIPTION
As illustrated in FIG. 1, a CCNUMA processing system according to the present invention includes a plurality of motherboards (52) interconnected by a backplane (54). The backplane includes 4 PI buses (56), which provide communication between the motherboards (52). The PI busses (56) are all identical, allowing up to four sets of motherboards (52) to transfer data simultaneously. Each motherboard (52) is a standard module, allowing the processing system (50) to contain virtually any number of motherboards required for the processing load. Motherboards (52) are easily added to increase the processing power.
A single motherboard (52), as illustrated in FIG. 2, is an integrated module containing processors, memory, and I/O. The processors, memory, and I/O expansion facilities are all contained on separate daughter boards or SIMMs (Single Inline Memory Modules) which plug into the motherboard. Located directly on the motherboard there are additional integrated I/O facilities, including 2 SCSI (Small Computer System Interface) and 1 LAN (Local Area Network). The motherboard also includes a memory controller and directory DRAMs (for cache coherency), Local Resources including a micro-controller for system initialization, GTL backpanel interface logic, System Clock generation and distribution logic, and a Crossbar switch to connect the various logic blocks together.
A fully loaded motherboard (52) contains 2 processor Daughter Boards (58a), (58b), two PCI expansion boards (60a), (60b), and 512 MB of main memory (62) comprised of eight 64 MB SIMMs. Many of the functional modules are implemented using ASICs (Application Specific Integrated Circuits).
Functional Overview
PIBus
The primary communication between processors across the backpanel is accomplished using the PIBus Interface. A single PIBus (56) consists of a multiplexed 72-bit Address CTRL/Data bus and associated arbitration and control signals. Each motherboard (52) implements 4 identical PIBus Interfaces using respective PI ASICs (64a-d), as will be described hereinafter. System traffic is partitioned across the 4 PI Busses (56) by address, so that each bus (56) is approximately equally utilized. The PIBus (56) is implemented using GTL logic. This is a logic level/switching standard that allows for very high speed communication across a heavily loaded backpanel. The logic signals switch between 0.4 and 1.2 V.
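One way in which traffic could be partitioned across the four PI busses by address is sketched below. This passage states only that the partitioning is by address; the choice of interleaving on consecutive 64 byte cache lines, and the function name, are assumptions made for illustration.

```python
CACHE_LINE_BYTES = 64   # cache block size assumed for this sketch
NUM_PIBUSES = 4


def pibus_for_address(addr):
    """Select a PI bus so that consecutive cache lines rotate over all four.

    Assumed mapping: the two address bits just above the line offset
    choose the bus. Every byte of a given line maps to the same bus.
    """
    return (addr // CACHE_LINE_BYTES) % NUM_PIBUSES
```

Under this assumed mapping a sequential sweep through memory loads all four busses equally, which is consistent with the stated goal of approximately equal utilization.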
PIXBus
The PIXbus (66) is the name given to the bus protocol that is used to connect the functional elements of the motherboard (52) together. This is a packetized 72 bit wide multiplexed address/data bus using a similar protocol to that which the PIBus (56) uses across the backpanel (54). This bus (66) is actually implemented as a series of busses that connect into/out of a central crossbar switch (68), referred to in some places herein as the "BaxBar". The PIXbus is implemented, using LVTTL technology, via 4 BaxBar ASICs (70). A major portion of the PIX Bus (66) is an interconnection between the BaxBar ASICs (70) and the four PI (PIBus Interface) ASICs (64a-d). This bus (66) uses AC Termination for signal integrity and timing. Arbitration for the PIXBus is provided by an ORB ASIC (98), as described in detail hereinafter. The complete PIXBus is actually comprised of a plurality of individual busses interconnecting the functional components on the motherboard of the system according to the invention, including:
an RI bus (72) portion of the PIXBus which connects the BaxBar ASICs (70) to an RI (Resources Interface) ASIC (74) and to debug buffers and a debug connector;
a GG bus (76) portion of the PIXBus which connects the BaxBar ASICs (70) to two GG (Golden Gate, I/O Interface) ASICs (78a-b). This bus uses series resistors near to the GG for Signal Integrity/timing improvement;
an MC Bus (80) portion of the PIXBus connects the BaxBar ASICs (70) to a MC (Memory Controller) ASIC (82);
a CI0 Bus (88a) portion of the PIXBus connects the BaxBar ASICs (70) to a first daughterboard (58a);
a CI1 Bus (88b) portion of the PIXBus connects the BaxBar ASICs (70) to a second daughterboard (58b); and
MUD_L (92) and MUD_H Bus (94) portions of the PIXBus which are two busses used to connect the BaxBar ASICs (70) to two EDiiAC ASICs (96) facilitating data integrity of data from the memory system which is generally comprised of memory (62) and directory tag memory (86).
Memory Subsystem
The Memory subsystem on the motherboard (52) is capable of providing up to 512 MB of system memory for the processing system (50). Actual DRAM storage is provided by up to eight 16M×36 standard SIMMs (62). One motherboard (52) can be populated with 0, 4 or 8 SIMMs. Data is typically accessed in full 64 Byte Cache blocks, but may also be read and written in double word or 64 bit quantities. The memory data is protected using ECC (Error Correction Code) which is generated for data correction using two of the EDiiAC ASICs (96a-b). Each EDiiAC (96) provides a 64 bit data path and the two are used to interleave within a cache block to maximize performance.
In addition to the main memory data store, the memory subsystem also contains storage for a full map directory (86) which is used to maintain cache coherency, as described in detail hereinafter. The directory (86) is implemented using 4M×4 DRAMs attached directly to the motherboard (52). The directory is organized as an 8M×17 storage using 11 data bits and 6 ECC bits. The ECC codes for both the directory and the main data store are capable of correcting all single bit errors and detecting all double-bit errors.
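The relationship between the main memory size and the depth of the directory can be checked with a short calculation. The figures are those given above (64 MB SIMMs, 64 byte cache blocks, 11 data bits plus 6 ECC bits per directory entry); the function and constant names are illustrative only.

```python
MB = 1 << 20
SIMM_BYTES = 64 * MB          # each SIMM contributes 64 MB of data
CACHE_LINE_BYTES = 64         # one directory entry per cache block
DIRECTORY_DATA_BITS = 11
DIRECTORY_ECC_BITS = 6
DIRECTORY_WIDTH_BITS = DIRECTORY_DATA_BITS + DIRECTORY_ECC_BITS


def main_memory_bytes(simm_count):
    """Main memory size for a legal SIMM population (0, 4 or 8)."""
    if simm_count not in (0, 4, 8):
        raise ValueError("a motherboard holds 0, 4 or 8 SIMMs")
    return simm_count * SIMM_BYTES


def directory_entries(memory_bytes):
    """Number of directory entries needed: one per cache line."""
    return memory_bytes // CACHE_LINE_BYTES
```

With eight SIMMs this gives 512 MB of main memory and 8M directory entries of 17 bits each, which agrees with the 8M×17 organization of the directory.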
I/O Subsystem
The I/O subsystem of the motherboard (52) is comprised of two independent PCI channels (79a-b) operating at 25 MHz. Each PCI channel (79) is interfaced to the PIX bus (66) using a single GG ASIC (78) which also contains an integrated cache for I/O transfers. The GG ASIC (78) contains all necessary logic to provide the interface between the 50 MHz PIX bus (66) and the 25 MHz PCI bus (79), including PCI arbitration. The GG ASIC (78) also serves as a gatherer of interrupts from system wide areas and combines these interrupts and directs them to the appropriate processor.
Each of the two PCI busses (79) is connected to an integrated SCSI interface (98), and to a single expansion slot (60). One of the two PCI busses (79a) also contains an integrated 10 Mb LAN interface (100). The two SCSI interfaces (98a-b) are implemented using the NCR825 Integrated PCI-SCSI controller as a pair of Wide Differential SCSI-2 interfaces. Each controller is connected through a set of differential transceivers to a 68 pin High Density SCSI connector (not shown). The single LAN connection (100) is made using the DECchip 21040 PCI-Ethernet controller. This provides a single chip integrated LAN which is connected to an RJ-45 connector (not shown).
The two expansion PCI slots are provided for by attaching a PCI Daughterpanel to the motherboard. This small board provides a connection between high-density AMP connectors and a standard PCI card connector. The board also allows the two PCI cards to be plugged in parallel to the motherboard. The motherboard design has space to allow two half size PCI cards to be plugged into each motherboard. Further PCI expansion is achieved by using a PCI expansion chassis, and plugging a host-side adapter cable into one of the motherboard expansion slots.
Resources
Each motherboard (52) contains all the local resources that are required of a system (50), with the exception of the System ID PROM (not shown) which is contained on the backpanel (54). The resource logic on the motherboard (52) includes a Microcontroller (102), state-recording EEPROMs (Electrically Erasable Programmable Read Only Memory, not shown), NOVRAM (Non-Volatile RAM), and SCAN interface logic (104) which is described in detail in copending commonly owned PCT Application Ser. No. PCT/US96/13742 (Atty Docket No. 158/46,642), HIGH AVAILABILITY COMPUTER SYSTEM AND METHODS RELATED THERETO, which is incorporated herein by reference. The resource logic is duplicated on each motherboard (52), but a working system (50) only ever uses the resources section of the board in either slot 0 or slot 1 of the backplane system (54) as system wide Global Resources. An RI (Resources Interface) ASIC (74) provides the interface between the PIXbus (72) and the devices within the Resources section on the motherboard (52).
The Microcontroller (102) in the resources section is used to perform low-level early power-up diagnostics of the system (50) prior to de-asserting RESET to the processors. It is also the controller/engine used for all scan operations, as described in the referenced application. Generally, scan is used to configure the ASICs during power up, communicate with the power supplies and blowers, communicate with the various ID PROMs within the system, and to dump failure information after a hardware fatal error. If a processor needs to do a scan operation, it makes a request to the micro-controller (102) which can then perform the required operation.
The Resources section also provides a DUART (Dual Universal Asynchronous Receiver and Transmitter, not shown) for implementing 3 UART ports for the system (50). A fourth UART port is also used as part of a loopback circuit to allow a processor to monitor what is being driven on the main system console (not shown).
The resources section also provides the logic to do JTAG based scan of all the ASICs in the system (50), power supplies, blowers, SEEPROM and SYSID PROM, in accordance with the IEEE 1149.1 standard. The logic is in place to allow the system to be scanned either during Manufacturing Test using an external tester (e.g. ASSET) or during normal operation/power-up using the microcontroller on any motherboard in the system. This logic allows simple boundary scan testing to be used as part of the power-up system testing to detect and isolate possible faulty components.
Additionally, Macro Array CMOS High Density devices (MACHs) which are high density electrically erasable CMOS programmable logic, on the resource bus can be programmed using JTAG from an external connector. Also, the microcontroller can be used with an external connector to program the EEPROMs on the resource bus. This allows manufacturing to assemble the boards with blank MACHs and EEPROMs and then "burn" them as part of the test procedure, rather than stocking "burned" versions of the parts to be installed during assembly. This "in circuit programmability" feature also makes updates for ECO activity as simple as plugging in the programming connector and re-programming the parts, rather than removing the old part and installing a new part in its place.
Clocks
Each motherboard (52) contains the necessary logic to generate and distribute both 50 MHz and 12.5 MHz clocks to the other boards in the system (not shown). It also contains the logic to distribute the received clocks from the backpanel to all appropriate clock loads with a minimum of added skew. The clocks for a system (50) will always be sourced by either the motherboard (52) in slot 0 or the motherboard (52) in slot 1. Each slot receives clocks from both slots and selects clocks from the appropriate slot (slot 0 unless the clocks from slot 0 have failed).
Each motherboard contains two PECL crystals used for generation of all system clocks. These two crystals are a 100 MHz nominal clock crystal and a 105 MHz margin clock crystal. Both of these crystals are passed through a divide by two circuit to produce 50 and 52.5 MHz system clocks with 50% duty cycle. These two clocks are muxed together to produce the system clock for the system (50). The multiplexing is controlled from the resources section and allows either nominal or margin clocks to be used by the system. The chosen clock is buffered and 8 differential copies (one for each slot in a system) are driven out to the backpanel (PECL_CLK_OUT). A ninth copy of the system clock is further divided to produce a nominally 12.5 MHz signal which is used to generate the 12.5 MHz scan/resources clock on each motherboard. Eight differential copies of this signal are also distributed to the backpanel.
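The clock frequencies described above follow from simple division of the crystal frequencies, as the sketch below shows. The divide-by-four ratio for the scan/resources clock is inferred from the 50 MHz and 12.5 MHz figures rather than stated explicitly, and the function names are hypothetical.

```python
from fractions import Fraction

NOMINAL_CRYSTAL_MHZ = Fraction(100)
MARGIN_CRYSTAL_MHZ = Fraction(105)


def derived_clocks(crystal_mhz):
    """System and scan/resources clock frequencies for one crystal."""
    system = crystal_mhz / 2   # divide by two, giving a 50% duty cycle
    scan = system / 4          # inferred ratio: 50 MHz -> 12.5 MHz
    return {"system": system, "scan": scan}


def select_clock_source(slot0_clocks_ok):
    """Slot whose clocks are used: slot 0 unless its clocks have failed."""
    return 0 if slot0_clocks_ok else 1
```

With the margin crystal selected, the same dividers yield a 52.5 MHz system clock, and the scan/resources clock rises proportionally to 13.125 MHz.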
Each motherboard receives two 50 MHz system clocks from the backpanel. All first level differential pairs are routed to the same length, and all second level differential pairs are routed to the same length to reduce clock skew.
50 MHz TTL clocks are produced using a translator/distribution device, such as a Synergy Copyclock as known in the art. This device receives a differential PECL clock and translates it to TTL. An external feedback loop is used with the translator to add phase delay to the output clocks until the input of the feedback clock is in phase with the input clock. This has the net effect of eliminating skew between the differential PECL clock distributed to the ASICs and the TTL clock distributed to the EDiiACs (96) and synchronizing buffers.
The PECL clock lines are thevenin terminated to VDD (3.3 V) using 62 ohm over 620 ohm resistors. The TTL clocks are source series terminated inside the translator chip.
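The equivalent termination presented by such a resistor pair can be computed as follows. The sketch assumes that "62 ohm over 620 ohm" means 62 ohms from the line to VDD and 620 ohms from the line to ground; that reading, and the function name, are assumptions.

```python
def thevenin_equivalent(vdd, r_upper, r_lower):
    """Thevenin voltage and resistance of a two-resistor divider.

    r_upper connects the line to vdd and r_lower connects it to ground.
    """
    total = r_upper + r_lower
    v_th = vdd * r_lower / total
    r_th = r_upper * r_lower / total
    return v_th, r_th
```

Under the assumed orientation, `thevenin_equivalent(3.3, 62, 620)` gives an equivalent of approximately 3.0 V behind approximately 56.4 ohms.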
Each motherboard (52) generates a 25 MHz clock that is used for the PCI devices. This clock is derived from the 50 MHz system clock divided by two, and is then PECL to TTL translated by the translator. The length of the feedback loop for the translator was calculated to provide the desired skew correction to make the 25 MHz clock have the minimum skew in relation to the 50 MHz clock.
All the clock lines are thevenin terminated the same way as the 50 MHz clocks with the exception of the expansion clocks which are series terminated using 51 ohm resistors.
Each motherboard (52) contains logic that allows it to detect and signal that there is a problem with the clock distribution logic. In slots 0 and 1 this logic also forms a means to have the clock distribution automatically failover from clocks in slot 0 to clocks in slot 1, as described in the referenced PCT application.
Daughter Boards
The system Daughter Boards (58), as illustrated in FIG. 3, each contain two 50 MHz Motorola 88110 processor complexes. Each 88110 processor (110) has an associated 88410 cache controller (112) and 1 MB Level 2 Cache (114) built using eight MCM67D709 SRAMs. A single 16 MB third level write-through cache (116) is also provided and is controlled by a TLCC (Third Level Cache Controller) ASIC (118). The third level cache (116) is shared by both processors (110), and is built using ten 60 ns 1M×16 DRAMs. The DRAMs are protected by ECC (Error Correction Code), which is generated and checked by two EDiiAC ASICs (120) under the control of the TLCC ASIC (118). Tag memory (122) built with three 12 ns 256K×4 SRAMs is used to store the cache tags for the Third Level Cache. A CI ASIC (124) is used to translate between the packet-switched PIX bus protocol on the motherboard (52) and the 88410 cache controller data bus (126) protocol on the Daughter Board (58).
System Functional Description
PIX Bus Interface
The system according to the invention uses a packetized split response bus protocol to communicate between the processors and memory or I/O. The system also uses a Directory based cache coherency mechanism to eliminate snoop cycles on the main system busses. The CI ASIC's (124) main function is to serve as a translation/sequencer between the PIX bus protocol that is used on the motherboard (52) and the 88410 bus protocol on the daughterboard (58). All off board communication, with the exception of Clocks and Reset, is part of the PIX bus and is connected directly to the CI. The PIX bus (88) consists of a 64 bit address/data bus with 8 bits of parity, 2 additional "bussed" control signals that indicate the length of the current packet and an error indication. There are an additional 11 signals that are used to provide arbitration control. The PIX bus categorizes different bus operations into three different priorities, LOW, MED, and HIGH, and each PIX bus entity implements queues as appropriate to allow it to receive multiple packets of each priority, as described hereinafter. The CI ASIC (124) only receives Low or Med packets and generates only Low and High packets.
Cache Bus Interface
The two CPU complexes, CI, and TLC, all on the daughterboard, are connected together by the S_D bus (126), consisting of 64 bits of data and 8 parity bits, and the S_A bus (128) which consists of 32 bits of address and additional control lines (130). Arbitration for access to the cache bus is performed by the CI ASIC (124). There are three possible bus masters; each of the two processors (110) for read and write operations (data transfers to or from cache) and the CI (124) for snoop operations (no data transfer). The TLC (118) is always a bus slave. Due to pin limitations, the CI ASIC (124) multiplexes the 32 bit S_A (128) and 32 bits of the S_D bus (126) into a 32 bit S_AD bus (134). This multiplexing is done using four LVT162245 devices (134).
When an 88110 processor (110) detects a parity error during a read operation it asserts a P_BPE_N signal for a single cycle. This signal is monitored by the CI ASIC (124) and will cause a Fatal Error to be asserted when detected.
Because the system coherency is maintained by the MC (82, FIG. 2) and the directory, the CPU complexes must be prevented from modifying a line of data that was previously read in. This is done by causing all read requests to be marked as SHARED in the 88410 (112, FIG. 3), and 88110 (110). In hardware, this is accomplished by pulling down S_SHRD_N and S_TSHRD_N pins on the 88410 (112) and the P_SHD_N signal on the 88110 (110).
Third Level Cache
The Third Level Cache (TLC) on the daughterboard (58) is a 16 MB direct mapped cache implemented using 1M×16 DRAMs. The cache is implemented using a write-through policy. This means that the cache never contains the only modified copy of a cache line in the system, and as such only ever sources data to either of the two processors (110) on the daughterboard (58) as the result of a read request.
The data store for the cache is constructed from 10 1M.times.16 60 ns DRAMs (116). These DRAMs are organized as two banks of 5 DRAMs which contain 64 bits of data plus 8 bits of ECC. Each bank of DRAMs is associated with an EDiiAC ASIC (120a-b) which is used to buffer the data and to perform error detection and correction of data read from the cache. The system outputs of the two EDiiACs are multiplexed down to the 64 bit S.sub.-- D bus (126) using six ABT16260 2:1 latching multiplexers (138). The tag store for the cache is implemented using three 256K.times.4 12 ns SRAMs (122). Control for the whole TLC is provided by the TLCC ASIC (118), as described in detail hereinafter. Due to timing constraints on the S.sub.-- D bus (126) the output enable and mux select for the ABT16260 muxes (138) are driven by an FCT374 octal register (not shown). The inputs to the register are driven out one cycle early by the TLCC ASIC (118). The latch enables used to latch data from the S.sub.-- D bus (126) also use external logic. They are derived from the 50 Mhz clock, described in the clock distribution section.
The data bits into the low EDiiAC (120b), accessed when a signal S.sub.-- A[3] is a 0, are logically connected in reverse order, i.e. SD.sub.-- L[0] is connected to pin SD63, SD.sub.-- L[1] to pin SD62, SD.sub.-- L[63] to pin SD0. The parity bits are also reversed to keep the parity bits with their corresponding byte of data. This reversal of bits MUST be taken into account by any software that does diagnostic reads and writes of the EDiiACs (120).
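By way of illustration only, the bit reversal that diagnostic software must account for can be sketched as follows. The sketch assumes that the 64 data bits and 8 parity bits are held as plain integers and that parity bit k covers data byte k; the function names are hypothetical and are not part of the described hardware.

```python
def reverse_bits(value: int, width: int) -> int:
    """Return `value` with its lowest `width` bits in reverse order."""
    result = 0
    for i in range(width):
        if value & (1 << i):
            result |= 1 << (width - 1 - i)
    return result


def low_ediiac_view(data: int, parity: int) -> tuple:
    """Map a 64-bit S_D word and its 8 parity bits to low-EDiiAC pin order.

    SD_L[i] lands on pin SD(63 - i). The parity bits are reversed the same
    way, so each parity bit stays with its byte (assumption: parity bit k
    covers data byte k).
    """
    return reverse_bits(data, 64), reverse_bits(parity, 8)
```

Because the mapping is its own inverse, applying `low_ediiac_view` to a value read back from the low EDiiAC recovers the original bus ordering.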
The TLCC (118) is designed to operate correctly with several different types of DRAMs. It is capable of supporting both the 1K and 4K refresh versions of 16 MBit DRAMs. The 4K refresh DRAMs use 12 row address bits and 8 column bits to address the DRAM cell. The 1K refresh parts use 10 row and 10 column bits. To allow the use of either DRAM, row address lines A10 and A11 are driven out on A8 and A9 during the column address phase. These bits are ignored by the 4K refresh components in the column address phase, and the A10 and A11 lines are No Connects on the 1K refresh DRAMS. The TLCC (118) also supports DRAMs that use either 1 or 2 Write Enables (WE). This can be done because the minimum access size for the DRAMs is a 64 bit double word. Therefore, the two WE lines for each DRAM can be tied together. On DRAMs that use a single WE, the extra WE is a No Connect.
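The address multiplexing just described may be sketched as follows, purely as an illustration. The sketch assumes a 20-bit cell address split into a 12-bit row value and an 8-bit column value, with pin A0 as the least significant bit; the function names and the "4K"/"1K" labels are hypothetical.

```python
def dram_pins(row12: int, col8: int) -> tuple:
    """Return (row_phase, column_phase) values driven on pins A11..A0.

    Row phase: all 12 row bits on A11..A0.
    Column phase: 8 column bits on A7..A0, with row bits 10 and 11
    repeated on A8 and A9.
    """
    row_phase = row12 & 0xFFF
    col_phase = (col8 & 0xFF) | (((row12 >> 10) & 0x3) << 8)
    return row_phase, col_phase


def cell_seen_by(part: str, row12: int, col8: int) -> tuple:
    """(row, column) as latched by a '4K' or '1K' refresh part."""
    row_phase, col_phase = dram_pins(row12, col8)
    if part == "4K":
        # 12 row bits, 8 column bits; A8 and A9 are ignored in the column phase.
        return row_phase, col_phase & 0xFF
    if part == "1K":
        # 10 row bits, 10 column bits; A10 and A11 are No Connects.
        return row_phase & 0x3FF, col_phase & 0x3FF
    raise ValueError(part)
```

Under these assumptions either part type sees a distinct (row, column) pair for each of the 2^20 cell addresses, which is why the same drive scheme can serve both.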
CPU Complex
The daughterboard (58) contains two CPU complexes. Each complex consists of an 88110 CPU (110), 88410 Level 2 Cache Controller (112) and 8 67D709 128K.times.9 SRAMs (114). The 88110 and 88410 are implemented using 299 and 279 PGA's (Pin Grid Arrays) respectively. The SRAMs are 32 pin PLCC's and are mounted on both sides (top and bottom) of the daughterboard (58).
The SRAMs (114) are 67D709 SRAMs that have two bidirectional data ports which simplifies the net topology for data flow from the memory system to the processor. One data port is used to transfer data to/from the 88110 on the P.sub.-- D bus (140a-b), the other data port connects the two SRAM complexes together and also connects to the TLC muxes and either the CI or the CI transceivers on the S.sub.-- D bus (126). The board (58) is laid out so that the S.sub.-- D bus (126) is less than 8.5" in length. This length restriction allows the bus (126) to be operated without any termination and still transfer data in a single 20 ns cycle. The P.sub.-- D bus (140) is a point-to-point bus between the SRAMs (114) and a single 88110 (110). This bus is approximately 6" long.
The control signals for the SRAMs (114) are driven by the 88410 (112) for all accesses. To provide the best timing and signal integrity for all of these nets, they are routed using a "tree" topology. This topology places each of the 8 loads at an equal distance from the 88410 (112a-b), which helps to prevent undershoot and edge rate problems. The exception to this topology is R.sub.-- WE.sub.-- N[7:0] lines which are point-to-point from the 88410 (112) to the SRAMs (114). These use 22 ohm Series Resistors to control the edge rate and undershoot (not shown).
To prevent Write-through operations from occurring on the System bus a P.sub.-- WT.sub.-- N pin on the 88110 (110) is left disconnected, and the corresponding pin on the 88410 (112) is pulled up. To help alleviate some hold time issues between the CI ASIC (124) and the Cache RAMs, the Cache RAM clocks are skewed to be nominally 0.2 ns earlier than the other 50 MHz clocks on the board (58).
Clocks
The daughterboard (58) receives two PECL differential pairs from the motherboard (52) as its source clocks (not shown). One of the pairs is the 50 MHz System Clock and the other is the 12.5 MHz test/scan clock. Each of the two clocks is buffered and distributed as required to the devices on the daughterboard (58). The clock distribution scheme on the daughterboard (58) matches that used on the motherboard (52) to minimize overall skew between motherboard (52) and daughterboard (58) components. Differential PECL is also used to minimize the skew introduced by the distribution nets and logic.
All etch lengths for each stage of clock signal distribution tree are matched to eliminate skew. There are a couple of exceptions to this. The clocks that are driven to the 2nd Level Cache RAMs (114) are purposely skewed to be 500 ps earlier than the other 50 MHz clocks. This is done to alleviate a Hold time problem between the CI ASIC (124) and the SRAMs (114) when the CI ASIC is writing to the SRAMs (line fill).
JTAG
The daughterboard (58) has a single IEEE 1149.1 (JTAG) scan chain that can be used both for Manufacturing and Power-Up testing, and scan initialization of the CI (124) and TLCC (118) ASICs. The EDiiACs (120), 88110's (110) and 88410's (112) all implement the five wire version of the JTAG specification, but will be operated in the 4-wire mode by pulling the TRSTN pin high. The CI (124), TLCC (118), and board level JTAG logic all implement the four wire version. A TCK signal is generated and received by the clock distribution logic. The devices in the chain are connected in the following order: CI (124) → Lo EDiiAC (120a) → Hi EDiiAC (120b) → TLCC (118) → TLC Address Latch (142) → 88110 A (110a) → 88410 A (112a) → 88110 B (110b) → 88410 B (112b) → SEEPROM (144).
SEEPROM
A Serial EEPROM (144) is used on the daughterboard (58) to provide a non-volatile place to store important board information, such as Board Number, Serial Number and revision history. The SEEPROM chosen does not have a true JTAG interface, therefore it cannot be connected directly into the scan chain. Instead, a JTAG buffer 74BCT8373 (not shown) is used to provide the interface between the two serial protocols.
System ASICs
Much of the functionality effected in the CCNUMA system according to the invention is implemented in ASICs, as generally described hereinbefore, and more particularly described hereinafter.
PI ASIC
In monitoring PIBUS-to-PIXBUS traffic, the PI ASIC determines when some node starts a tenure on the PIBUS by observing the request lines of all the nodes, and calculating when the bus is available for the next requester. The PI ASIC(s) (of which there are four, 64a-d, and which may be referred to interchangeably as "PI") have responsibility for examining all traffic on the PIBUS (56), and responding to specific operations that it is involved in. The PI determines when a transfer is started by monitoring the PIBUS request information. There are three different ways that an operation can be decoded as targeted to a particular PI's node. These are: Node-field Bit Compare, ID Originator Node Parsing, and Address Decode.
The first beat (i.e. data transfer during one system clock cycle) of a transaction packet (also known as a Header beat) is always either a node type or an address type. If the first beat is a node type then the second beat is always an address type. Information in the operation field determines which combination of decode mechanisms to use.
If the first beat is a node type, then this transfer has come from a memory controller's (82) directory control logic. Transfers require snooping local to all nodes which have their respective bit set in the 16-bit node field. If the bit is set, the PI (64) is responsible for requesting the PIXBUS (66) and forwarding the transfer inward.
If the first beat is address type, then the operation field is parsed to determine whether to look at the requester ID or the address. If the first beat operation field implies the requester ID match the PI's node ID register, then the PI is responsible for requesting the PIXBUS and forwarding the transfer inward.
If the first beat is address type, and the command field does not imply the requester ID compare, then the address is parsed to determine if the PI's node is the target of the transfer. If the physical address range compare results in a match, then the PIXBUS (66) is requested, and the transfer is forwarded inward.
If the address range compare results in a match for the control, internal devices, or I/O channel mappings, the PIXBUS is requested and the transfer is forwarded inward.
Address decode consists of five range compares. These range compares are based on boundaries which are initialized at powerup. The memory map for the illustrative embodiment of the multiprocessor system according to the invention is shown in FIG. 4.
The global resource space (150) resides in the top 4 MB of the 32-bit address range. It is contiguous. Only one node (i.e. motherboard) in the system is allowed to respond to Global Space access. Global Space (150) contains resources such as PROM, DUARTs, boot clock, and a real time clock (RTC, not shown). A Valid bit in an address decoder will be used to determine which node currently owns the Global Space.
Directly below the global resource space is 4 MB of Software Reserved area (154) and 3 MB of unused memory space (156). Below the Software Reserved Space is 1 MB of Local Control Space Alias (158). It is used to access node local control space without having to know specifically which node it is accessing. This function is implemented in the CI ASIC (124), which converts any address issued by a processor (110) in the local control space alias (158) into an address in that node's control space.
The Per-JP Local Resources (160) follow the Local Control Space Alias segment. Per-JP Local Resources include 88410 (112) flush registers, a WHOAMI register used to identify a respective node, per-JP programmable interval timer (PIT), per-JP Interrupt registers, and cross interrupt send registers.
The next segment is the 16 MB Control space (162). Control Space is evenly partitioned over 16 nodes, so the minimum granularity for decoding of incoming addresses is 1 MB.
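The 1 MB decode granularity can be illustrated with a short sketch. The base address of Control Space is not given here, so `control_base` is a caller-supplied assumption, as is the assignment of 1 MB slices to nodes in ascending order; the function name is hypothetical.

```python
from typing import Optional

MB = 1 << 20
CONTROL_SPACE_SIZE = 16 * MB   # 16 MB shared evenly by 16 nodes


def control_space_node(addr: int, control_base: int) -> Optional[int]:
    """Return the node whose 1 MB slice contains `addr`, or None.

    Assumption: slice n (counting up from `control_base`) belongs to node n.
    """
    offset = addr - control_base
    if 0 <= offset < CONTROL_SPACE_SIZE:
        return offset // MB        # equivalent to offset >> 20
    return None
```

For instance, with a placeholder base of 0x4000_0000, the address 0x4030_0010 falls in the slice of node 3, and only the bits above bit 19 take part in the decision.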
The next segment used is the 16 MB of Third Level Cache (TLC) Tag Store (166). The TLC maps addresses into this space to allow simple access for prom initialization and diagnostics. JP generated addresses in this range will not appear beyond the bus on which the TLC resides (i.e. the CI (124) will not pass these addresses to the CI.sub.-- BUS (130),(126)). Therefore, the PI ASIC (64) will not have to do any special address decode for this address range.
Directly below Control Space (150) is the 64 MB dedicated to the integrated devices (168). The PI ASICs (64) will have a 2 MB granularity while the GG ASICs (78) will have a 1 MB granularity. Integrated Device space must be contiguous on each node. Holes are allowed between node assignments.
I/O channel space (172-174) exists between the highest physical memory address and the lower limit of the integrated devices space in the address range E000.sub.-- 0000 to F7FF.sub.-- FFFF. It must be contiguous on each node. Holes are allowed between node assignments. It has a 32 MB granularity. It is typically used for VME (Versa Module Eurobus) I/O.
Physical memory (176-180) must be contiguous on each node. Holes are allowed between node assignments. However, some software may not allow such holes. Physical memory has a granularity of 128 MB. The architecture of the present system is set up to require that one node in the system contain modulo 128 MB of memory starting at address 0 (bottom of memory).
Incoming PIBus Transfer (PIBus to PIXbus)
The third cycle of a PIBUS transfer is the reply phase. This allows one cycle for decoding of the address/node information presented in the first beat. The interpretation of these pins differs between node and addr type first beats.
If the first beat is a node type, then this operation is snoopable. Under that condition, every PI (64) whose Node ID matches its respective node field bit in the node beat, and which is able to accept the transfer (P-TRANS queue not full), must assert PI.sub.-- RCVR.sub.-- ACK.sub.-- N. If a PI's Node ID matches its respective node field bit and the PI's P-TRANS queue is full, the PI must assert PI.sub.-- RSND.sub.-- N. If no PI.sub.-- RCVR.sub.-- ACK.sub.-- N or PI.sub.-- RSND.sub.-- N is asserted during the reply cycle, this is a system fatal error, and must be reported as such.
To ensure that none of the target PI ASICs (64) forwards the transfer inward (onto the PIXBUS (66)) until all targets receive a complete transfer, the target PI ASICs (64) will wait one cycle after the reply to either request the PIXBUS (66) or to discard the transfer. All target PI ASICS (64) must discard the transfer if there was a PI.sub.-- RSND.sub.-- N in the reply phase.
If the first beat was address type, then this operation is not snoopable. Therefore, there is only one intended target, and only the intended target is to assert either PI.sub.-- RCVR.sub.-- ACK.sub.-- N or PI.sub.-- RSND.sub.-- N. If no PI.sub.-- RCVR.sub.-- ACK.sub.-- N or PI.sub.-- RSND.sub.-- N is asserted during the reply cycle, Low priority operation types will be transformed into a NACK type operation while other types will result in a fatal error, since it implies there was no target node responding. In addition, if the intended target observes PI.sub.-- RSND.sub.-- N asserted without being the source of PI.sub.-- RSND.sub.-- N, this is a fatal system error, since only one node can respond to an address type beat.
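The reply-phase rules for node type and address type first beats can be summarized in a small behavioral sketch. It is a simplified model, not the ASIC logic: the enumeration, the function names, and the outcome strings are hypothetical, and the case of more than one responder to an address type beat is modeled simply as a fatal error.

```python
from enum import Enum


class Reply(Enum):
    NONE = 0       # PI is not a target and drives nothing
    RCVR_ACK = 1   # models PI_RCVR_ACK_N asserted
    RSND = 2       # models PI_RSND_N asserted


def pi_reply(targeted: bool, queue_full: bool) -> Reply:
    """Reply driven by one PI in the reply cycle."""
    if not targeted:
        return Reply.NONE
    return Reply.RSND if queue_full else Reply.RCVR_ACK


def resolve(replies: list, node_type: bool, low_priority: bool = False) -> str:
    """Combine the replies of all PIs into one outcome for the transfer."""
    acks = replies.count(Reply.RCVR_ACK)
    rsnds = replies.count(Reply.RSND)
    if acks == 0 and rsnds == 0:
        # Nobody answered: only a LOW address type operation becomes a NACK.
        return "nack" if (not node_type and low_priority) else "fatal"
    if not node_type and acks + rsnds > 1:
        return "fatal"      # an address type beat has exactly one target
    if rsnds:
        return "discard"    # every target drops the transfer
    return "forward"        # targets may request the PIXBUS
```

The "discard" outcome reflects the rule that all targets must drop a transfer when any PI.sub.-- RSND.sub.-- N is seen in the reply phase.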
Since the node field of a command-node beat is only used to parse operations incoming from a PIBUS (56), it is not necessary to forward that beat to the node's PIXBUS (66). All incoming node type beats will be dropped when placing the transfer in the P-Transaction Queues.
Note that all of the command information of the address type beat is identical to the command of the node type beat, and an address type is sent with every packet.
PIBUS-to-PIXBUS Queue & Buffer Selection
There are three PIBUS incoming queues in the PI (HI, MED, LOW). Header beat operation fields are parsed to determine which queue they should be sent to. The reason that there are three queues with different priorities is to order incoming requests and to promote forward progress. This is accomplished by ordering the completion of in-progress operations within the system ahead of new operations that will inject additional traffic into the system.
The HI priority queue is dedicated to operations that have made the furthest progress, and can potentially bottleneck the memory system and prevent forward progress of the operations that have already referenced memory on some module. Examples are CB.sub.-- INV.sub.-- RPLY, CB.sub.-- RPLY and WB (e.g. copyback-invalidate-reply, copy-back reply and write-back operations, respectively).
The MED priority queue is dedicated to operations that have made the furthest progress, and will result in completion or forward progress of the operations that have already referenced memory on some module. Examples are INV.sub.-- CMD and RD.sub.-- S.sub.-- REPLY (e.g. invalidate and read-shared-reply).
The lower priority queue is dedicated to those operations that when serviced will cause the injection of more, higher priority traffic into the system. These are operations which have not yet been acted upon by memory such as RD.sub.-- S and CI.sub.-- WR (e.g. read.sub.-- shared and cache inhibited.sub.-- write). Since the ORB (98) determines which queue gets granted a transfer on the PIXBUS there may be cases where the ORB allows some lower priority transfers to go ahead of higher priority transfers.
Requests are indicated when the PI asserts the signals PI.sub.-- X.sub.-- HI.sub.-- REQ.sub.-- N, PI.sub.-- X.sub.-- MED.sub.-- REQ.sub.-- N or PI.sub.-- X.sub.-- LOW.sub.-- REQ.sub.-- N for a high, medium or low request respectively. A PI (64) will initiate a request only if there is a valid entry in one of the queues.
Once a particular high, medium or low request has been made it remains asserted until the ORB (98) grants the PI (64) a bus tenure of that priority. Other ungranted requests will remain asserted. For high and low requests, de-assertion occurs in the cycle after receiving the grant even if there are more entries of that priority in the queue. The medium request will remain asserted if there are more mediums in the queue.
A new high or low request can only be made if the previous high or low transfer did not have a MC.sub.-- RESEND.sub.-- N signal asserted in the fourth cycle of the transfer. This signal represents a limitation that prevents the PI from streaming transfers of HI or LOW priority through the PI. However, full PIXBUS bandwidth can be utilized by the PI if there are two transfers of different priority ready to be transmitted to the PIXBUS. Also, the other PIs on the PIXBUS may request independently of each other, so one of the four PIs (64) dropping its request will have little impact on the PIXBUS bandwidth utilization.
A PI (64) will change the amount of time it takes to re-request the PIXBUS (66) on a resend. A backoff algorithm is used to progressively keep it from re-requesting the bus for longer periods of time. This helps prevent a PI (64) from wasting PIXBUS cycles resending operations to ASICS that recently have had full input queues. The progression of backoff time is as follows: 0,1,3,7,15,16,18,22,30,31,1,5,13, . . . . This is done by using a 5-bit decrementor and a starting value for each subsequent backoff is increased from the previous value by 1,2,4,8,1,2,4,8, . . . . The decrementor gets cleared if no resend is seen for the priority being backed-off or if a resend is seen for another priority. There is only one decrementor, and it always keeps track of the backoff needed for the last priority to get a resend.
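The stated progression can be reproduced with a few lines of code, given here as an illustrative sketch only. It assumes that the start value simply wraps modulo 32 because the decrementor is 5 bits wide; the generator name is hypothetical.

```python
from itertools import cycle, islice


def backoff_starts():
    """Yield successive backoff start values for a 5-bit decrementor."""
    value = 0
    for step in cycle((1, 2, 4, 8)):
        yield value
        value = (value + step) & 0x1F   # 5-bit wraparound


print(list(islice(backoff_starts(), 13)))
# [0, 1, 3, 7, 15, 16, 18, 22, 30, 31, 1, 5, 13]
```

The wrap from 31 to 1 follows from adding 2 to 31 and keeping only the low five bits.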
PIXBUS Grant
Granting of PIXBUS (66) tenure is determined by the ORB (98) through assertion of the ORB.sub.-- GNT.sub.-- PI.sub.-- HI, ORB.sub.-- GNT.sub.-- PI.sub.-- MED, and ORB.sub.-- GNT.sub.-- PI.sub.-- LOW input signals. The ORB (98) will only grant tenure if the PI asserts the PI.sub.-- X.sub.-- HI.sub.-- REQ.sub.-- N, PI.sub.-- X.sub.-- MED.sub.-- REQ.sub.-- N, or PI.sub.-- X.sub.-- LOW.sub.-- REQ.sub.-- N signal, indicating, respectively, a high, medium or low priority request. Once granted the PI will select the HI, MED or LOW queue that corresponds to the grant. The PI will then transfer the oldest operation of that priority which the queue holds.
The ORB (98) may grant any PI (64) tenure without regard to any PI, PIXBUS, queue status, except when a PI (64) is making a low priority request while asserting PI.sub.-- CS.sub.-- REQ. In this case, the ORB (98) must respect the requesting PI's assertion of busy, via a PI.sub.-- X.sub.-- MED.sub.-- BUSY.sub.-- N queue status and not grant the requesting PI (64). PI.sub.-- CS.sub.-- REQ will be asserted anytime the PI (64) holds a low priority PI control space access operation in the queue. Low priority PI requests that are granted when PI.sub.-- CS.sub.-- REQ is asserted will result in a low priority queue transfer to the medium priority queue for control space access processing.
To ensure system coherency, it is necessary that the PI ASICs (64) prevent any IM type of operation whose cache block address matches any INV.sub.-- CMD or RD.sub.-- INV.sub.-- RPLY (i.e. invalidate or read.sub.-- invalidate.sub.-- reply) from being forwarded to a memory system. This prevention is called squashing. Squashing in the PI ASIC (64) is achieved by transforming such operation types to be a NOP type (i.e. no operation), where it will be treated as a NOP on the PIXBUS.
Any operations currently existing in the PI queues that are Intent to Modify (IM) type operations are squashed if the current incoming INV.sub.-- CMD or RD.sub.-- INV.sub.-- RPLY address matches any of the three possible low priority header buffer entries with any such operations. Any such operations which are currently being decoded are squashed if the associated cache block address matches any INV.sub.-- CMD or RD.sub.-- INV.sub.-- RPLY address within the other header buffer or those which are currently being decoded.
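Squashing of queued operations can be pictured with the following sketch. It is a simplified software model under stated assumptions: the cache block size is a parameter (64 bytes is only a placeholder), the operation name "RD_IM" stands in for any Intent to Modify type operation, and the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Op:
    kind: str            # operation type, e.g. "RD_S" or the placeholder "RD_IM"
    addr: int            # physical address carried in the header
    is_im: bool = False  # True for Intent to Modify type operations


def squash(low_headers: list, incoming: Op, block_bytes: int = 64) -> int:
    """Turn matching IM operations into NOPs; return how many were squashed.

    `low_headers` models the low priority header buffer entries and
    `block_bytes` is an assumed cache block size.
    """
    if incoming.kind not in ("INV_CMD", "RD_INV_RPLY"):
        return 0
    target_block = incoming.addr // block_bytes
    squashed = 0
    for op in low_headers:
        if op.is_im and op.addr // block_bytes == target_block:
            op.kind = "NOP"      # treated as a NOP on the PIXBUS
            op.is_im = False
            squashed += 1
    return squashed
```

Only the operation type is rewritten; the entry keeps its place in the queue, which matches the description of the squashed operation travelling onward as a NOP.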
Nodes (motherboards) that just received IM operations that resulted in a squash must assert a PI.sub.-- RSND.sub.-- N signal on the PIBUS to force potential receivers of such operations to squash any possible IM operations just received.
There are two different modes of operation for PIBUS transfers involving PI.sub.-- RSND.sub.-- N (or PI.sub.-- RCVR.sub.-- ACK.sub.-- N), i.e. resend or receiver acknowledge, responses.
If the operation is targeted at only one PIBUS resident (i.e. the first beat of transfer is an address transfer), then only the targeted PIBUS interface is allowed to issue a PI.sub.-- RSND.sub.-- N (or PI.sub.-- RCVR.sub.-- ACK.sub.-- N) response. Therefore, when the PIBUS interface receives an address, and that address is resolved to reside on the node, it can be forwarded immediately. This is a non-broadcast type operation.
If the operation is potentially a multi-target (i.e. the first beat of transfer is a node bit field), then any targeted PIBUS interface is allowed to issue a PI.sub.-- RSND.sub.-- N (or PI.sub.-- RCVR.sub.-- ACK.sub.-- N) response. However, since the operation cannot be operated on until all parties involved are able to accept the operation (no one asserts PI.sub.-- RSND.sub.-- N), it cannot be forwarded immediately. This is a broadcast type operation.
PIXBUS Arbitration
PIXBUS (66, 68, 72, 76, 80, 88, 92, 94 of FIG. 2) arbitration takes three different forms, one for each of the incoming queue types. HI (high) priority arbitration takes precedence over MED (medium) priority arbitration. MED priority arbitration takes precedence over LOW (low) priority arbitration. MED priority arbitration uses a deli-counter ticket style mechanism to support the time ordering of transactions. HI and LOW priority arbitration are not confined to granting based on time ordering.
Requests are indicated when the PI (64) asserts any of the signals PI.sub.-- X.sub.-- HI.sub.-- REQ.sub.-- N, PI.sub.-- X.sub.-- MED.sub.-- REQ.sub.-- N or PI.sub.-- X.sub.-- LOW.sub.-- REQ.sub.-- N for a HI, MED or LOW request respectively. The ORB (98) array is responsible for servicing requests from the PI with a fairness algorithm. The ORB (98) array bestows bus tenure, i.e. issues a grant, to the PI (64) by driving an ORB.sub.-- GNT.sub.-- PI.sub.-- HI, ORB.sub.-- GNT.sub.-- PI.sub.-- MED and/or ORB.sub.-- GNT.sub.-- PI.sub.-- LOW signal.
For the MED priority input queue, the ORB (98) array maintains a Deli Count or "ticket" assigned upon the arrival of a remote MED priority type access targeted to the node. This arrival is indicated to the ORB (98) by the receiving PI (64) asserting a PI.sub.-- MED.sub.-- CUSTOMER signal. This indicates to the ORB (98) array that the PI (64) has utilized this ticket. The ORB array will then increment the ticket value, wrapping if necessary, for the next cycle. The actual ticket values are maintained in the ORB. The PI's PI.sub.-- ORDERED.sub.-- OP output is asserted upon the enqueuing of a CI.sub.-- RD, CI.sub.-- WR or CI.sub.-- WR.sub.-- UNLK (i.e. cache-inhibited-read, write or write unlock) low priority operation type or INV.sub.-- CMD, or RD.sub.-- INV.sub.-- RPLY (i.e. invalidate or read.sub.-- invalidate.sub.-- reply) medium priority operation type into the PI queue(s). The PI.sub.-- ORDERED.sub.-- OP signal is used by the ORB (98) to give special priority to these types of operations when one of the PIs (64) has a MED priority operation that needs special ordering.
A PI.sub.-- NEW.sub.-- CUSTOMER.sub.-- N output is asserted by the PI on any enqueuing of a MED priority or LOW operation into the queue.
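The deli-counter idea behind MED priority arbitration may be illustrated with the following sketch. It is a generic ticket dispenser written under assumptions: the ticket width, the class name, and the method names are hypothetical, and the sketch does not model the special handling signalled by PI.sub.-- ORDERED.sub.-- OP.

```python
class DeliCounter:
    """Ticket dispenser that serves MED requests in arrival order."""

    def __init__(self, ticket_bits: int = 4):
        self.size = 1 << ticket_bits   # assumed ticket width
        self.next_ticket = 0           # handed to the next arriving customer
        self.now_serving = 0           # ticket that is granted next
        self.waiting = {}              # ticket value -> PI identifier

    def customer(self, pi_id: int) -> int:
        """Model a PI signalling a MED arrival: take a ticket, then advance."""
        if len(self.waiting) == self.size:
            raise OverflowError("more outstanding tickets than ticket values")
        ticket = self.next_ticket
        self.waiting[ticket] = pi_id
        self.next_ticket = (ticket + 1) % self.size   # wrap if necessary
        return ticket

    def grant(self):
        """Grant the oldest outstanding MED request, or None if none waits."""
        if self.now_serving not in self.waiting:
            return None
        pi_id = self.waiting.pop(self.now_serving)
        self.now_serving = (self.now_serving + 1) % self.size
        return pi_id
```

Whatever order the PIs later request in, grants leave in the order the tickets were taken, which is the time ordering property the MED queue relies on.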
A ONE.sub.-- TO.sub.-- GO signal is asserted by the PI (64) when it knows that the next beat is the last beat of the packet for which it was granted. The ORB (98) can use this signal to determine when the tenure is about to end.
An X.sub.-- XTEND signal is asserted by the PI (64) in all cycles it expects to have bus tenure after the first beat transferred. The PIXBUS receiver can use this signal to determine when the tenure has ended.
The PI (64) removes Medium priority operations from its queue in the cycle after its operation transfer was granted since there is no MC.sub.-- RESEND.sub.-- N possible for medium priority transfers. That is, the memory controller, as described in detail hereinafter, will not resend medium priority data transfers. Any data associated with the Medium operation transfer is removed as it is transferred. High and Low priority operations cannot be removed until after the MC.sub.-- RESEND.sub.-- N signal is checked in the reply cycle. If there is a resend, the transfer completes as it would without the resend. The only difference is that the operation information and associated data is retained in the PI (64) for re-transmitting when re-granted.
PIXBUS-to-PIBUS Traffic
The PI (64) determines when a transfer starts on the PIXBUS by observing an X.sub.-- TS signal which accompanies the first beat of a packet transfer.
The PI (64) is responsible for examining all traffic on the PIXBUS, and responding to specific operations that it is involved in. There are three different ways that an operation can be decoded as targeted to a particular PI. These are: RMT.sub.-- SNP Bit Compare, Requester ID Node Compare and Address Decode.
The first beat of a transaction packet (also known as a Header beat) is always either a node type or an address type. If the first beat is a node type and an RMT.sub.-- SNP bit is set, then the second beat is always an address type. Otherwise, it is just an address type. Information in an operation field determines which combination of decode mechanisms to use. These are summarized in the Table of PIXBUS Operation Decode and Queue Assignment, FIG. 6. PIXBUS operations are the same format as those of the PI BUS (56). The only exception is that inbound node type operations have their node headers stripped. Inbound node type operations will not have the RMT.sub.-- SNP bit set.
If the first beat is a node type, then this transfer has come from a memory controller's directory control logic. Transfers require snooping local to all nodes which have their respective bit set in a 16-bit node field. To distinguish between a snoop which was generated on this node and one which has already been forwarded to the PIBUS, the RMT.sub.-- SNP bit is used. If the bit is set, and this beat is a node type, then the PI (64) is responsible for requesting the PIBUS and forwarding the transfer outward. If the RMT.sub.-- SNP bit is not set, and this beat is a node type, then the PI (64) will only check the packet's parity.
If the first beat is an address type, then the operation field is parsed to determine whether to look at the requester ID or the address fields. This determination is summarized in the Table of FIG. 6.
If the first beat is an address type, and the operation field implies the requester ID match the PI's node ID register, then the PI (64) is responsible for requesting the PIBUS and forwarding the transfer outward. If the first beat is an address type, and the command field does not imply the requester ID compare, then the address is parsed to determine if the PI's node is the target of the transfer. If the physical address range compare DOES NOT result in a match, then the PIBUS is requested, and the transfer is forwarded outward. If the address range compare DOES NOT result in a match for the control, internal devices, or I/O channel mappings, the PIBUS is requested and the transfer is forwarded outward. If the address range compare DOES result in a match for the PI control space mappings and an ASIC ID matches, the PIBUS is requested and the transfer is forwarded outward. This match is indicated with a PI.sub.-- OUR.sub.-- ASIC signal. Address decode for the PIXBUS is the same as the PIBUS address decode.
PI BUS Selection
If a PIXBUS operation needs to be forwarded to the PIBUS the four PIs must determine which PI (64) will accept the operation. This filtering process is done using information from the address beat of a transaction header. For non-PI control space operations, address bit 19 is XORed with address bit 7 and address bit 18 is XORed with address bit 6. The resulting two bit code is compared with the codes allowed by the ADDR.sub.-- 76.sub.-- EN configuration bits. If that code is allowed by the PI (64) the operation will be accepted by the PI. For PI control space operations, only certain address bits, i.e. bits 7 and 6, are used as the two bit code.
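The two bit selection code can be computed as sketched below. The sketch is illustrative and rests on assumptions: address bit 0 is taken to be the least significant bit, the first XOR result is taken as the upper bit of the code, and ADDR.sub.-- 76.sub.-- EN is modeled as a 4-bit mask in which bit n enables code n. The function names are hypothetical.

```python
def pi_select_code(addr: int, pi_control_space: bool = False) -> int:
    """Return the two bit code used to pick one of the four PIs."""
    b7 = (addr >> 7) & 1
    b6 = (addr >> 6) & 1
    if pi_control_space:
        return (b7 << 1) | b6          # only bits 7 and 6 are used
    b19 = (addr >> 19) & 1
    b18 = (addr >> 18) & 1
    return ((b19 ^ b7) << 1) | (b18 ^ b6)


def pi_accepts(addr: int, addr_76_en: int, pi_control_space: bool = False) -> bool:
    """True if this PI's enable mask (assumed 4 bits) allows the code."""
    code = pi_select_code(addr, pi_control_space)
    return bool((addr_76_en >> code) & 1)
```

With each of the four PIs enabled for a different code, exactly one PI accepts any given operation; folding bits 19 and 18 into bits 7 and 6 spreads both nearby and widely separated addresses across the PIs.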
There are three PIXBUS incoming queues in the PI (HI, MED, LOW). Header beat Operation fields are parsed to determine which queue they should be sent to. The three queues have different priorities. Anything residing in the HI priority queue has priority over everything in the MED & LOW priority queue. Anything residing in the MED priority queue has priority over everything in the LOW priority queue. The reason that there are three queues with different priorities is to order incoming requests and to promote forward progress. This is accomplished by ordering the completion of in-progress operations within the system ahead of new operations that will inject additional traffic into the system.
The HI priority queue is dedicated to operations that have made the furthest progress, and can potentially bottleneck the memory system and prevent forward progress of the operations that have already referenced memory on some module. Examples are CB.sub.-- INV.sub.-- RPLY, CB.sub.-- RPLY, and WB, as discussed hereinbefore.
The MED priority queue is dedicated to operations that have made the furthest progress, and will result in completion or forward progress of the operations that have already referenced memory on some module. Examples are INV.sub.-- CMD and RD.sub.-- S.sub.-- REPLY.
The lower priority queue is dedicated to those operations that when serviced will cause the injection of more, higher priority traffic into the system. These are operations which have not yet been acted upon by memory such as RD.sub.-- S and CI.sub.-- WR.
All incoming packet transfers are put in their respective priority queues. The only exception is for CI.sub.-- RDs and CI.sub.-- WRs which are targeted to the PI's control space and received from the PI (64) itself. This is the case of remote PI control space access. In this case the low priority operation is put into the Medium queue instead of the Low queue. This is done to prevent deadlocking situations involving remote PI control space access.
PIBUS requests are asserted with the PI.sub.-- P.sub.-- REQ.sub.-- N<7:0> signals. Once granted the PI (64) must drop its request. New requests are only asserted when PIBUS arbitration logic allows a new window (See PIBUS Arbitration). There must be a valid queue entry in either the high, medium or low queue before the PI (64) will request the PIBUS. A request may be delayed if there is a resend reply on the PIBUS.
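Queue assignment, including the control space exception, may be sketched as follows. The table lists only the operation types named as examples in this description and is therefore incomplete; the dictionary, the function name, and the `remote_pi_control_space` flag are hypothetical.

```python
# Operation type -> queue, limited to the examples named in the text.
QUEUE_FOR_OP = {
    "CB_INV_RPLY": "HI", "CB_RPLY": "HI", "WB": "HI",
    "INV_CMD": "MED", "RD_S_REPLY": "MED",
    "RD_S": "LOW", "CI_RD": "LOW", "CI_WR": "LOW",
}


def select_queue(op: str, remote_pi_control_space: bool = False) -> str:
    """Pick the incoming queue for a header beat operation type."""
    queue = QUEUE_FOR_OP[op]   # raises KeyError for types not listed above
    if remote_pi_control_space and op in ("CI_RD", "CI_WR"):
        return "MED"           # promoted to avoid deadlock
    return queue
```

The promotion applies only to cache inhibited reads and writes aimed at PI control space; every other operation stays in the queue given by its type.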
Selection of which of the high, medium or low queue for output depends on the setting of a P.sub.-- OUT.sub.-- SHUF.sub.-- ARB state, and which queues contain valid entries. If P.sub.-- OUT.sub.-- SHUF.sub.-- ARB=0 then all valid high queue entries will get sent before all medium and low entries and all medium entries will get sent before all low entries. Priority will be ordered HI, MED, LOW.
If there is a resend reply on the PIBUS for an operation of a given priority then the PI (64) will shift its priority scheme to MED, LOW, HI and select the next valid priority operation for output next time. If there is also a resend reply for this operation then the PI (64) will shift again to LOW, HI, MED. If there is yet another resend reply the PI (64) will shift again to HI, MED, LOW and so forth until an operation is sent without a resend reply. Once sent the priority goes back to the original HI, MED, LOW priority scheme.
If the P.sub.-- OUT.sub.-- SHUF.sub.-- ARB=1, then a shuffling of the queue priority occurs like that of the shuffling done for PIBUS arbitration. For one operation the priority will be HI, MED, LOW, then the next will be MED, LOW, HI, then LOW, HI, MED, and back to HI, MED, LOW.
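The output queue selection described above can be illustrated with a minimal software sketch. It is not the PI ASIC logic itself: the class and method names (OutputSelector, pick, complete) are hypothetical, and the behavior on a resend reply when P.sub.-- OUT.sub.-- SHUF.sub.-- ARB=1 is not stated in the text, so the sketch simply assumes the order rotates after every operation in that mode.

```python
# Sketch of PI output queue selection. Names are illustrative assumptions.
ROTATIONS = [
    ("HI", "MED", "LOW"),
    ("MED", "LOW", "HI"),
    ("LOW", "HI", "MED"),
]


class OutputSelector:
    def __init__(self, shuf_arb):
        self.shuf_arb = shuf_arb  # models the P_OUT_SHUF_ARB state (0 or 1)
        self.rotation = 0         # index into ROTATIONS; 0 is HI, MED, LOW

    def pick(self, valid):
        """Return the first queue, in the current order, holding a valid entry."""
        for name in ROTATIONS[self.rotation]:
            if valid.get(name):
                return name
        return None  # no valid entry, so no PIBUS request is made

    def complete(self, resent):
        """Update the priority order after an operation has been sent."""
        if self.shuf_arb:
            # Assumption: shuffle mode rotates after every operation.
            self.rotation = (self.rotation + 1) % 3
        elif resent:
            # Fixed mode: each resend reply shifts the order by one step.
            self.rotation = (self.rotation + 1) % 3
        else:
            # Fixed mode: a clean send restores HI, MED, LOW.
            self.rotation = 0
```

With all three queues valid and P.sub.-- OUT.sub.-- SHUF.sub.-- ARB=0, three consecutive resend replies walk the selection through HI, MED, LOW and back to HI, and the first operation sent without a resend restores the original order.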
To ensure system coherency, it is necessary that the PI (64) ASICs prevent any intent to modify (IM) type of operation whose address matches any INV.sub.-- CMD or RD.sub.-- INV.sub.-- RPLY from being forwarded to a memory system. As discussed hereinbefore, this prevention is called squashing. Squashing in the PI ASIC is achieved by transforming the IM operation into a NOP type operation, which is then treated as a NOP on the PIXBUS.
Any IMs currently existing in the PI (64) queues are squashed if the current incoming INV.sub.-- CMD or RD.sub.-- INV.sub.-- RPLY address matches any of the three possible low priority header buffer entries with IMs. Any IMs which are currently being decoded are squashed if the IM address matches any INV.sub.-- CMD or RD.sub.-- INV.sub.-- RPLY address within the other Header buffer or those which are currently being decoded.
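As a rough illustration of squashing, the following sketch converts queued IM operations into NOPs when an incoming INV.sub.-- CMD or RD.sub.-- INV.sub.-- RPLY carries a matching address. The Op record, the squash function and the use of exact address equality are assumptions made for the example; the text does not give the comparison granularity used by the header buffers.

```python
from dataclasses import dataclass


@dataclass
class Op:
    kind: str     # e.g. "IM", "INV_CMD", "RD_INV_RPLY", "NOP" (labels are illustrative)
    address: int  # assumed to be the block address that is compared


# Operation types that force matching IMs to be squashed.
INVALIDATING = {"INV_CMD", "RD_INV_RPLY"}


def squash(low_queue, incoming):
    """Turn queued IM ops into NOPs if `incoming` invalidates their address.

    Returns the number of operations squashed.
    """
    if incoming.kind not in INVALIDATING:
        return 0
    count = 0
    for op in low_queue:
        if op.kind == "IM" and op.address == incoming.address:
            op.kind = "NOP"  # the entry stays queued but is treated as a NOP
            count += 1
    return count
```

The squashed entry is left in place rather than removed, which mirrors the statement that the operation is transformed into a NOP rather than dropped.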
Unlike the PIBUS-to-PIXBUS transfer, there is no required latency in requesting the PIBUS. This is because there are no PI (64) targeted PIX transactions which can be signalled to be resent. The ORB (98) will guarantee that there is always enough PIXBUS input queue space to accept a transaction which it grants onto the PIXBUS. The only exception to this rule is the memory controller (MC) input queue which can cause a MC.sub.-- RESEND. However, the transaction which is resent by the MC will never be a PI (64) targeted transaction and so it can be assumed that if a PI (64) detects a PIBUS bound transaction it will complete without a resend response.
PIBUS arbitration is based on a "Windowed-Priority" distributed arbitration with fairness. What this means is that there are specific times (windows) where the PI.sub.-- REQ.sub.-- P.sub.-- N (request) signals are sampled and then grants associated with each request are prioritized based on a pre-determined code known as the shuffle code.
Since this arbitration logic is distributed, each PIBUS requester knows the request status of all the other requesters on the bus. The local requester only needs to know if a particular grant is for itself or another requester.
The shuffle code used in the PI (64) is simply a 3-bit counter. It is initialized on reset with the lower three bits of a NODE ID value which is unique for each NODE. The NODE ID counter is also initialized at reset with the NODE ID. Shuffles are allowed if configured to do so, or after the first PIBUS transfer window and then both counters count up by one anytime all requests in a given window have been granted.
The PIs (64) will only assert new requests on these window boundaries. As PIs are granted within a window, the PI (64) must deassert the request that was made in that window. A simplified block diagram of the PI Arbitration Logic is shown in FIG. 8.
The shuffle code/counter (200) is used as a MUX select for each of the eight 8:1 multiplexers (202). Each 8:1 MUX has a specific permutation of request signals. The output of the multiplexers is connected to an 8-bit priority encoder (204). The 3-bit output of the priority encoder is compared against the NODE ID counter (206) output. If the shuffled prioritized encoded request matches the NODE ID count then the PI (64) is granted the PIBUS tenure.
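The windowed, shuffled arbitration can be approximated in software as follows. This is a simplified sketch under stated assumptions, not a reproduction of FIG. 8: the text does not spell out the permutation wired into each 8:1 MUX, so a plain rotation by the shuffle code is assumed, and a single shared shuffle count stands in for the per-node shuffle and NODE ID counters. The function names are hypothetical.

```python
def arbitrate(requests, shuffle_code):
    """Pick the winning node among the requesters sampled in a window.

    requests: 8-bit integer, bit n set means node n is requesting.
    shuffle_code: 3-bit value that rotates the priority order (assumed permutation).
    Returns the winning node number, or None if nobody is requesting.
    """
    for slot in range(8):                  # slot 0 has the highest priority
        node = (slot + shuffle_code) % 8   # assumed MUX permutation: a rotation
        if (requests >> node) & 1:
            return node
    return None


def grant_window(requests, shuffle_code):
    """Grant every request sampled in one window, then advance the shuffle code.

    Returns (grant order, next shuffle code).
    """
    order = []
    pending = requests
    while pending:
        winner = arbitrate(pending, shuffle_code)
        order.append(winner)
        pending &= ~(1 << winner)  # a granted requester deasserts its request
    return order, (shuffle_code + 1) % 8
```

Because every node evaluates the same function on the same sampled requests, each node can decide locally whether a given grant is its own by comparing the winner with its node number, which is the property the distributed scheme relies on.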
The PI.sub.-- ANY.sub.-- P.sub.-- GNT signal is used by the P.sub.-- SLV.sub.-- SM to know that a new PI (64) BUS transfer will begin next cycle.
The PI (64) ASIC will only enable one PI.sub.-- P.sub.-- REQ.sub.-- N<7:0> corresponding to the node number at which the PI (64) resides. All others will be configured as input only in normal mode operation.
The PI (64) expects an acknowledge (PI.sub.-- RCVR.sub.-- ACK.sub.-- N) in the third cycle of the transfer it originates. If there is no acknowledge for a low priority operation, then the PI (64) will create a NACK type packet back to the requester. For all other operation priorities a fatal error will result.
The PI (64) also expects a PI.sub.-- RSND.sub.-- N (if any) in the third cycle of the transfer it originates. Note that the PI (64) always sends the entire transfer to the PIBUS even if there is a PI.sub.-- RSND.sub.-- N.
The PI (64) removes an operation from its queue in the cycle after its operation transfer was acknowledged with no resend (PI.sub.-- RCVR.sub.-- ACK.sub.-- N=0, PI.sub.-- RSND.sub.-- N=1). If there is a resend, the transfer completes as it would without the resend. The only difference is that the operation info and associated data is retained (or converted to NACK type) in the PI (64) for re-transmitting when re-granted. If a PIBUS is deconfigured then all the PIs on that PIBUS must be deconfigured even if they are fully functional.
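The acknowledge and resend handling above amounts to a small decision table, sketched below. Both signals are active low, as the values in the text indicate. The function name and the returned labels are hypothetical, and the case of a missing acknowledge together with a resend is not described in the text, so the sketch assumes the missing acknowledge takes precedence.

```python
def transfer_outcome(priority, rcvr_ack_n, rsnd_n):
    """Classify the result of a PIBUS transfer from its third-cycle responses.

    priority: "HI", "MED" or "LOW".
    rcvr_ack_n, rsnd_n: sampled signal levels; 0 means asserted (active low).
    """
    acked = rcvr_ack_n == 0
    resend = rsnd_n == 0
    if not acked:
        # Assumption: a missing acknowledge is evaluated before any resend.
        return "NACK" if priority == "LOW" else "FATAL"
    if resend:
        return "RETAIN"   # keep the operation queued for re-transmission
    return "DEQUEUE"      # acknowledged with no resend: remove from the queue
```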
MEMORY CONTROLLER/MC ASIC
The memory system in the CCNUMA architecture according to the invention, illustrated in FIG. 9, is also implemented via an ASIC, referred to as a memory controller (MC) (220). Generally, the MC provides the interface to physical memory (222) for the multiprocessor system, and maintains memory system coherency by implementing a coherency directory (224) for memory. The MC comprises a plurality of functional elements that are described hereinafter.
The Memory Controller chip (MC) (82, FIG. 2) controls the execution of physical memory operations. This involves managing both the Directory which maintains system coherency and the memory data store DRAMs. The MC operates at 50 MHz, the standard system clock speed. It is capable of receiving a new packet every 20 ns until its queues are full. The MC is designed to operate on a split transaction, packetized bus based on the architecture defined herein. It is estimated that the MC needs to deliver 115 MB/sec of memory bandwidth for the system according to the invention. This includes a 30% overhead budget.
There is one MC ASIC per motherboard (52), controlling from 0 to 512 MegaBytes, or 1/2 a GigaByte, of local memory. The MC, illustrated in FIG. 10, processes memory transaction packets that are driven onto the MCBUS by the BAXBAR. The packets may have originated on any of the local busses or on the PIBUS. To ensure packet ordering needed for coherency, all packets affecting the same block address will always use the same PIBUS. The MC checks packet addresses to decode if they address near or far memory. The MC will accept only near memory packets. The MC accepts high and low priority packets and issues only medium priority packets. Packets issued by the MC can never be retried.
The MC has a four packet input queue (230) and four packet output queue (232). Only the packet header beats are enqueued in the MC. The data beats are enqueued in EDiiACs (described in detail hereinafter), which include the data queues (FIFOs) for the memory DRAM data store. The one exception to this is Local Register writes, which are entirely enqueued in the MC. Memory responses (both data and coherency commands) are driven onto the MCBUS as a packet. The MC (with the help of the EDiiACs) performs ECC error detection and correction on DRAM data and checks parity on MCBUS packets. There are two EDiiACs per MC. Each of the EDiiACs has a 64-bit data path and an 8-bit ECC path. When the DRAMs are read or written, the EDiiACs act in parallel to provide a 128-bit data path for the DRAMs. When the EDiiACs drive or receive data from the MUD.sub.-- BUS (i.e. MUD.sub.-- 1, MUD.sub.-- S, used to connect the BaxBar ASICs (70) to two EDiiAC ASICs (96)), they operate in series, each being active every other cycle. This provides a 64 bit data path to the MUD.sub.-- BUS and allows a data beat every cycle, even though each EDiiAC by itself can only drive one data beat every other cycle.
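The parallel-then-serial behavior of the two EDiiACs can be pictured with the short sketch below. It only models the data ordering, not timing or ECC. Which EDiiAC holds which half of the 128-bit DRAM word, and which one drives first, are not stated in the text; the sketch assumes device 0 holds the low 64 bits and drives first. The function name is hypothetical.

```python
MASK64 = (1 << 64) - 1


def serialize_beats(dram_words):
    """Convert 128-bit DRAM words into alternating 64-bit bus beats.

    Returns a list of (device, beat) pairs, one per bus cycle. Each device
    appears only every other cycle, yet the bus carries a beat every cycle.
    """
    beats = []
    for word in dram_words:
        low = word & MASK64            # assumed to be held by EDiiAC 0
        high = (word >> 64) & MASK64   # assumed to be held by EDiiAC 1
        beats.append((0, low))
        beats.append((1, high))
    return beats
```

Under these assumptions, a 4-beat read of the 128-bit data store yields eight consecutive 64-bit beats on the MUD.sub.-- BUS.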
The MC provides all the control for the EDiiACs and also provides the data store addresses, row address select (RAS), column address select (CAS) and other DRAM control signals.
MC Directory Manager
The MC includes a Directory Manager functional element that maintains coherency information on each block of physical memory. The information is stored in the directory which is implemented in DRAM. The directory indicates which system nodes (a motherboard is equivalent to a node) hold valid cached copies of memory blocks. It also indicates if a node has a modified version of a memory block and if a memory block is currently locked for the use of a single processor. For each packet that requests memory access, the Directory Manager will examine the corresponding directory information before allowing memory to be altered. When necessary to maintain coherency, the Directory Manager will issue invalidates and copyback commands. The Directory Manager will update the directory information before servicing the next memory request.
MC Directory
The directory that the directory manager manages maintains system coherency. It stores 11 bits of coherency information for every block of data. Each directory entry describes the state of one memory block (also called a cache line). The coherency information stored in the directory is at a node level. Coherency issues below the node level are the responsibility of the node itself. The directory state is stored in a combination of a Directory Store (DTS) and Copyback Contents Addressable Memory (Copyback CAM or CAM), which are described hereinafter.
For each memory access that the MC performs, it must look up the memory address in both the DTS and the CAM to determine the coherency state of the block. The state determines what response the MC will make to the memory request. A memory block can be in any of the five following states:
UNUSED. This state means that the block is not resident in any caches in the system. The only valid copy of the block is in memory. All valid bits and the modify bit are zero in this state.
SHARED. This state means that there may be cached copies of the cache line, and all such copies are the same as the copy held by the memory. One or more valid bits in the directory are set and the modified bit is zero.
MODIFIED. This state means that one and only one cache in the system has a copy of the cache line. This cache's copy is assumed to be different than the copy held by the memory. One valid bit is set along with the modified bit in this state.
LOCKED. This state means that this cache line has been locked by a system requestor. The cache line is unavailable to other requestors until it is unlocked. This state is a cache inhibited state so no shared copies exist. The lock bit is set in this state and all vbits are zero.
BUSY. This state means that this cache line has an outstanding copyback command. The directory entry bits are unchanged when a copyback command is issued, so the modified bit and the vbit of the node which currently holds the data will still be set to one. The busy state is set by loading the address, opcode and requestor ID of the request into the Copyback CAM.
These five states are qualified with the UNORDERED bit which indicates whether the cache line is subject to packet ordering constraints. This affects whether local replies need to travel via the PIBus, but does not affect the type of reply packet or the coherent directory state.
MC Directory Store
The memory's directory information is stored in DRAMs controlled by the MC ASIC. Each entry in the Directory Store (DTS, 224, FIG. 9) corresponds to a block in the main DRAM data store. Each DTS entry is protected with 6 bits of ECC, used to provide single and double bit error detection and single bit error correction. The DTS is addressed with a 12-bit address bus that is separate from the address bus for the data store. These separate busses are needed to allow multiple accesses to the directory (read and write) while a single multiple-beat block is being accessed in the data store. The DTS may be implemented with 32 MB DRAM SIMMs, which would be incompletely used, since only 24 MBs are needed.
For each DTS entry, bit assignments are as follows:
Bit[10]--Unordered
Bit[9]--Lock
Bit[8]--Mod
Bit[7:0]--Vbits (Node 0=Bit 0)
Vbits--8 bits--one valid bit for each possible node. Vbit=1 indicates that the corresponding node has a valid copy of this block.
Mod--1 bit--the modified bit. Mod=1 indicates that one node has a modified copy of this block and the data in memory is stale. When Mod=1, there must be one and only one Vbit set.
Lock--1 bit--the lock bit. Lock=1 indicates that a node has locked the block for its exclusive use. When the lock bit is set, there can not be any Vbits set.
Unordered--1 bit--the unordered bit. Unordered=1 indicates that any local read replies from this block must be sent via the backplane to insure ordering with any outstanding invalidates.
Busy--A Copyback CAM hit. A directory entry is busy if its block address matches the tag stored in a valid Copyback CAM entry. Such a CAM hit indicates that there is an outstanding copyback request for this block. The memory DRAMs hold stale data for this block so this block is unusable until copyback data is received.
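The bit assignments and the five states can be tied together in a small decoding sketch. It follows the layout given above (Vbits in bits 7:0, Mod in bit 8, Lock in bit 9, Unordered in bit 10) and treats a Copyback CAM hit as an input flag. The function name, the state labels as strings, and the choice to raise an error on an inconsistent entry are assumptions made for illustration.

```python
def decode_dts(entry, cam_hit=False):
    """Decode an 11-bit DTS entry into (state, holder nodes, unordered flag).

    cam_hit: True if the block address matches a valid Copyback CAM entry.
    """
    vbits = entry & 0xFF
    mod = (entry >> 8) & 1
    lock = (entry >> 9) & 1
    unordered = bool((entry >> 10) & 1)
    holders = [node for node in range(8) if (vbits >> node) & 1]

    if cam_hit:
        state = "BUSY"  # directory bits are left unchanged while a copyback is outstanding
    elif lock:
        if vbits:
            raise ValueError("Lock set with Vbits set")
        state = "LOCKED"
    elif mod:
        if len(holders) != 1:
            raise ValueError("Mod set without exactly one Vbit")
        state = "MODIFIED"
    elif vbits:
        state = "SHARED"
    else:
        state = "UNUSED"
    return state, holders, unordered
```

For example, an entry of 0x104 (Mod set, Vbit 2 set) decodes as MODIFIED with node 2 as the only holder; the same bits with a CAM hit decode as BUSY.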
Basic Memory Read Access
The following is a detailed description of how a read request is processed by the MC. A Read request packet is present on the MCBUS. The MC registers the first word, which is the header, into an Input Register portion of local registers (226). The packet address and command are inspected and since the packet is of interest to the memory it is passed through the Input Queue (230) to the DRAM Controller (232). The address is passed through the RAS/CAS address logic of the DRAM Controller (232), where it is converted into a two part 12-bit DRAM address. The RAS and CAS strobes are also created there, as are the WRITE and CHIP.sub.-- SELECT signals. The address is then clocked into both Address Registers (234) in the address logic (232), one of which addresses the Data Store DRAMS and the other addresses the DTS DRAMS. At this point the two registers hold the same address and the Data Store and the DTS will be read simultaneously.
The Directory bits for that address are read from the DTS and registered into the Directory data path (RDP) input register (236). They are then passed through the ECC checking logic (238) and corrected if necessary. The directory bits are then passed to the Header and Directory Decode Module (240) where it is determined what actions must be taken to maintain coherency. New directory bits are generated and passed through ECC generation and into the RDP (236) output register. From there the new directory bits and ECC are written into the DTS. The DTS reads and writes are only one beat each, while the read of the Data Store is 4 beats. Therefore the DTS write can be started while the Data Store read is still in progress. Thus the need for separate address registers for the DTS and Data Store.
Once the directory bits are decoded, the Header Encode Module (242) generates a 64-bit