|
| United States Patent | 7165257 |
| Musoll , ; et al. | January 16, 2007 |
|
Context selection and activation mechanism for activating one of a group of inactive contexts in a processor core for servicing interrupts
|
|
| Abstract | |
|
| A logic system in a data packet processor is provided for selecting and releasing one of a plurality of contexts. The selected and released context is dedicated for enabling the processing of interrupt service routines corresponding to interrupts generated in data packet processing and pending for service. The system comprises, a first determination logic for determining control status of all of the contexts, a second determination logic for determining if a context is idle or not, a selection logic for selecting a context and a context release mechanism for releasing the selected context. Determination by the logic system that all contexts are singularly owned by an entity not responsible for packet processing and that at least one of the contexts is idle, triggers immediate selection and release of an idle one of the at least one idle contexts to an entity responsible for packet processing. |
|
| Inventors: | Musoll; Enrique (San Jose, CA), Nemirovsky; Mario (Saratoga, CA), Melvin; Stephen (San Francisco, CA) |
| Assignee: | MIPS Technologies, Inc. (Mountain View, CA) |
| Appl. No.: | 09/927,129 |
| Filed: | August 10, 2001 |
| PCT Pub Date: | January 23, 2007 |
|
|
| Current U.S. Class: | | | 718/108 710/19 712/229 |
| Current International Class: | | | G06F 9/46 (20060101) G06F 3/00 (20060101) G06F 9/00 (20060101) |
| Field of Search: | | | 370/230-232,392,422-430 709/249,224 718/100-108 710/19 712/229 |
|
| [References Cited] - [Referenced By] | |
|
|
|
| U.S. Patent Documents | |
| 20010004755 | June 2001 | Levy et al. | |
| 20010005253 | June 2001 | Komatsu | |
| 20010024456 | September 2001 | Zaun et al. | |
| 20010043610 | November 2001 | Nemirovsky et al. | |
| 20010052053 | December 2001 | Nemirovsky et al. | |
| 20020016883 | February 2002 | Musoll et al. | |
| 20020049964 | April 2002 | Takayama et al. | |
| 20020054603 | May 2002 | Musoll et al. | |
| 20020071393 | June 2002 | Musoll | |
| 20020083173 | June 2002 | Musoll et al. | |
| 20020124262 | September 2002 | Basso et al. | |
| 20040015598 | January 2004 | Lin | |
| 20040148382 | July 2004 | Narad et al. | |
| 20040172471 | September 2004 | Porter | |
| 20040172504 | September 2004 | Balazich et al. | |
| 20040213251 | October 2004 | Tran et al. | |
| 20050061401 | March 2005 | Tokoro et al. | |
| 20060036705 | February 2006 | Musoll et al. | |
| 4200927 | April 1980 | Huges et al. | |
| 4707784 | November 1987 | Ryan et al. | |
| 4942518 | July 1990 | Weatherford et al. | |
| 5023776 | June 1991 | Gregor | |
| 5121383 | June 1992 | Golestani | |
| 5291481 | March 1994 | Doshi et al. | |
| 5408464 | April 1995 | Jurkevich | |
| 5465331 | November 1995 | Yang et al. | |
| 5471598 | November 1995 | Quattromani et al. | |
| 5521916 | May 1996 | Choudhury et al. | |
| 5559970 | September 1996 | Sharma | |
| 5619497 | April 1997 | Gallagher et al. | |
| 5634015 | May 1997 | Chang et al. | |
| 5659797 | August 1997 | Zandveld et al. | |
| 5675790 | October 1997 | Walls | |
| 5708814 | January 1998 | Short et al. | |
| 5724565 | March 1998 | Dubey et al. | |
| 5737525 | April 1998 | Picazo et al. | |
| 5784649 | July 1998 | Begur et al. | |
| 5784699 | July 1998 | McMahon et al. | |
| 5796966 | August 1998 | Simcoe et al. | |
| 5809321 | September 1998 | Hansen et al. | |
| 5812810 | September 1998 | Sager | |
| 5892966 | April 1999 | Petrick et al. | |
| 5918050 | June 1999 | Rosenthal et al. | |
| 5951679 | September 1999 | Anderson et al. | |
| 5978570 | November 1999 | Hillis | |
| 5978893 | November 1999 | Bakshi et al. | |
| 5987578 | November 1999 | Butcher | |
| 6009516 | December 1999 | Steiss et al. | |
| 6016308 | January 2000 | Crayford et al. | |
| 6023738 | February 2000 | Priem et al. | |
| 6047122 | April 2000 | Spiller | |
| 6058267 | May 2000 | Kanai et al. | |
| 6070202 | May 2000 | Minkoff et al. | |
| 6073251 | June 2000 | Jewett et al. | |
| 6088745 | July 2000 | Bertagna et al. | |
| 6131163 | October 2000 | Wiegel | |
| 6151644 | November 2000 | Wu | |
| 6157955 | December 2000 | Narad et al. | |
| 6169745 | January 2001 | Liu et al. | |
| 6195680 | February 2001 | Goldszmidt et al. | |
| 6219339 | April 2001 | Doshi et al. | |
| 6219783 | April 2001 | Zahir et al. | |
| 6223274 | April 2001 | Catthoor et al. | |
| 6226680 | May 2001 | Boucher et al. | |
| 6247040 | June 2001 | Born et al. | |
| 6247105 | June 2001 | Goldstein et al. | |
| 6249801 | June 2001 | Zisapel et al. | |
| 6253313 | June 2001 | Morrison et al. | |
| 6263452 | July 2001 | Jewett et al. | |
| 6381242 | April 2002 | Maher, III et al. | |
| 6389468 | May 2002 | Muller et al. | |
| 6438135 | August 2002 | Tzeng | |
| 6453360 | September 2002 | Muller et al. | |
| 6460105 | October 2002 | Jones et al. | |
| 6483804 | November 2002 | Muller et al. | |
| 6502213 | December 2002 | Bowman-Amuah | |
| 6523109 | February 2003 | Meier | |
| 6529515 | March 2003 | Raz et al. | |
| 6535905 | March 2003 | Kalafatis et al. | |
| 6614796 | September 2003 | Black et al. | |
| 6625808 | September 2003 | Tarditi | |
| 6640248 | October 2003 | Jorgensen | |
| 6650640 | November 2003 | Muller et al. | |
| 6738371 | May 2004 | Ayres | |
| 6738378 | May 2004 | Tuck, III et al. | |
| 6813268 | November 2004 | Kalkunte et al. | |
| 6820087 | November 2004 | Langendorf et al. | |
| 7065096 | June 2006 | Musoll et al. | |
|
| Foreign Patent Documents | |
| WO 03/005645 | Jun., 2002 | WO | |
|
| Other References | |
Knuth, Donald E., "The Art of Computer Programming, Third Edition, vol. 1, Fundamental Algorithms", "Sec. 2.5 Dynamic Storage Allocation", 1997, pp. 435-456, Addison-Wesley, US. cited by other .
Diefendorff, Keith, K7 Challenges Intel, Microprocessor Report, Oct. 26, 1998, vol. 12, No. 14, US. cited by other .
U.S. Appl. No. 09/608,750, Nemirovsky et al. cited by other .
U.S. Appl. No. 09/602,279, Nemirovsky et al. cited by other .
Melvin et al., "Extended Instruction Set for a Packet Processing Applications," Jul. 5, 2001, Disclosure Document #496559, USPTO. cited by other .
Musoll et al., "Hardware Algorithm for Allocating and De-Allocating Consecutive Blocks of Memory," Apr. 3, 2001, Disclosure Document #491557, USPTO. cited by other .
Musoll et al., "Mechanism to Overflow Packets to a Software Controlled Memory When They Do Not Fit into a Hardware Controlled Memeory,", Jul. 3, 2001, Disclosure Document #496391, USPTO. cited by other .
Musoll et al., "Mechanism for Allowing a Limited Packet Head and/or Tail Growth Without Moving the Packet to a Different Memeory Location," Apr. 16, 2001, Disclosure Document #492429, USPTO. cited by other .
Musoll, Enric, "Functional Validation of a Packet Management Unit," May 18, 2001, Disclosure Document #429011, USPTO. cited by other .
Sampath et al., "Mechanism to Un-speculatively Pre-fetch Instructions from the Thread Associated to a Packet," Apr. 2, 2001, Disclosure Document #491527, USPTO. cited by other .
Yamamoto, Wayne. An Analysis of Multistreamed, Superscalar Processor Architectures. University of California Santa Barbara Dissertation. Dec. 1995. Santa Barbara, US. cited by other .
Yamamoto et al. "Increasing Superscalar Performance Through Multistreaming." Parallel Architectures and Compilation Techniques (PACT '95). 1995. cited by other .
The PowerPC Architecture: A Specification for a New Family of RISC Processors. 2.sup.nd Ed. May 1994. pp. 70-72. Morgan Kaufmann. San Francisco, US. cited by other .
MC68020 32-Bit Microprocessor User's Manual. 3.sup.rd Ed.. 1989. pp. 3-125, 3-126, and 3-127. Prentice Hall, NJ, US. cited by other .
Potel, M. J. "Real-Time Playback in Animation Systems." Proceedings of the 4.sup.th Annual Conference on Computer Graphics and Interactive Techniques. 1977. pp. 72-77. San Jose, CA, US. cited by other .
ARM Architecture Reference Manual. 1996. pp. 3-41, 3-42, 3-43, 3-67, and 3-68. Prentice Hall, NJ, US. cited by other .
ESA/390 Principles of Operation. IBM Online Publications Center Reference No. SA22-7201-08. Table of Contents and paras. 7.5.31 and 7.5.70. IBM Corporation. Boulder, Co, US. cited by other .
MC88110 Second Generation RISC Microprocessor User's Manual. 1991. pp. 10-66, 10-67, and 10-71. Motorola, Inc. cited by other .
Diefendorff et al. "Organization of the Motorola 88110 Superscalar RISC Microprocessor," IEEE Journal of Microelectronics. Apr. 1992. pp. 40-63. vol. 12, No. 2. IEEE. New York, NY, US. cited by other .
Kane, Gerry. PA-RISC 2.0 Architecture. 1996, pp. 7-106 and 7-107. Prentice Hall. NJ, US. cited by other .
Diefendorff et al. "AltiVec Extension to PowerPC Accelerates Media Processing." IEEE Journal of Microelectronics. vol. 20, No. 2 (2000): pp. 85-95. cited by other .
Grunewald et al. "Towards Extremely Fast Context Switching in a Block Multithreaded Processor." Proceedings of EUROMICRO 22, 1996. pp. 592-599. cited by other .
Bradford et al. "Efficient Synchronization for Multithreaded Processors." Workshop on Multithreaded Execution, Architecture, and Compilation. Jan.-Feb. 1998. pp. 1-4. cited by other .
Pai et al. "An Evaluation of Memory Consistency Models for Shared-Memory Systems with ILP Processors." Proceedings of ASPLOS-VII, Oct. 1996: pp. 12-23, ACM, Inc. cited by other .
Yoaz et al. "Speculation Techniques for Improving Load Related Instruction Scheduling." 1999. pp. 42-53, IEEE. cited by other .
Kessler, R. E. "The Alpha 21264 Microprocessor: Out-of-Order Execution at 600 MHz." Aug. 1998. cited by other .
Donalson et al. "DISC: Dynamic Instruction Stream Computer, An Evaluation of Performance." 26th Hawaii Conference on Systems Sciences. vol. 1. 1993. pp. 448-456. cited by other .
Nemirovsky et al. "DISC: Dynamic Instruction Stream Computer." ACM. 1991. pp. 163-171. cited by other .
Musoll et al. Mechanism to Prevent the Download of a Packet with Pending Writes Affecting Its Data. Apr. 11, 2001. Disclosure Document #492430, USTPO. cited by other .
Ungerer et al. A Survey of Processors with Explicit Multithreading. ACM Computing Surveys, vol. 35, No. 1. Mar. 2003. pp. 29-63. cited by other .
U.S. Appl. No. 09/737,375, Mario Nemirovsky et al. cited by other .
U.S. Appl. No. 60/181,364, Mario Nemirovsky et al. cited by other .
Enrique Musoll et al. Mechanism to Activate a Context When No Stream is Running in a Multi-Streaming Processing Core Apr. 16, 2001, Disclosure Document #492431, USPTO. cited by other .
U.S. Appl. No. 09/591,510, filed Jun. 12, 2000, Gelinas et al. cited by other.~
|
|
|
| Primary Examiner: | Banankhah; Majid
|
| Assistant Examiner: | Suryawanshi; Suresh K
|
| Attorney, Agent or Firm: | Boys; Donald R. Huffman; James W.
|
|
|
| Parent Case Text | |
|
CROSS-REFERENCE TO RELATED DOCUMENTS
The present invention is a continuation in part (CIP) to a U.S. patent application Ser. No. 09/737,375 entitled "Queuing System for Processors in Packet Routing Operations" and filed on Dec. 14, 2000, which is included herein by reference. In addition, Ser. No. 09/737,375 claims priority benefit under 35 U.S.C. 119 (e) of Provisional Patent Application Ser. No. 60/181,364 filed on Feb. 8, 2000, and incorporates all disclosure of the prior applications by reference. |
|
|
| Claims | |
|
What is claimed is:
1. A logic system in a data packet processor for selecting and releasing a context among a plurality of such contexts, the selected and released context dedicated for enabling processing of interrupt service routines corresponding to interrupts generated in data packet processing and pending for service, the system comprising: logic configured to: determine control status of all of the contexts; determine if a context is idle or not; select a context; and release the selected context; wherein in response to determining that none of the plurality of contexts are owned by an entity responsible for packet processing (SPU) and that at least one of the contexts is idle, the logic is configured to trigger immediate selection and release of one of the at least one idle contexts to the entity responsible for packet processing upon determination that none of the contexts are owned by the entity responsible for packet processing and that there are no idle contexts, aborts pre-loading of a given context of the contexts to render the given context idle.
2. The logic system of claim 1 wherein the data packet processor is part of a data packet routing system.
3. The logic system of claim 2 wherein the data packet routing system is connected to a data packet network.
4. The logic system of claim 3 wherein the data packet network is the Internet network.
5. The logic system of claim 1 wherein the logic is integrated in a single hardware device.
6. The logic system of claim 5 wherein the hardware device is a register transfer unit.
7. The logic system of claim 1 further comprising a memory marker pointing to a memory address of a pre-defined instruction thread for invoking a stream to run in the released context.
8. The logic system of claim 1 wherein the selected and released context is dedicated to the servicing of pending interrupts.
9. The logic system of claim 7 wherein the memory containing the pre-defined thread is a cache memory used for containing program instructions.
10. The logic system of claim 7 wherein the memory containing the pre-defined thread is a dedicated memory section of a processing core, the core also containing the contexts.
11. The logic system of claim 1 wherein a priority scheme is used to select an idle context in the case of more than one idle context.
12. The logic system of claim 11 wherein the priority scheme comprises a prediction as to which of a plurality of idle contexts is likely to have access to needed functional units.
13. A method for selecting and releasing a context among a plurality of contexts to a processing entity for processing interrupt service routines associated with interrupts generated during data packet processing, the method comprising: (a) determining that none of the plurality of contexts are under control of the processing entity; (b) determining that there is at least one idle context within the plurality of contexts; (c) selecting a first context of the at least one idle contexts for release; and (d) releasing the first context to the processing entity, so the processing entity may process service routines and a memory marker is sent along with a release notification, the marker pointing to a location in memory where a pre-defined instruction thread exists to be used by the processing entity for invoking a stream to run in the released context.
14. The method of claim 13 further comprising a step (e) for processing low-priority tasks before releasing the selected context.
15. The method of claim 13 wherein the data packet processing is performed by a processor of a data packet router connected to a data packet network.
16. The method of claim 14 wherein the data packet network is the Internet network.
17. The method of claim 13 wherein in step (a) the determination of control status of the contexts is made by logic on a hardware device.
18. The method of claim 13 wherein in step (b) the determination of idle status of the contexts is made by logic on a hardware device.
19. The method of claim 13 wherein in steps (a) and (b) are enabled by a single hardware device containing the appropriate logics.
20. The method of claim 19 wherein in steps (c) and (d) are also enabled by the same hardware device containing the appropriate logics.
21. A method for processing service routines associated in a data packet processing system using a dedicated context selected among a plurality of contexts not under control of a data packet processor, the method comprising: (a) receiving, at the processor, notification of a selected context about to be released and a memory marker pointing to an instruction thread in memory; (b) fetching the instruction thread designated by the received marker; (c) executing the instruction thread in the released context; (d) detecting at least one pending interrupt that requires processing; and (e) running the corresponding service routine or routines that satisfies the at least one interrupt.
22. The method of claim 21 further comprising a step for calling and executing other service routines after all interrupts are serviced.
23. The method of claim 21 wherein the data processing system is a data packet router connected to a data packet network.
24. The method of claim 23 wherein the data packet network is the Internet network.
25. The method of claim 21 wherein in step (a) the instruction thread has at least one instruction invoking a processing stream.
26. The method of claim 21 wherein in step (a) the memory marker is a program counter number of the beginning location of the thread in memory.
27. The method of claim 21 wherein in step (a) the memory is an instruction cache memory containing a program of instructions.
28. The method of claim 21 wherein in step (a) the memory is a dedicated memory in the processing core, the core also containing the contexts.
29. The method of claim 21 wherein in step (b) fetching the instruction thread occurs before the context is actually under the control of the processor.
30. The method of claim 21 wherein in step (c) the instruction thread contains at least one instruction invoking a stream.
31. The method of claim 30 wherein in step (c) the instruction thread further contains an instruction for checking for pending interrupts.
32. The method of claim 21 wherein in step (d) the interrupts are taken by priority in the event of more than one pending.
33. The method of claim 21 wherein steps (a) through (e) are completed when no other contexts are under control of the processor.
|
|
|
| Description | |
|
FIELD OF THE INVENTION
The present invention is in the field of digital processing and pertains in particular to apparatus and methods for processing packets in routers for packet networks, and yet more particularly to apparatus and methods for activating one of a group of contexts that are, at the time of activation, underutilized by the processing core.
BACKGROUND OF THE INVENTION
The well-known Internet network is a notoriously well-known publicly accessible communication network at the time of filing the present patent application, and arguably the most robust information and communication source ever made available. The Internet is used as a prime example in the present application of a data-packet-network which will benefit from the apparatus and methods taught in the present patent application, but is just one such network, following a particular standardized protocol. As is also very well known, the Internet (and related networks) are always a work in progress. That is, many researchers and developers are competing at all times to provide new and better apparatus and methods, including software, for enhancing the operation of such networks.
In general the most sought-after improvements in data packet networks are those that provide higher speed in routing (more packets per unit time) and better reliability and fidelity in messaging. What is generally needed are router apparatus and methods increasing the rates at which packets may be processed in a router.
As is well-known in the art, packet routers are computerized machines wherein data packets are received at any one or more of typically multiple ports, processed in some fashion, and sent out at the same or other ports of the router to continue on to downstream destinations. As an example of such computerized operations, keeping in mind that the Internet is a vast interconnected network of individual routers, individual routers have to keep track of which external routers to which they are connected by communication ports, and of which of alternate routes through the network are the best routes for incoming packets. Individual routers must also accomplish flow accounting, with a flow generally meaning a stream of packets with a common source and end destination. A general desire is that individual flows follow a common path. The skilled artisan will be aware of many such requirements for computerized processing.
Typically a router in the Internet network will have one or more Central Processing Units (CPUs) as dedicated microprocessors for accomplishing the many computing tasks required. In the current art at the time of the present application, these are single-streaming processors; that is, each processor is capable of processing a single stream of instructions. In some cases developers are applying multiprocessor technology to such routing operations. The present inventors have been involved for some time in development of dynamic multistreaming (DMS) processors, which processors are capable of simultaneously processing multiple instruction streams. One preferred application for such processors is in the processing of packets in packet networks like the Internet.
In a data-packet processor, a configurable queuing system for packet accounting during processing is known to the inventor and disclosure of same is referenced herein as Ser. No. 09/737,375 in the cross-reference section of this specification. The queuing and accounting system has a plurality of queues arranged in one or more clusters, an identification mechanism for creating a packet identifier for arriving packets, insertion logic for inserting packet identifiers into queues and for determining into which queue to insert a packet identifier, and selection logic for selecting packet identifiers from queues to initiate processing of identified packets, downloading of completed packets, or for re-queuing of the selected packet identifiers.
A portion of memory in the above-described system is called packet memory. The packet memory is the memory where data packets reside during processing before they can be downloaded by a packet management unit (PMU) to an output network interface (ONI). During processing packet data may be altered. A portion of the packet memory is termed the local packet memory (LPM), and directly managed by hardware in the PMU instead of by software.
Whenever a data packet has been processed and is ready to be downloaded from memory, the processing core or streaming processor unit (SPU) sends a command (PKTDONE) to the PMU. This command contains, among other information, a packet identifier (typically a number) of the packet that is ready to be downloaded. The PMU will then proceed with the download of this packet if it resides in LPM. If not, SPU software operating through a system interface unit (SIU) will download the packet upon request from an external portion of packet memory (EPM).
Data packet processing requires loading of contexts with pertinent information to control the processing, and in a system known to the inventor, processing is accomplished by a Dynamic Multistreaming Processor running, in one embodiment, 8 streams. In this system there are at least eight contexts for processing data packets, and in some cases one or more redundant contexts allowing background loading.
The contexts are located physically within the streaming processor unit (SPU) core, which also has associated functional units required for processing. In a preferred embodiment each context can have a state of either PMU-owned or SPU-owned. When information is being preloaded into a context it is considered PMU-owned. When a stream is running with a context, the context is considered SPU-owned.
A context-selection mechanism is known to the inventor and is provided within PMU hardware in the processor, operating as part of or in conjunction with a register transfer unit (RTU), and the context-selection mechanism is adapted for selecting a best context from a pool of available contexts for processing a data packet. The context selection mechanism comprises an interface for communicating with other elements of a multi-streaming processor; circuitry for computing input data into a result value according to logic rule and for selecting a context based on the computed value, and a loading mechanism for preloading packet information for the selected packet into the selected context for subsequent processing.
The computation of the input data functions to enable identification and selection of a best context for processing a data packet according to the logic rule at the instant time such that a large number of context selections over a period of time acts to balance load pressure on functional units associated with stream clusters.
There is a possibility during operation that all of the contexts of a processor could be PMU-owned at any given time. Moreover some of the contexts may be undergoing pre-load operations while others are sitting idle waiting for PMU selection, preload, and release (activation). In this situation none of the contexts are available for processing because none are SPU-owned.
The lack of any contexts available for processing also means that no processing may occur involving servicing of any pending interrupts that may have been generated while all of the contexts are PMU-owned. It is noted herein that an interrupt may be PMU or SPU generated. In the case of all contexts being PMU-owned when there are one or more interrupts pending, then the SPU must wait until the PMU releases (activates) a context. This presents a problem in that it may take a long time to start executing an interrupt service routine for an interrupt. The problem is magnified if the interrupt that has been generated is due to a high priority event that needs immediate processing.
Therefore what is clearly needed is a mechanism for limiting the possibility of PMU ownership of all contexts to a very limited span of time. A mechanism such as this would insure a relatively small latency in response time for processing a potentially critical interrupt routine or routines that may be pending.
SUMMARY OF THE INVENTION
In a preferred embodiment of the present invention, a logic system in a data packet processor is provided for selecting and releasing a context among a plurality of such contexts. The selected and released context is dedicated for enabling the processing of interrupt service routines corresponding to interrupts generated and pending for service. The system comprises, a first determination logic for determining control status of all of the contexts, a second determination logic for determining if a context is idle or not, a selection logic for selecting a context and a context release mechanism for releasing the selected context. Determination by the logic system that all contexts are singularly owned by an entity not responsible for packet processing and that at least one of the contexts is idle, triggers immediate selection and release of an idle one of the at least one idle contexts to an entity responsible for packet processing.
In a preferred embodiment, the data packet processor is part of a data packet routing system connected to the Internet network. Also in a preferred embodiment, the first and second determination logics and the selection logic and release mechanism are integrated on one hardware device. In one aspect, the hardware device is a register transfer unit. In another embodiment, the logic system further comprises a memory marker pointing to a memory address of a pre-defined instruction thread for invoking a stream to run in the released context. In all aspects, the selected and released context is dedicated to the servicing of pending interrupts.
In one embodiment, the memory containing the pre-defined thread is a cache memory used for containing program instructions. In another embodiment, the memory containing the pre-defined thread is a dedicated memory section of a processing core, the core also containing the contexts. In one aspect, a priority scheme is used to select an idle context in the case of more than one idle context. In this aspect priority rules used are based on prediction of likely context assignment released to the processing entity after release of the context for servicing interrupts.
In another aspect of the present invention, a method is provided for selecting and releasing a context among a plurality of contexts to a processing entity for processing interrupt service routines associated with interrupts generated during data packet processing. The method comprises the steps of, (a) determining that all of the contexts are not under control of the processing entity at the time of selection and release, (b) determining that there is at least one idle context within the singularly owned group of contexts, (c) selecting one of the at least one idle contexts for release and (d) releasing the selected context to a packet processing entity charged with processing the service routines.
In a preferred embodiment, the data packet processing is performed by a processor of a data packet router connected to the Internet network. In one aspect of the method in step (a) the determination of control status of the contexts is made by logic on a hardware device. In another aspect of the method in step (b) the determination of idle status of the contexts is made by logic on a hardware device. In a preferred aspect of the method, steps (a) and (b) are enabled by a single hardware device containing the appropriate logics. In this preferred aspect steps (c) and (d) are also enabled by the same hardware device containing the appropriate logics.
In another aspect of the method in step (d), a memory marker is sent along with release notification, the marker pointing to a location in memory where a pre-defined instruction thread exists to be used by the processing entity for invoking a stream to run in the released context.
In still another aspect of the present invention, a method is provided for processing service routines corresponding to generated interrupts in a data packet processing system using a dedicated context selected among a plurality of contexts not under control of the processor. The method comprises the steps of, (a) receiving, at the processor, notification of a selected context about to be released and memory marker pointing to an instruction thread in memory, (b) fetching the instruction thread designated by the received marker, (c) executing the instruction thread in the released context now under processor control, (d) detecting at least one pending interrupt that requires processing and (e) running the corresponding service routine or routines that satisfies the at least one interrupt.
In a preferred embodiment, the data processing system is a data packet router connected to the Internet network. In all aspects of the method in step (a), the instruction thread has at least one instruction invoking a processing stream. In one aspect of the method in step (a), the memory marker is a program counter number of the beginning location of the thread in memory. In one embodiment, the memory is an instruction cache memory containing a program of instructions. In another embodiment, the memory is a dedicated memory in the processing core, the core also containing the contexts.
In another aspect of the method in step (b) fetching the instruction thread occurs before the context is actually under the control of the processor. In all aspects of the method in step (c) the instruction thread contains at least one instruction invoking a stream. In one aspect, the instruction thread further contains an instruction for checking for pending interrupts. In yet another aspect in step (d), the interrupts are taken by priority in the event of more than one pending. In all aspects of the method, steps (a) through (e) are completed when no other contexts are under control of the processor.
Now, for the first time, a mechanism is provided for limiting the event of non-processor ownership of all contexts of the processor to a very small amount of time. A mechanism such as this insures a relatively small latency in response time for processing a potentially critical interrupt routine or routines that may be pending.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a simplified block diagram showing relationship of functional areas of a DMS processor in a preferred embodiment of the present invention.
FIG. 2 is a block diagram of the DMS processor of FIG. 1 showing additional detail.
FIG. 3 is a block diagram illustrating uploading of data into the LPM or EPM in an embodiment of the invention.
FIG. 4a is a diagram illustrating determination and allocation for data uploading in an embodiment of the invention.
FIG. 4b is a diagram showing the state that needs to be maintained for each of the four 64 KB blocks.
FIGS. 5a and 5b illustrate an example of how atomic pages are allocated in an embodiment of the present invention.
FIGS. 6a and 6b illustrate how memory space is efficiently utilized in an embodiment of the invention.
FIG. 7 is a top-level schematic of the blocks of the XCaliber PMU unit involved in the downloading of a packet.
FIG. 8 is a diagram illustrating the phenomenon of packet growth and shrink.
FIG. 9 is a block diagram showing high-level communication between the QS and other blocks in the PMU and SPU in an embodiment of the present invention.
FIG. 10 is a table illustrating six different modes in an embodiment of the invention into which the QS can be configured.
FIG. 11 is a diagram illustrating generic architecture of the QS of FIGS. 2 and 7 in an embodiment of the present invention.
FIG. 12 is a table indicating coding of the outbound DeviceId field in an embodiment of the invention.
FIG. 13 is a table illustrating priority mapping for RTU transfers in an embodiment of the invention.
FIG. 14 is a table showing allowed combinations of Active, Completed, and Probed bits for a valid packet in an embodiment of the invention.
FIG. 15 is a Pattern Matching Table in an embodiment of the present invention.
FIG. 16 illustrates the format of a mask in an embodiment of the invention.
FIG. 17 shows an example of a pre-load operation using the mask in FIG. 16.
FIG. 18 illustrates shows the PMU Configuration Space in an embodiment of the present invention.
FIGS. 19a, 19b and 19c are a table of Configuration register Mapping.
FIG. 20 is an illustration of a PreloadMaskNumber configuration register.
FIG. 21 illustrates a PatternMatchingTable in a preferred embodiment of the present invention.
FIG. 22 illustrates a VirtualPageEnable configuration register in an embodiment of the invention.
FIG. 23 illustrates a ContextSpecificPatternMatchingMask configuration register in an embodiment of the invention.
FIG. 24 illustrates the MaxActivePackets configuration register in an embodiment of the present invention.
FIG. 25 illustrates the TimeCounter configuration register in an embodiment of the present invention.
FIG. 26 illustrates the StatusRegister configuration register in an embodiment of the invention.
FIG. 27 is a schematic of a Command Unit and command queues in an embodiment of the present invention.
FIG. 28 is a table showing the format of command inserted in command queues in an embodiment of the present invention.
FIG. 29 is a table showing the format for responses that different blocks generate back to the CU in an embodiment of the invention.
FIG. 30 shows a performance counter interface between the PMU and the SIU in an embodiment of the invention.
FIG. 31 shows a possible implementation of internal interfaces among the different units in the PMU in an embodiment of the present invention.
FIG. 32 is a diagram of a BypassHooks configuration register in an embodiment of the invention.
FIG. 33 is a diagram of an InternalStateWrite configuration register in an embodiment of the invention.
FIGS. 34 39 comprise a table listing events related to performance counters in an embodiment of the invention.
FIG. 40 is a table illustrating the different bypass hooks implemented in the PMU in an embodiment of the invention.
FIG. 41 is a table relating architecture and hardware blocks in an embodiment of the present invention.
FIGS. 42 45 comprise a table showing SPU-PMU Interface in an embodiment of the invention.
FIGS. 46 49 comprise a table showing SIU-PMU Interface in an embodiment of the invention.
FIG. 50 is a block diagram illustrating various components involved in context selection and activation according to an embodiment of present invention.
FIG. 51 is a process flow chart illustrating logic steps for selecting and releasing a context according to an embodiment of the present invention.
FIG. 52 is a process flow chart illustrating steps for receiving a released context at the SPU and utilizing the context for interrupt servicing according to an embodiment of the present invention.
DESCRIPTION OF THE PREFERRED EMBODIMENTS
In the provisional patent application Ser. No. 60/181,364 referenced above there is disclosure as to the architecture of a DMS processor, termed by the inventors the XCaliber processor, which is dedicated to packet processing in packet networks. Two extensive diagrams are provided in the referenced disclosure, one, labeled NIO Block Diagram, shows the overall architecture of the XCaliber processor, with input and output ports to and from a packet-handling ASIC, and the other illustrates numerous aspects of the Generic Queue shown in the NIO diagram. The NIO system in the priority document equates to the Packet Management Unit (PMU) in the present specification. It is to the several aspects of the generic queue that the present application is directed.
FIG. 1 is a simplified block diagram of an XCaliber DMS processor 101 with a higher-level subdivision of functional units than that shown in the NIO diagram of the priority document. In FIG. 1 XCaliber DMS processor 101 is shown as organized into three functional areas. An outside System Interface Unit (SIU) area 107 provides communication with outside devices, that is, external to the XCaliber processor, typically for receiving and sending packets. Inside, processor 101 is divided into two broad functional units, a Packet Management Unit (PMU) 103, equating to the NIO system in the priority document mentioned above, and a Stream Processor Unit (SPU) 107. The functions of the PMU include accounting for and managing all packets received and processed. The SPU is responsible for all computational tasks.
The PMU is a part of the XCaliber processor that offloads the SPU from performing costly packet header accesses and packet sorting and management tasks, which would otherwise seriously degrade performance of the overall processor.
Packet management is achieved by (a) Managing on-chip memory allocated for packet storage, (b) Uploading, in the background, packet header information from incoming packets into different contexts (context registers, described further below) of the XCaliber processor, (c) Maintaining, in a flexible queuing system, packet identifiers of the packets currently in process in the XCaliber.
The described packet management and accounting tasks performed by the PMU are performed in parallel with processing of packets by the SPU core. To implement this functionality, the PMU has a set of hardware structures to buffer packets incoming from the network, provide them to the SPU core and, if needed, send them out to the network when the processing is completed. The PMU features a high degree of programmability of several of its functions, such as configuration of its internal packet memory storage and a queuing system, which is a focus of the present patent application.
FIG. 2 is a block diagram of the XCaliber processor of FIG. 1 showing additional detail. SIU 107 and SPU 105 are shown in FIG. 2 as single blocks with the same element numbers used in FIG. 1. The PMU is shown in considerably expanded detail, however, with communication lines shown between elements.
In FIG. 2 there is shown a Network/Switching Fabric Interface 203 which is in some cases an Application Specific Integrated Circuit (ASIC) dedicated for interfacing directly to a network, such as the Internet for example, or to switching fabric in a packet router, for example, receiving and transmitting packets, and transacting the packets with the XCaliber processor. In this particular instance there are two in ports and two out ports communicating with processor 201. Network in and out interface circuitry 205 and 215 handle packet traffic onto and off the processor, and these two interfaces are properly a part of SIU 107, although they are shown separately in FIG. 2 for convenience.
Also at the network interface within the PMU there are, in processor 201, input and output buffers 207 and 217 which serve to buffer the flow of packets into and out of processor 201.
Referring again to FIG. 1, there is shown a Packet Management Unit (PMU) 103, which has been described as a unit that offloads the requirement for packet management and accounting from the Stream Processing Unit. This is in particular the unit that has been expanded in FIG. 2, and consists substantially of Input Buffer (IB) 207, Output Buffer (OB) 217, Paging Memory Management Unit (PMMU) 209, Local Packet Memory (LPM) 219, Command Unit (CU) 213, Queueing System (QS) 211, Configuration Registers 221, and Register Transfer Unit (RTU) 227. The communication paths between elements of the PMU are indicated by arrows in FIG. 2, and further description of the elements of the PMU is provided below, including especially QS 211, which is a particular focus of the present patent application.
Overview of PMU
Again, FIG. 2 shows the elements of the PMU, which are identified briefly above. Packets arrive to the PMU in the present example through a 16-byte network input interface. In this embodiment packet data arrives to the PMU at a rate of 20 Gbps (max). At an operating speed of 300 MHz XCaliber core frequency, an average of 8 bytes of packet data are received every XCaliber core cycle. The incoming data from the network input interface is buffered in InBuffer (IB) block 207. Network interface 205 within XCaliber has the capability of appending to the packet itself the size of the packet being sent, in the event that the external device has not been able to append the size to the packet before sending the packet. Up to 2 devices can send packet data to XCaliber at (10 Gbps per device), and two in ports are shown from an attached ASIC. It is to be understood that the existence and use of the particular ASIC is exemplary, and packets could be received from other devices. Further, there may be in some embodiments more or fewer than the two in ports indicated.
Packet Memory Manager Unit (PMMU) 209 decides whether each incoming packet has to be stored into on-chip Local Packet Memory (LPM) 219, or, in the case that, for example, no space exists in the LPM to store it, may decide to either send the packet out to an External Packet Memory (EPM) not shown through the SIU block, or may decide to drop the packet. In case the packet is to be stored in the LPM, the PMMU decides where to store the packet and generates all the addresses needed to do so. The addresses generated correspond in a preferred embodiment to 16-byte lines in the LPM, and the packet is consecutively stored in this memory.
In the (most likely) case that the PMMU does not drop the incoming packet, a packet identifier is created, which includes a pointer (named packetpage) to a fixed-size page in packet memory where the packet has started to be stored. The identifier is created and enqueued into Queuing System (QS) block 211. The QS assigns a number from 0 to 255 (named packetNumber) to each new packet. The QS sorts the identifiers of the packets alive in XCaliber based on the priority of the packets, and it updates the sorting when the SPU core notifies any change on the status of a packet. The QS selects which packet identifiers will be provided next to the SPU. Again, the QS is a particular focus of the present application.
Register Transfer Unit (RTU) block 227, upon receiving a packet identifier (packetpage and packetNumber) from the QS, searches for an available context (229, FIG. 2) out of 8 contexts that XCaliber features in a preferred embodiment. For architectural and description purposes the contexts are considered a part of a broader Stream Processing Unit, although the contexts are shown in FIG. 2 as a separate unit 229.
In the case that no context is available, the RTU has the ability to notify the SPU about this event through a set of interrupts. In the case that a context is available, the RTU loads the packet identifier information and some selected fields of the header of the packet into the context, and afterwards it releases the context (which will at that time come under control of the SPU. The RTU accesses the header information of the packet through the SIU, since the packet could have been stored in the off-chip EPM.
Eventually a stream in the SPU core processes the context and notifies the QS of this fact. There are, in a preferred embodiment, eight streams in the DMS core. The QS then updates the status of the packet (to completed), and eventually this packet is selected for downloading (i.e. the packet data of the corresponding packet is sent out of the XCaliber processor to one of the two external devices).
When a packet is selected for downloading, the QS sends the packetPage (among other information) to the PMMU block, which generates the corresponding line addresses to read the packet data from the LPM (in case the packet was stored in the on-chip local memory) or it will instruct the SIU to bring the packet from the external packet memory to the PMU. In any case, the lines of packet data read are buffered into the OutBuffer (OB) block, and from there sent out to the device through the 16-byte network output interface. This interface is independent of its input counterpart. The maximum aggregated bandwidth of this interface in a preferred embodiment is also 20 Gbps, 10 Gbps per output device.
CommandUnit (CU) 213 receives commands sent by SPU 105. A command corresponds to a packet instruction, which are in many cases newly defined instructions, dispatched by the SPU core. These commands are divided into three independent types, and the PMU can execute one command per type per cycle (for a total of up to 3 commands per cycle). Commands can be load-like or store-like (depending on whether the PMU provides a response back to the SPU or not, respectively).
A large number of features of the PMU are configured by the SPU through memory-mapped configuration registers 221. Some such features have to be programmed at boot time, and the rest can be dynamically changed. For some of the latter, the SPU has to be running in a single-thread mode to properly program the functionality of the feature. The CU block manages the update of these configuration registers.
The PMU provides a mechanism to aid in flow control between ASIC 203 and XCaliber DMS processor 201. Two different interrupts are generated by the PMU to SPU 105 when LPM 219 or QS 211 are becoming full. Software controls how much in advance the interrupt is generated before the corresponding structure becomes completely full. Software can also disable the generation of these interrupts.
LPM 219 is also memory mapped, and SPU 105 can access it through the conventional load/store mechanism. Both configuration registers 221 and LPM 219 have a starting address (base address) kept by SIU 107. Requests from SPU 105 to LPM 219 and the configuration space arrive to the PMU through SIU block 107. The SIU is also aware of the base address of the external packet memory.
In Buffer (IB)
Packet data sent by an external device arrives to the PMU through the network input interface 205 at an average rate of 8 bytes every XCaliber core cycle in a preferred embodiment. 113 block 207 of the PMU receives this data, buffers it, and provides it, in a FIFO-like fashion, to LPM 219 and in some cases also to the SIU (in case of a packet overflow, as explained elsewhere in this specification.
XCaliber DMS processor 201 can potentially send/receive packet data to/from up to 2 independent devices. Each device is tagged in Sru 107 with a device identifier, which is provided along with the packet data. When one device starts sending data from a packet, it will continue to send data from that very same packet until the end of the packet is reached or a bus error is detected by the SIU.
In a preferred embodiment the first byte of a packet always starts at byte 0 of the first 16 bytes sent of that packet. The first two bytes of the packet specify the size in bytes of the packet (including these first two bytes). These two bytes are always appended by the SIU if the external device has not appended them. If byte k in the 16-byte chunk is a valid byte, bytes 0 . . . k-1 are also valid bytes. This can be guaranteed since the first byte of a packet always starts at byte 0. Note that no valid bits are needed to validate each byte since a packet always starts at byte 0 of the 16-byte chunk, and the size of the packet is known up front (in the first two bytes).
The network interface provides, at every core clock, a control bit specifying whether the 16-byte chunk contains, at least, one valid byte. The valid data received from the network input interface is organized in buffer 207. This is an 8-entry buffer, each entry holding the 16-bytes of data plus the control bits associated to each chunk. PMMU 209 looks at the control bits in each entry and determines whether a new packet starts or to which of the (up to) two active packets the data belongs to, and it acts accordingly.
The 16-byte chunks in each of the entries in IB 207 are stored in LPM 219 or in the EPM (not shown). It is guaranteed by either the LPM controller or the SIU that the bandwidth to write into the packet memory will at least match the bandwidth of the incoming packet data, and that the writing of the incoming packet data into the packet memory will have higher priority over other accesses to the packet memory.
In some cases IB 207 may get full because PMMU 209 may be stalled, and therefore the LPM will not consume any more data of the IB until the stall is resolved. Whenever the IB gets full, a signal is sent to network input interface 205, which will retransmit the next 16-byte chunk as many times as needed until the IB accepts it. Thus, no packet data is lost due to the IB getting full.
Out Buffer (OB)
Network output interface 215 also supports a total aggregated bandwith of 20 Gbps (10 Gbps per output device), as does the Input Interface. At 300 MHz XCaliber clock frequency, the network output interface accepts in average 8 bytes of data every XCaliber cycle from the OB block, and sends it to one of the two output devices. The network input and output interfaces are completely independent of each other.
Up to 2 packets (one per output device) can be simultaneously sent. The device to which the packet is sent does not need to correspond to the device that sent the packet in. The packet data to be sent out will come from either LPM 219 or the EPM (not shown).
For each of the two output devices connected at Network Out interface 215, PMMU 209 can have a packet ready to start being downloaded, a packet being downloaded, or no packet to download. Every cycle PMMU 209 selects the highest packet across both output devices and initiates the download of 16 bytes of data for that packet. Whenever the PMMU is downloading packet data from a packet to an output device, no data from a different packet will be downloaded to the same device until the current packet is completely downloaded.
The 16-byte chunks of packet data read from LPM 219 (along with some associated control information) are fed into one of the two 8-entry buffers (one per device identifier). The contents of the head of one of these buffers is provided to the network output interface whenever this interface requests it. When the head of both buffers is valid, the OB provides the data in a round robin fashion.
Differently than the network input interface, in the 16-byte chunk sent to the network output interface it can not be guaranteed that if a byte k is valid, then bytes 0 . . . k-1 are valid as well. The reason for this is that when the packet is being sent out, it does not need to start at byte 0 of the 16-byte chunk in memory. Thus, for each 16-byte chunk of data that contains the start of the packet to be sent out, OB 217 needs to notify the network interface where the first valid byte of the chunk resides. Moreover, since the first two bytes of the packet contain the size of the packet in bytes, the network output interface has the information to figure out where the last valid byte of the packet resides within the last 16-byte chunk of data for that packet. Moreover, OB 217 also provides a control bit that informs SIU 107 whether it needs to compute CRC for the packet, and if so, which type of CRC. This control bit is provided by PMMU 209 to OB 217.
Paging Memory Management Unit (PMMU)
The packet memory address space is 16 MB. Out of the 16 MB, the XCaliber processor features 256 KB on-chip. The rest (or a fraction) is implemented using external storage.
The packet memory address space can be mapped in the TLB of SPU 105 as user or kernel space, and as cachable or uncachable. In case it is mapped cachable, the packet memory space is cached (write-through) into an L1 data cache of SPU 105, but not into an L2 cache.
A goal of PMMU 209 is to store incoming packets (and SPU-generated packets as well) into the packet memory. In case a packet from the network input interface fits into LPM 219, PMMU 209 decides where to store it and generates the necessary write accesses to LPM 219; in case the packet from the network input interface is going to be stored in the EPM, SPU 105 decides where in the EPM the packet needs to be stored and SIU 107 is in charge of storing the packet. In either case, the packet is consecutively stored and a packet identifier is created by PM 209 and sent to QS 211.
SPU 105 can configure LPM 219 so packets larger than a given size will never be stored in the LPM. Such packets, as well as packets that do not fit into the LPM because lack of space, are sent by PMMU 209 to the EPM through SIU 107. This is a mechanism called overflow and is configured by the SPU for the PMU to do so. If no overflow of packets is allowed, then the packet is dropped. In this case, PMMU 209 interrupts the SPU (again, if configured to do so).
Uploading a Packet into Packet Memory
Whenever there is valid data at the head of IB 205, the corresponding device identifier bit is used to determine to which packet (out of the two possible packets being received) the data belongs. When the network input interface starts sending data of a new packet with device identifier d, all the rest of the data will eventually arrive with that same device identifier d unless an error is notified by the network interface block. The network input interface can interleave data from two different device identifiers, but in a given cycle only data from one device is received by IB 207.
When a packet needs to be stored into LPM 219, PMMU block 209 generates all the write addresses and write strobes to LPM 219. If the packet needs to be stored into the EPM, SIU 107 generates them.
FIG. 3 is a diagram illustrating uploading of data into either LPM 219 or the EPM, which is shown in FIG. 3 as element 305, but not shown in FIG. 2. The write strobe to the LPM or EPM will not be generated unless the header of the IB has valid data. Whenever the write strobe is generated, the 16-byte chunk of data at the head of the IB (which corresponds to a LPM line) is deleted from the IB and stored in the LPM or EPM. The device identifier bit of the head of the IB is used to select the correct write address out of the 2 address generators (one per input device).
In the current embodiment only one incoming packet can be simultaneously stored in the EPM by the SIU (i.e. only one overflow packet can be handled by the SIU at a time). Therefore, if a second packet that needs to be overflowed is sent by the network input interface, the data of this packet will be thrown away (i.e. the packet will be dropped).
A Two Byte Packet-size Header
The network input interface always appends two bytes to a packet received from the external device (unless this external device already does so, in which case the SIU will be programmed not to append them). This appended data indicates the size in bytes of the total packet, including the two appended bytes. Thus, the maximum size of a packet that is processed by the XCaliber DMS processor is 65535 bytes including the first two bytes.
The network output interface expects that, when the packet is returned by the PMU (if not dropped during its processing), the first two bytes also indicate the size of the processed packet. The size of the original packet can change (the packet can increase or shrink) as a result of processing performed by the XCaliber processor. Thus, if the processing results in increasing the size beyond 64K-1 bytes, it is the responsibility of software to chop the packet into two different smaller packets.
The PMU is more efficient when the priority of the packet being received is known up front. The third byte of the packet will be used for priority purpose if the external device is capable of providing this information to the PMU. The software programs the PMU to either use the information in this byte or not, which is does through a boot-time configuration register named Log2InQueues.
Dropping a packet
A packet completely stored in either LPM 219 or EPM 305 will be dropped only if SPU 105 sends an explicit command to the PMU to do so. No automatic dropping of packets already stored in the packet memory can occur. In other words, any dropping algorithm of packets received by the XCaliber DMS processor is implemented in software.
There are, however, several situations wherein the PMU may drop an incoming packet. These are (a) The packet does not fit in the LPM and the overflow of packets is disabled, (b) The total amount of bytes received for the packet is not the same as the number of bytes specified by the ASIC in the first two bytes of the ASIC-specific header, or (c) A transmission error has occurred between the external device and the network input interface block of the SIU. The PMMU block is notified about such an error.
For each of the cases (a), (b) and (c) above, an interrupt is generated to the SPU. The software can disable the generation of these interrupts using AutomaticPacketDropIntEnable, PacketErrorIntEnable on-the-fly configuration flags.
Virtual Pages
An important process of PMMU 209 is to provide an efficient way to consecutively store packets into LPM 219 with as little memory fragmentation as possible. The architecture in the preferred embodiment provides SPU 105 with a capability of grouping, as much as possible, packets of similar size in the same region of LPM 219. This reduces overall memory fragmentation.
To implement the low-fragmentation feature, LPM 219 is logically divided into 4 blocks of 64 KB bytes each. Each block is divided into fixed atomic pages of 256 bytes. However, every block has virtual pages that range from 256 bytes up to 64 KB, in power-of-2 increments. Software can enable/disable the different sizes of the virtual pages for each of the 4 blocks using an on-the-fly configuration register named VirtualPageEnable. This allows configuring some blocks to store packets of up to a certain size.
The organization and features of the PMU assure that a packet of size s will never be stored in a block with a maximum virtual page size less than s. However, a block with a minimum virtual page size of r will accept packets of size smaller than r. This will usually be the case, for example, in which another block or blocks are configured to store these smaller packets, but is full.
Software can get ownership of any of the four blocks of the LPM, which implies that the corresponding 64 KB of memory will become software managed. A configuration flag exists per block (SoftwareOwned) for this purpose. The PMMU block will not store any incoming packet from the network input interface into a block in the LPM with the associated SoftwareOwned flag asserted. Similarly, the PMMU will not satisfy a GetSpace operation (described elsewhere) with memory of a block with its SoftwareOwned flag asserted. The PMMU, however, is able to download any packet stored by software in a software-owned block.
The PMMU logic determines whether an incoming packet fits in any of the blocks of the LPM. If a packet fits, the PMMU decides in which of the four blocks (since the packet may fit in more than one block), and the first and last atomic page that the packet will use in the selected block. The atomic pages are allocated for the incoming packet. When packet data stored in an atomic page has been safely sent out of the XCaliber processor through the network output interface, the corresponding space in the LPM can be de-allocated (i.e. made available for other incoming packets).
The EPM, like the LPM is also logically divided into atomic pages of 256 bytes. However, the PMMU does not maintain the allocation status of these pages. The allocation status of these pages is managed by software. Regardless of where the packet is stored, the PMMU generates an offset (in atomic pages) within the packet memory to where the first data of the packet is stored. This offset is named henceforth packetPage. Since the maximum size of the packet memory is 16 MB, the packetPage is a 16-bit value.
As soon as the PMMU safely stores the packet in the LPM, or receives acknowledgement from SIU 107 that the last byte of the packet has been safely stored in the EPM, the packetPage created for that packet is sent to the QS. Operations of the QS are described in enabling detail below.
Generating the packetPage offset
The PMMU always monitors the device identifier (deviceId) associated to the packet data at the head of the IB. If the deviceId is not currently active (i.e. the previous packet sent by that deviceId has been completely received), that indicates that the head of the IB contains the first data of a new packet. In this case, the first two bytes (byte0 and byte1 in the 16-byte chunk) specify the size of the packet in bytes. With the information of the size of the new incoming packet, the PMMU determines whether the packet fits into LPM 219 and, if it does, in which of the four blocks it will be stored, plus the starting and ending atomic pages within that block.
The required throughput in the current embodiment of the PMMU to determine whether a packet fits in LPM 219 and, if so, which atomic pages are needed, is one packet every two cycles. One possible two-cycle implementation is as follows: (a) The determination happens in one cycle, and only one determination happens at a time (b) In the cycle following the determination, the atomic pages needed to store the packet are allocated and the new state (allocated/de-allocated) of the virtual pages are computed. In this cycle, no determination is allowed.
FIG. 4a is a diagram illustrating determination and allocation in parallel for local packet memory. The determination logic is performed in parallel for all of the four 64 KB blocks as shown.
FIG. 4b shows the state that needs to be maintained for each of the four 64 KB blocks. This state, named AllocationMatrix, is recomputed every time one or more atomic pages are allocated or de-allocated, and it is an input for the determination logic. The FitsVector and IndexVector contain information computed from the AllocationMatrix.
AllocationMatrix[VPSize] [VPIndex] indicates whether virtual page number VPIndex of size VPSize in bytes is already allocated or not. FitsVector[VPSize] indicates whether the block has at least one non-allocated virtual page of size VPSize. If FitsVector[VPSize] is asserted, IndexVector[VPSize] vector contains the index of a non-allocated virtual page of size VPSize.
The SPU programs which virtual page sizes are enabled for each of the blocks. The EnableVector[VPSize] contains this information. This configuration is performed using the VirtualPageEnable on-the-fly configuration register. Note that the AllocationMatrix[ ] [], FitsVector[ ], IndexVector[ ] and EnableVector[ ] are don't cares if the corresponding SoftwareOwned flag is asserted.
In this example the algorithm for the determination logic (for a packet of size s bytes) is as follows: 1) Fits logic: check, for each of the blocks, whether the packet fits in or not. If it fits, remember the virtual page size and the number of the first virtual page of that size.
TABLE-US-00001 For All Block j Do (can be done in parallel): Fits[j] = (s <= VPSize) AND FitsVector[VPSize] AND Not SoftwareOwned where VPSizeis the smallest possible page size. If (Fits[j]) VPIndex[j] = IndexVector[VPSize] MinVPS[j] = VPSize Else MinVPS[j] = <Infinity>
2) Block selection: the blocks with the smallest virtual page (enabled or not) that is able to fit the packet in are candidates. The block with the smallest enabled virtual page is selected.
TABLE-US-00002 If Fits[j] = FALSE for all j Then <Packet does not fit in LPM> packetPage = OverflowAddress>>8 Else C = set of blocks with smallest MinVPS AND Fits[MinVPS] B = block# in C with the smallest enabled virtual page (if more than one exists, pick the smallest block number) If one or more blocks in C have virtual pages enabled Then Index = VPIndex[B] VPSize = MinVPS[B] NumAPs = ceil(S/256) packetPage = (B*64KB + Index*VPSize)>> Else <Packet does not fit in LPM> packetPage = OverflowAddress>>8
If the packet fits in the LPM, the packetPage created is then the atomic page number within the LPM (there are up to 1K different atomic pages in the LPM) into which the first data of the packet is stored. If the packet does not fit, then the packetPage is the contents of the configuration register OverflowAddress right-shifted 8 bits. The packet overflow mechanism is described elsewhere in this specification, with a subheader "Packet overflow".
In the cycle following the determination of where the packet will be stored, the new values of the AllocationMatrix, FitsVector and IndexVector must be recomputed for the selected block. If FitsVector[VPSize] is asserted, then IndexVector[ VPSize] is the index of the largest non-allocated virtual page possible for the corresponding virtual page size. If FitsVector[VPSize] is de-asserted, then IndexVector[VPSize] is undefined.
The number of atomic pages needed to store the packet is calculated (NumAPs) and the corresponding atomic pages are allocated. The allocation of the atomic pages for the selected block (B) is done as follows: 1. The allocation status of the atomic pages in AllocationMatrix[APsize] [j . . . k], j being the first atomic page and k the last one (k-j+1=NumAPs), are set to allocated. 2. The allocation status of the virtual pages in AllocationMatrix[r] [s] are updated following the mesh structure in FIG. 4b. (a 2.sup.k+1-byte virtual page will be allocated if any of the two 2.sup.k-byte virtual pages that it is composed of is allocated).
When the packetPage has been generated, it is sent to the QS for enqueueing. If the QS is full (very rare), it will not be able to accept the packetPage being provided by the PMMU. In this case, the PMMU will not be able to generate a new packetPage for the next new packet. This puts pressure on the IB, which might get full if the QS remains full for several cycles.
The PMMU block also sends the queue number into which the QS has to store the packetPage. How the PMMU generates this queue number is described below in sections specifically allocated to the QS.
Page Allocation Example
FIGS. 5a and 5b illustrate an example of how atomic pages are allocated. For simplicity, the example assumes 2 blocks (0 and 1) of 2 KB each, with an Atomic page size of 256 bytes, and both blocks have their SoftwareOwned flag de-asserted. Single and double cross-hatched areas represent allocated virtual pages (single cross-hatched pages correspond to the pages being allocated in the current cycle). The example shows how the pages get allocated for a sequence of packet sizes of 256, 512, 1K and 512 bytes. Note that, after this sequence, a 2K-byte packet, for example, will not fit in the example LPM.
Whenever the FitsVector[VPSize] is asserted, the IndexVector[VPSize] contains the largest non-allocated virtual page index for virtual page size VPSize. The reason for choosing the largest index is that the memory space is better utilized. This is shown in FIGS. 6a and 6b, where two 256-byte packets are stored in a block. In scenario A, the 256-byte virtual page is randomly chosen, whereas in scenario B, the largest index is always chosen. As can be seen, the block in scenario A only allows two 512-byte virtual pages, whereas the block in scenario B allows three. Both, however, allow the same number of 256-byte packets since this is the smallest allocation unit. Note that the same effect is obtained by choosing the smallest virtual page index number all the time.
Packet Overflow
The only two reasons why a packet cannot be stored in the LPM are (a) that the size of the packet is larger than the maximum virtual page enabled across all 4 blocks; or (b) that the size of the packet is smaller than or equal to the maximum virtual page enabled but no space could be found in the LPM.
When a packet does not fit into the LPM, the PMMU will overflow the packet through the SIU into the EPM. To do so, the PMMU provides the initial address to the SIU (16-byte offset within the packet memory) to where the packet will be stored. This 20-bit address is obtained as follows: (a) The 16 MSB bits correspond to the 16 MSB bits of the OverflowAddress configuration register (i.e. the atomic page number within the packet memory). (b) The 4 LSB bits correspond to the HeaderGrowthOffset configuration register. The packetPage value (which will be sent to the QS) for this overflowed packet is then the 16 MSB bits of the OverflowAddress configuration register.
If the on-the-fly configuration flag OverflowEnable is asserted, the PMMU will generate an OverflowStartedInt interrupt. When the OverflowStartedInt interrupt is generated, the size in bytes of the packet to overflow is written by the PMMU into the SPU-read-only configuration register SizeOfOverflowedPacket. At this point, the PMMU sets an internal lock flag that will prevent a new packet from overflowing. This lock flag is reset when the software writes into the on-the-fly configuration register OverflowAddress. If a packet needs to be overflowed but the lock flag is set, the packet will be dropped.
With this mechanism, it is guaranteed that only one interrupt will be generated and serviced per packet that is overflowed. This also creates a platform for software to decide where the starting address into which the next packet that will be overflowed will be stored is visible to the interrupt service routine through the SizeOfOverflowedPacket register. In other words, software manages the EPM.
If software writes the OverflowAddress multiple times in between two OverflowStartedInt interrupts, the results are undefined. Moreover, if software sets the 16 MSB bits of OverflowAddress to 0 . . . 1023, results are also undefined since the first 1K atomic pages in the packet memory correspond to the LPM.
Downloading a Packet from Packet Memory
Eventually the SPU will complete the processing of a packet and will inform the QS of the fact. At this point the packet may be downloaded from memory, either LPM or EPM, and sent, via the OB to one of the connected devices. FIG. 7 is a top-level schematic of the blocks of the XCaliber DMS processor involved in the downloading of a packet, and the elements in FIG. 7 are numbered the same as in FIG. 2. The downloading process may be followed in FIG. 7 with the aid of the following descriptions.
When QS 211 is informed that processing of a packet is complete, the QS marks this packet as completed and, a few cycles later (depending on the priority of the packet), the QS provides to PMMU 209 (as long as the PMMU has requested it) the following information regarding the packet: (a) the packetPage (b) the priority (cluster number from which it was extracted) (c) the tail growth/shrink information (described later in spec) (d) the outbound device identifier bit (e) the CRC type field (described later in spec) (f) the KeepSpace bit
The device identifier sent to PMMU block 209 is a 1-bit value that specifies the external device to which the packet will be sent. This outbound device identifier is provided by software to QS 211 as a 2-bit value.
If the packet was stored in LPM 219, PMMU 209 generates all of the (16-byte line) read addresses and read strobes to LPM 219. The read strobes are generated as soon as the read address is computed and there is enough space in OB 217 to buffer the line read from LPM 219. Buffer d in the OB is associated to device identifier d. This buffer may become fill for either two reasons: (a) The external device d temporarily does not accept data from XCaliber; or (b) The rate of reading data from the OB is lower than the rate of writing data into it.
As soon as the packet data within an atomic page has all been downloaded and sent to the OB, that atomic page can be de-allocated. The de-allocation of one or more atomic pages follows the same procedure as described above. However, no de-allocation of atomic pages occurs if the LPM bit is de-asserted. The KeepSpace bit is a don't care if the packet resides in EPM 701.
If the packet was stored in EPM 701, PMMU 209 provides to SIU 107 the address within the EPM where the first byte of the packet resides. The SIU performs the downloading of the packet from the EPM. The SIU also monitors the buffer space in the corresponding buffer in OB 217 to determine whether it has space to write the 16-byte chunk read from EPM 701. When the packet is fully downloaded, the SIU informs the PMMU of the fact so that the PMMU can download the next packet with the same device identifier.
When two packets (one per device) are being simultaneously sent, data from the packet with highest priority is read out of the memory first. This preemption can happen at a 16-byte boundary or when the packet finishes its transmission. If both packets have the same priority (provided by the QS), a round-robin method is used to select the packet from which data will be downloaded next. This selection logic also takes into account how full the two buffers in the OB are. If buffer d is full, for example, no packet with a device identifier d will be selected in the PM | | |