United States Patent Application
20020071393
Kind Code
A1
Musoll, Enrique
June 13, 2002
Functional validation of a packet management unit
Abstract
A validation system is disclosed for validating function of a packet-management unit operationally coupled through a system interface to a processing unit of a processor system. The validation system comprises a user interface for creating and inputting test parameters and test code into the system, a test generator coupled to the user interface, the test generator for generating input packet activity in the form of a packet stream, a model coupled to the test generator for emulating separate and integrated function of the packet management unit, the system interface, and a stream-processing unit, and an evaluation software for checking and validating or not validating results. The system validation function relies, in a preferred embodiment, on comparing output results with criteria of the selected test code, resulting in an indication of pass or failure of the test. In a preferred embodiment, the system also notifies the user of the cause of failure.
Inventors:
Musoll; Enrique
(San Jose, CA)
Correspondence Name and Address:
PO BOX 187
CENTRAL COAST PATENT AGENCY
AROMAS
CA
95004
US
Series Code:
954290
Filed:
September 11, 2001
U.S. Current Class:
370/248;
370/412
U.S. Class at Publication:
370/248;
370/412
Intern'l Class:
H04L 012/56
Claims
What is claimed is:
1. A validation system for validating function of a packet-management unit (PMU) operationally coupled through a system interface to a processing unit of a packet processor, the validation system comprising: a user interface for creating and inputting test parameters and test code into the system; a test generator coupled to the user interface, the test generator for generating input packet activity in the form of a packet stream; a model coupled to the test generator for emulating separate and integrated function of the packet management unit, the system interface, and a stream-processing unit; and an evaluation software for checking and validating or not validating results; characterized in that a user inputs criteria and a selected test code into the test generator whereupon the test generator generates an input packet stream of an associated workload for input into the model and whereupon the model processes the packets and generates output activity that is compared to criteria of the selected test code resulting in an indication of pass or failure of the test.
2. The validation system of claim 1 wherein the user interface is a computer.
3. The validation system of claim 1 wherein the model is a software model running on a processor-based machine.
4. The validation system of claim 1 wherein the test code comprises a plurality of values representing different combinations of possible test variables associated with treating data packets in process.
5. The validation system of claim 3 wherein the model emulates integrated function of a data packet router having hardware and software controlled memory.
6. The validation system of claim 4, wherein the test variables include the possibility of packet modification by software, packet insertion by software, packet dropping by hardware or software, and packet reordering by software.
7. The validation system of claim 6, wherein each of the test variables are configured to be constrained or not in any specific combination, a specific combination thereof equating to one selectable test code value of a plurality of configured values.
8. The validation system of claim 1, wherein the test is terminated after a specific number of cycles input before the test is performed.
9. The validation system of claim 1 wherein a sweep packet of low processing priority is input after the test packets, and the test is determined to be complete, and is terminated, when the sweep packet is output by the model under test.
10. The validation system of claim 1 wherein the test is determined to be complete, and is terminated, when the number of packets output by the model equals the number of test packets input, plus any packets generated by the model, less any packets dropped automatically.
11. The validation system of claim 1 wherein a packet identifier is associated with every test packet, and the workload to be executed when the packet is activated is known by referring to the identifier.
12. A method for validating function of a packet-management unit (PMU) operationally coupled through a system interface to a processing unit of a packet processor, comprising: (a) specifying a list of test parameters and selecting test code for use in a validation test run; (b) inputting the specified and selected data into a test generator for generating a test; (c) converting, within the generator, the specified and selected data values into input vectors representing a data packet stream and associated workload; (d) inputting the generated data packet stream and associated workload into a model, the model simulating singular and integrated functions of the packet-management unit, the system interface, and a stream processing unit; (e) outputting from the model, an output activity representing the input data packet stream after processing; and (f) examining the output activity according to input parameters and criteria of the selected test code to determine if the concluded test has passed or failed.
13. The method of claim 12 wherein step (a) is performed by a user operating a computer.
14. The method of claim 12 wherein in step (d) the model is a software model running on a processor-based machine.
15. The method of claim 12 wherein in step (a) the test code comprises a plurality of values representing different combinations of possible test variables associated with treating data packets in process.
16. The method of claim 12 wherein in step (d) the model emulates integrated function of a data packet router having hardware and software controlled memory.
17. The method of claim 15 wherein in step (a) the test variables include the possibility of packet modification by software, packet insertion by software, packet dropping by hardware or software, and packet reordering by software.
18. The method of claim 17 wherein in step (a) each of the test variables are configured to be constrained or not in any specific combination, a specific combination thereof equating to one selectable test code value of a plurality of configured values.
19. The method of claim 12 wherein in step (b) the specified data comprises determined value ranges assigned to a plurality of pre-determined characteristics of packet processing function.
20. The method of claim 12 wherein in step (d) inputting the generated data packet stream is an automated process.
21. The method of claim 12 wherein a step (g) is added in case of failure at step (f) wherein notification is sent back to the user containing an explanation of the cause of failure.
22. The method of claim 12 comprising an additional step to determine a test is complete wherein the test is terminated after a specific number of cycles input before the test is performed.
23. The method of claim 12 comprising an additional step to determine a test is complete by inserting after the test packets a sweep packet of low processing priority, and determining the test to be complete when the sweep packet is output by the model under test.
24. The method of claim 12 comprising an additional step to determine a test is complete when the number of packets output by the model equals the number of test packets input, plus any packets generated by the model, less any packets dropped automatically.
25. The method of claim 12 comprising steps for associating a packet identifier with every test packet, and determining the workload to be executed when the packet is activated by referring to the identifier.
Description
CROSS-REFERENCE TO RELATED DOCUMENTS
[0001] The present invention is a continuation in part (CIP) to a U.S. patent application Ser. No. 09/737,375 entitled "Queuing System for Processors in Packet Routing Operations" and filed on Dec. 14, 2000, which is included herein by reference. In addition, Ser. No. 09/737,375 claims priority benefit under 35 U.S.C. 119 (e) of Provisional Patent Application Ser. No. 60/181,364 filed on Feb. 8, 2000, and incorporates all disclosure of the prior application by reference. The present application is also a continuation in part of applications Ser. No. 09/608,750, filed on Jun. 30, 2000 and Ser. No. 09/602,279, filed Jun. 23, 2000 and incorporates all of their disclosure by reference.
FIELD OF THE INVENTION
[0002] The present invention is in the field of digital processing and pertains to apparatus and methods for processing packets in routers for packet networks, and more particularly to apparatus and methods for validating packet management hardware functions and design integrity in process.
BACKGROUND OF THE INVENTION
[0003] The Internet is a well-known, publicly-accessible communication network at the time of filing the present patent application, and arguably the most robust information and communication source ever made available. The Internet is used as a prime example in the present application of a data-packet-network which will benefit from the apparatus and methods taught in the present patent application, but is just one such network, following a particular standardized protocol. As is also very well known, the Internet (and related networks) are always a work in progress. That is, many researchers and developers are competing at all times to provide new and better apparatus and methods, including software, for enhancing the operation of such networks.
[0004] In general the most sought-after improvements in data packet networks are those that provide higher speed in routing (more packets per unit time) and better reliability and fidelity in messaging. What is generally needed are router apparatus and methods increasing the rates at which packets may be processed in a router.
[0005] As is well-known in the art, packet routers are computerized machines wherein data packets are received at any one or more of typically multiple ports, processed in some fashion, and sent out at the same or other ports of the router to continue on to downstream destinations. As an example of such computerized operations, keeping in mind that the Internet is a vast interconnected network of individual routers, individual routers have to keep track of which external routers to which they are connected by communication ports, and of which of alternate routes through the network are the best routes for incoming packets. Individual routers must also accomplish flow accounting, with a flow generally meaning a stream of packets with a common source and end destination. A general desire is that individual flows follow a common path. The skilled artisan will be aware of many such requirements for computerized processing.
[0006] Typically a router in the Internet network will have one or more Central Processing Units (CPUs) as dedicated microprocessors for accomplishing the many computing tasks required. In the current art at the time of the present application, these are single-streaming processors; that is, each processor is capable of processing a single stream of instructions. In some cases developers are applying multiprocessor technology to such routing operations. The present inventors have been involved for some time in development of dynamic multi-streaming (DMS) processors, which processors are capable of simultaneously processing multiple instruction streams. One preferred application for such processors is in the processing of packets in packet networks like the Internet.
[0007] In the provisional patent application listed in the Cross-Reference to Related Documents above there are descriptions and drawings for a preferred architecture for DMS application to packet processing. One of the functional areas in that architecture is a packet management unit (PMU) comprising hardware and circuitry for processing data packets.
[0008] As described with reference to Ser. No. 09/737,375 in FIG. 1 above the PMU is the part of the processor, known as the XCaliber processor in some instances, that offloads the streaming processor unit (SPU) from performing costly packet header accesses and packet sorting and management tasks, which might otherwise seriously degrade performance of the overall processor.
[0009] Packet management functions of the PMU include managing on-chip local packet memory (LPM) for packet storage, uploading packet header information from incoming packets into different contexts registers of the XCaliber processor, and maintaining packet identifiers of the packets currently in process in the XCaliber processor.
[0010] There are at least two known means of functionally verifying a PMU. One of these involves using well-known verification techniques, but these are suitable typically for only small designs, and the formal verification technology is not advanced enough. Another is to compare performance of a PMU of unknown quality with an already-verified model. A model can be a completed and functional chip, a model made of pieces of other chips, or a model made of part hardware and part software. A problem here is that, for PMUs of the sort to be tested and verified, there is no verified model, and a first model needs to be verified somehow.
[0011] Therefore, what is clearly needed is a reliable and cost-effective method and apparatus for validating packet-managing (PMU) functions in a packet processor, in the absence of an existing and verified model. The present invention teaches apparatus and methods to fill this need.
SUMMARY OF THE INVENTION
[0012] In a preferred embodiment of the present invention, a validation system is provided for validating function of a packet-management unit (PMU) operationally coupled through a system interface to a processing unit of a packet processor. The validation system comprises a user interface for creating and inputting test parameters and test code into the system, a test generator coupled to the user interface, the test generator for generating input packet activity in the form of a packet stream, a model coupled to the test generator for emulating separate and integrated function of the packet management unit, the system interface, and the stream-processing unit, and an evaluation software for checking and validating or not validating results.
[0013] A user inputs criteria and a selected test code into the test generator whereupon the test generator generates an input packet stream and an associated workload for input into the model and whereupon the model processes the packets and generates output activity that is compared to criteria of the selected test code resulting in an indication of pass or failure of the test.
[0014] In one aspect, the user interface is a computer. Also, in one aspect, the model is a software model running on a processor-based machine. In a preferred aspect, the test code comprises a plurality of values representing different combinations of possible test variables associated with treating data packets in process. In a preferred aspect, the model emulates integrated function of a data packet router having hardware and software controlled memory.
[0015] In a preferred aspect, the test variables include the possibility of packet modification by software, packet insertion by software, packet dropping by hardware or software, and packet reordering by software. In this aspect, each of the test variables are configured to be constrained or not in any specific combination, a specific combination thereof equating to one selectable test code value of a plurality of configured values.
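The mapping from combinations of test variables to a single selectable test code value can be pictured as a bit mask. The following is only a minimal sketch under assumed conventions: the flag names, the bit positions, and the rule that a set bit means "this behavior is permitted in the test" are all hypothetical and are not taken from the specification or from FIG. 54.

```python
from enum import IntFlag


class Allowed(IntFlag):
    """Hypothetical flags: a set bit means the behavior is NOT constrained."""
    NONE = 0
    SW_MODIFY = 1    # packet modification by software
    SW_INSERT = 2    # packet insertion by software
    DROP = 4         # packet dropping by hardware or software
    SW_REORDER = 8   # packet reordering by software


def all_test_codes():
    """Enumerate every constrained/unconstrained combination as one code."""
    return [Allowed(value) for value in range(16)]


def describe(code):
    """List the names of the behaviors that a given code leaves unconstrained."""
    singles = (Allowed.SW_MODIFY, Allowed.SW_INSERT,
               Allowed.DROP, Allowed.SW_REORDER)
    return [flag.name for flag in singles if flag in Allowed(code)]
```

With four variables this assumed encoding yields sixteen selectable codes; for example, code 5 would permit software modification and dropping while constraining insertion and reordering.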
[0016] In some cases the test is terminated after a specific number of cycles input before the test is performed; while in other cases a sweep packet of low processing priority is input after the test packets, and the test is determined to be complete, and is terminated, when the sweep packet is output by the model under test; and in still other cases the test is determined to be complete, and is terminated, when the number of packets output by the model equals the number of test packets input, plus any packets generated by the model.
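The three termination conditions can be written as simple predicates. This is an illustrative sketch only; the function and parameter names are hypothetical. The count-based rule follows the wording of claim 10, which also subtracts packets dropped automatically.

```python
def done_by_cycles(current_cycle, max_cycles):
    """Terminate once a cycle budget, fixed before the test starts, is used up."""
    return current_cycle >= max_cycles


def done_by_sweep(output_packet_ids, sweep_packet_id):
    """Terminate once the low-priority sweep packet appears at the output."""
    return sweep_packet_id in output_packet_ids


def done_by_count(n_output, n_input, n_generated, n_dropped):
    """Terminate once output count = input + model-generated - dropped."""
    return n_output == n_input + n_generated - n_dropped
```

The sweep-packet rule relies on the sweep packet having lower priority than every test packet, so that it leaves the model only after the test packets have been handled.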
[0017] In preferred embodiments a packet identifier is associated with every test packet, and the workload to be executed when the packet is activated is known by referring to the identifier.
[0018] In another aspect of the present invention, a method is provided for validating function of a packet-management unit (PMU) operationally coupled through a system interface to a processing unit of a packet processor. The method comprises the steps of, (a) specifying a list of test parameters and selecting test code for use in a validation test run, (b) inputting the specified and selected data into a test generator for generating a test; (c) converting, within the generator, the specified and selected data values into input vectors representing a data packet stream and associated workload, (d) inputting the generated data packet stream and associated workload into a model, the model simulating singular and integrated functions of the packet-management unit, the system interface, and the stream processing unit, (e) outputting from the model, an output activity representing the input data packet stream after processing and (f) examining the output activity according to input parameters and criteria of the selected test code to determine if the concluded test has passed or failed.
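Steps (a) through (f) can be illustrated end to end with a toy harness. Everything in this sketch is an assumption made for illustration: the parameter names, the `Packet` type, and especially `toy_model`, which merely forwards packets unchanged and stands in for the actual model of the PMU, system interface, and stream processing unit.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    packet_id: int
    size: int


def generate_stream(params, seed=0):
    """Steps (b)-(c): turn parameter ranges into a concrete packet stream."""
    rng = random.Random(seed)
    low, high = params["size_range"]
    return [Packet(i, rng.randint(low, high))
            for i in range(params["num_packets"])]


def toy_model(stream):
    """Steps (d)-(e): placeholder model that forwards every packet unchanged."""
    return list(stream)


def check(stream, output, allow_drop=False, allow_reorder=False):
    """Step (f): compare output activity with the criteria of the test code.

    Returns (passed, reason) so that a failure can be explained to the user.
    """
    in_ids = [p.packet_id for p in stream]
    out_ids = [p.packet_id for p in output]
    missing = [i for i in in_ids if i not in out_ids]
    if missing and not allow_drop:
        return False, f"packets {missing} were dropped but drops are forbidden"
    surviving = [i for i in in_ids if i in out_ids]
    if out_ids != surviving and not allow_reorder:
        return False, "output order differs but reordering is forbidden"
    return True, "ok"
```

A run would specify parameters (step (a)), call `generate_stream`, feed the result to the model, and pass both the input and the output to `check`; the returned reason string corresponds loosely to the failure notification described for optional step (g).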
[0019] In a preferred embodiment, step (a) is performed by a user operating a computer. In one aspect of the method in step (d), the model is a software model running on a processor-based machine. In preferred aspects of the method in step (a), the test code comprises a plurality of values representing different combinations of possible test variables associated with treating data packets in process. In one aspect of the method in step (d), the model emulates integrated function of a data packet router having hardware and software controlled memory.
[0020] In a preferred aspect of the method in step (a) the test variables include the possibility of packet modification by software, packet insertion by software, packet dropping by hardware or software, and packet reordering by software. In this aspect, each of the test variables are configured to be constrained or not in any specific combination, a specific combination thereof equating to one selectable test code value of a plurality of configured values. In one aspect of the method in step (b), the specified data comprises determined value ranges assigned to a plurality of pre-determined characteristics of packet processing function. In another aspect of the method in step (d) inputting the generated data packet stream is an automated process. In an alternative aspect of the method, a step (g) is added in case of failure at step (f) wherein notification is sent back to the user containing an explanation of the cause of failure.
[0021] In some cases of the method there is an additional step to determine a test is complete wherein the test is terminated after a specific number of cycles input before the test is performed; and in other cases an additional step determines a test is complete by inserting after the test packets a sweep packet of low processing priority, and determining the test to be complete when the sweep packet is output by the model under test; and in still other cases an additional step determines a test is complete when the number of packets output by the model equals the number of test packets input, plus any packets generated by the model.
[0022] In some embodiments of the method there are steps for associating a packet identifier with every test packet, and determining the workload to be executed when the packet is activated by referring to the identifier.
[0023] Now, for the first time, a reliable and cost-effective method and apparatus is provided for validating PMU function in a packet processor. A method such as this is used to validate PMU functionality under simulation and to accurately troubleshoot any design flaws or performance issues before field implementation.
BRIEF DESCRIPTION OF THE DRAWINGS
[0024] FIG. 1 is a simplified block diagram showing relationship of functional areas of a DMS processor in a preferred embodiment of the present invention.
[0025] FIG. 2 is a block diagram of the DMS processor of FIG. 1 showing additional detail.
[0026] FIG. 3 is a block diagram illustrating uploading of data into the LPM or EPM in an embodiment of the invention.
[0027] FIG. 4a is a diagram illustrating determination and allocation for data uploading in an embodiment of the invention.
[0028] FIG. 4b is a diagram showing the state that needs to be maintained for each of the four 64 KB blocks.
[0029] FIGS. 5a and 5b illustrate an example of how atomic pages are allocated in an embodiment of the present invention.
[0030] FIGS. 6a and 6b illustrate how memory space is efficiently utilized in an embodiment of the invention.
[0031] FIG. 7 is a top-level schematic of the blocks of the XCaliber PMU unit involved in the downloading of a packet.
[0032] FIG. 8 is a diagram illustrating the phenomenon of packet growth and shrink.
[0033] FIG. 9 is a block diagram showing high-level communication between the QS and other blocks in the PMU and SPU in an embodiment of the present invention.
[0034] FIG. 10 is a table illustrating six different modes in an embodiment of the invention into which the QS can be configured.
[0035] FIG. 11 is a diagram illustrating generic architecture of the QS of FIGS. 2 and 7 in an embodiment of the present invention.
[0036] FIG. 12 is a table indicating coding of the outbound DeviceId field in an embodiment of the invention.
[0037] FIG. 13 is a table illustrating priority mapping for RTU transfers in an embodiment of the invention.
[0038] FIG. 14 is a table showing allowed combinations of Active, Completed, and Probed bits for a valid packet in an embodiment of the invention.
[0039] FIG. 15 is a Pattern Matching Table in an embodiment of the present invention.
[0040] FIG. 16 illustrates the format of a mask in an embodiment of the invention.
[0041] FIG. 17 shows an example of a pre-load operation using the mask in FIG. 16.
[0042] FIG. 18 shows the PMU Configuration Space in an embodiment of the present invention.
[0043] FIGS. 19a, 19b and 19c are a table of Configuration register Mapping.
[0044] FIG. 20 is an illustration of a PreloadMaskNumber configuration register.
[0045] FIG. 21 illustrates a PatternMatchingTable in a preferred embodiment of the present invention.
[0046] FIG. 22 illustrates a VirtualPageEnable configuration register in an embodiment of the invention.
[0047] FIG. 23 illustrates a ContextSpecificPatternMatchingMask configuration register in an embodiment of the invention.
[0048] FIG. 24 illustrates the MaxActivePackets configuration register in an embodiment of the present invention.
[0049] FIG. 25 illustrates the TimeCounter configuration register in an embodiment of the present invention.
[0050] FIG. 26 illustrates the StatusRegister configuration register in an embodiment of the invention.
[0051] FIG. 27 is a schematic of a Command Unit and command queues in an embodiment of the present invention.
[0052] FIG. 28 is a table showing the format of command inserted in command queues in an embodiment of the present invention.
[0053] FIG. 29 is a table showing the format for responses that different blocks generate back to the CU in an embodiment of the invention.
[0054] FIG. 30 shows a performance counter interface between the PMU and the SIU in an embodiment of the invention.
[0055] FIG. 31 shows a possible implementation of internal interfaces among the different units in the PMU in an embodiment of the present invention.
[0056] FIG. 32 is a diagram of a BypassHooks configuration register in an embodiment of the invention.
[0057] FIG. 33 is a diagram of an InternalStateWrite configuration register in an embodiment of the invention.
[0058] FIGS. 34-39 comprise a table listing events related to performance counters in an embodiment of the invention.
[0059] FIG. 40 is a table illustrating the different bypass hooks implemented in the PMU in an embodiment of the invention.
[0060] FIG. 41 is a table relating architecture and hardware blocks in an embodiment of the present invention.
[0061] FIGS. 42-45 comprise a table showing SPU-PMU Interface in an embodiment of the invention.
[0062] FIGS. 46-49 comprise a table showing SIU-PMU Interface in an embodiment of the invention.
[0063] FIG. 50 is a diagram of a unit configuration of a multi-streaming processor according to an embodiment of the present invention.
[0064] FIG. 51 is a diagram of valid ordering of flows, according to an embodiment of the present invention.
[0065] FIG. 52 is a diagram of the PMU validation environment according to an embodiment of the present invention.
[0066] FIG. 53 is a flow diagram illustrating an automated validation test process according to an embodiment of the present invention.
[0067] FIG. 54 is a table illustrating generated test codes according to an embodiment of the present invention.
DESCRIPTION OF THE PREFERRED EMBODIMENTS
[0068] In the provisional patent application Ser. No. 60/181,364 referenced above there is disclosure as to the architecture of a DMS processor, termed by the inventors the XCaliber processor, which is dedicated to packet processing in packet networks. Two extensive diagrams are provided in the referenced disclosure, one, labeled NIO Block Diagram, shows the overall architecture of the XCaliber processor, with input and output ports to and from a packet-handling ASIC, and the other illustrates numerous aspects of the Generic Queue shown in the NIO diagram. The NIO system in the priority document equates to the Packet Management Unit (PMU) in the present specification. It is to the several aspects of the generic queue that the present application is directed.
[0069] FIG. 1 is a simplified block diagram of an XCaliber DMS processor 101 with a higher-level subdivision of functional units than that shown in the NIO diagram of the priority document. In FIG. 1 XCaliber DMS processor 101 is shown as organized into three functional areas. An outside System Interface Unit (SIU) area 107 provides communication with outside devices, that is, external to the XCaliber processor, typically for receiving and sending packets. Inside, processor 101 is divided into two broad functional units, a Packet Management Unit (PMU) 103, equating to the NIO system in the priority document mentioned above, and a Stream Processor Unit (SPU) 105. The functions of the PMU include accounting for and managing all packets received and processed. The SPU is responsible for all computational tasks.
[0070] The PMU is a part of the XCaliber processor that offloads the SPU from performing costly packet header accesses and packet sorting and management tasks, which would otherwise seriously degrade performance of the overall processor.
[0071] Packet management is achieved by (a) Managing on-chip memory allocated for packet storage, (b) Uploading, in the background, packet header information from incoming packets into different contexts (context registers, described further below) of the XCaliber processor, (c) Maintaining, in a flexible queuing system, packet identifiers of the packets currently in process in the XCaliber.
[0072] The described packet management and accounting tasks performed by the PMU are performed in parallel with processing of packets by the SPU core. To implement this functionality, the PMU has a set of hardware structures to buffer packets incoming from the network, provide them to the SPU core and, if needed, send them out to the network when the processing is completed. The PMU features a high degree of programmability of several of its functions, such as configuration of its internal packet memory storage and a queuing system, which is a focus of the present patent application.
[0073] FIG. 2 is a block diagram of the XCaliber processor of FIG. 1 showing additional detail. SIU 107 and SPU 105 are shown in FIG. 2 as single blocks with the same element numbers used in FIG. 1. The PMU is shown in considerably expanded detail, however, with communication lines shown between elements.
[0074] In FIG. 2 there is shown a Network/Switching Fabric Interface 203 which is in some cases an Application Specific Integrated Circuit (ASIC) dedicated for interfacing directly to a network, such as the Internet for example, or to switching fabric in a packet router, for example, receiving and transmitting packets, and transacting the packets with the XCaliber processor. In this particular instance there are two in ports and two out ports communicating with processor 201. Network in and out interface circuitry 205 and 215 handle packet traffic onto and off the processor, and these two interfaces are properly a part of SIU 107, although they are shown separately in FIG. 2 for convenience.
[0075] Also at the network interface within the PMU there are, in processor 201, input and output buffers 207 and 217 which serve to buffer the flow of packets into and out of processor 201.
[0076] Referring again to FIG. 1, there is shown a Packet Management Unit (PMU) 103, which has been described as a unit that offloads the requirement for packet management and accounting from the Stream Processing Unit. This is in particular the unit that has been expanded in FIG. 2, and consists substantially of Input Buffer (IB) 207, Output Buffer (OB) 217, Paging Memory Management Unit (PMMU) 209, Local Packet Memory (LPM) 219, Command Unit (CU) 213, Queueing System (QS) 211, Configuration Registers 221, and Register Transfer Unit (RTU) 227. The communication paths between elements of the PMU are indicated by arrows in FIG. 2, and further description of the elements of the PMU is provided below, including especially QS 211, which is a particular focus of the present patent application.
[0077] Overview of PMU
[0078] Again, FIG. 2 shows the elements of the PMU, which are identified briefly above. Packets arrive at the PMU in the present example through a 16-byte network input interface. In this embodiment packet data arrives at the PMU at a rate of 20 Gbps (max). At an operating speed of 300 MHz XCaliber core frequency, an average of 8 bytes of packet data are received every XCaliber core cycle. The incoming data from the network input interface is buffered in InBuffer (IB) block 207. Network interface 205 within XCaliber has the capability of appending to the packet itself the size of the packet being sent, in the event that the external device has not been able to append the size to the packet before sending the packet. Up to 2 devices can send packet data to XCaliber (at 10 Gbps per device), and two in ports are shown from an attached ASIC. It is to be understood that the existence and use of the particular ASIC is exemplary, and packets could be received from other devices. Further, there may be in some embodiments more or fewer than the two in ports indicated.
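The figure of roughly 8 bytes per core cycle follows directly from the stated line rate and core frequency. The short calculation below is a sanity check only, and assumes that 1 Gbps means 10^9 bits per second.

```python
LINE_RATE_BITS_PER_S = 20_000_000_000   # 20 Gbps maximum input rate
CORE_FREQ_HZ = 300_000_000              # 300 MHz core frequency

bytes_per_second = LINE_RATE_BITS_PER_S / 8
bytes_per_cycle = bytes_per_second / CORE_FREQ_HZ

print(round(bytes_per_cycle, 2))  # prints 8.33
```

At the maximum rate the PMU therefore has to absorb slightly more than 8 bytes per cycle, which fits within one 16-byte interface word per cycle.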
[0079] Packet Memory Manager Unit (PMMU) 209 decides whether each incoming packet has to be stored into on-chip Local Packet Memory (LPM) 219, or, in the case that, for example, no space exists in the LPM to store it, may decide to either send the packet out to an External Packet Memory (EPM) not shown through the SIU block, or may decide to drop the packet. In case the packet is to be stored in the LPM, the PMMU decides where to store the packet and generates all the addresses needed to do so. The addresses generated correspond in a preferred embodiment to 16-byte lines in the LPM, and the packet is consecutively stored in this memory.
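The placement decision and the generation of consecutive 16-byte line addresses can be sketched as follows. This is a deliberately simplified illustration: the free-line counter, the `epm_enabled` switch, and the function names are hypothetical, and the sketch ignores the page and block organization of the LPM.

```python
LINE_BYTES = 16  # addresses correspond to 16-byte lines in the LPM


def lines_needed(packet_bytes):
    """Number of 16-byte lines a packet occupies (ceiling division)."""
    return -(-packet_bytes // LINE_BYTES)


def place_packet(packet_bytes, free_lpm_lines, epm_enabled):
    """Decide where an incoming packet goes: LPM, EPM, or dropped."""
    if lines_needed(packet_bytes) <= free_lpm_lines:
        return "LPM"
    return "EPM" if epm_enabled else "DROP"


def line_addresses(base_line, packet_bytes):
    """Byte addresses of the consecutive lines used to store the packet."""
    return [(base_line + i) * LINE_BYTES
            for i in range(lines_needed(packet_bytes))]
```

For example, a 40-byte packet needs three lines, so with `base_line` 2 it would occupy byte addresses 32, 48, and 64 in this simplified picture.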
[0080] In the (most likely) case that the PMMU does not drop the incoming packet, a packet identifier is created, which includes a pointer (named packetPage) to a fixed-size page in packet memory where the packet has started to be stored. The identifier is created and enqueued into Queuing System (QS) block 211. The QS assigns a number from 0 to 255 (named packetNumber) to each new packet. The QS sorts the identifiers of the packets alive in XCaliber based on the priority of the packets, and it updates the sorting when the SPU core notifies any change on the status of a packet. The QS selects which packet identifiers will be provided next to the SPU. Again, the QS is a particular focus of the present application.
[0081] Register Transfer Unit (RTU) block 227, upon receiving a packet identifier (packetPage and packetNumber) from the QS, searches for an available context (229, FIG. 2) out of 8 contexts that XCaliber features in a preferred embodiment. For architectural and description purposes the contexts are considered a part of a broader Stream Processing Unit, although the contexts are shown in FIG. 2 as a separate unit 229.
[0082] In the case that no context is available, the RTU has the ability to notify the SPU about this event through a set of interrupts. In the case that a context is available, the RTU loads the packet identifier information and some selected fields of the header of the packet into the context, and afterwards it releases the context (which will at that time come under control of the SPU). The RTU accesses the header information of the packet through the SIU, since the packet could have been stored in the off-chip EPM.
[0083] Eventually a stream in the SPU core processes the context and notifies the QS of this fact. There are, in a preferred embodiment, eight streams in the DMS core. The QS then updates the status of the packet (to completed), and eventually this packet is selected for downloading (i.e. the packet data of the corresponding packet is sent out of the XCaliber processor to one of the two external devices).
[0084] When a packet is selected for downloading, the QS sends the packetPage (among other information) to the PMMU block, which generates the corresponding line addresses to read the packet data from the LPM (in case the packet was stored in the on-chip local memory) or it will instruct the SIU to bring the packet from the external packet memory to the PMU. In any case, the lines of packet data read are buffered into the OutBuffer (OB) block, and from there sent out to the device through the 16-byte network output interface. This interface is independent of its input counterpart. The maximum aggregated bandwidth of this interface in a preferred embodiment is also 20 Gbps, 10 Gbps per output device.
[0085] CommandUnit (CU) 213 receives commands sent by SPU 105. A command corresponds to a packet instruction, in many cases a newly defined instruction, dispatched by the SPU core. These commands are divided into three independent types, and the PMU can execute one command per type per cycle (for a total of up to 3 commands per cycle). Commands can be load-like or store-like (depending on whether the PMU provides a response back to the SPU or not, respectively).
[0086] A large number of features of the PMU are configured by the SPU through memory-mapped configuration registers 221. Some such features have to be programmed at boot time, and the rest can be dynamically changed. For some of the latter, the SPU has to be running in a single-thread mode to properly program the functionality of the feature. The CU block manages the update of these configuration registers.
[0087] The PMU provides a mechanism to aid in flow control between ASIC 203 and XCaliber DMS processor 201. Two different interrupts are generated by the PMU to SPU 105 when LPM 219 or QS 211 are becoming full. Software controls how much in advance the interrupt is generated before the corresponding structure becomes completely full. Software can also disable the generation of these interrupts.
[0088] LPM 219 is also memory mapped, and SPU 105 can access it through the conventional load/store mechanism. Both configuration registers 221 and LPM 219 have a starting address (base address) kept by SIU 107. Requests from SPU 105 to LPM 219 and the configuration space arrive to the PMU through SIU block 107. The SIU is also aware of the base address of the external packet memory.
[0089] In Buffer (IB)
[0090] Packet data sent by an external device arrives to the PMU through the network input interface 205 at an average rate of 8 bytes every XCaliber core cycle in a preferred embodiment. IB block 207 of the PMU receives this data, buffers it, and provides it, in a FIFO-like fashion, to LPM 219 and in some cases also to the SIU (in case of a packet overflow, as explained elsewhere in this specification).
[0091] XCaliber DMS processor 201 can potentially send/receive packet data to/from up to 2 independent devices. Each device is tagged in SIU 107 with a device identifier, which is provided along with the packet data. When one device starts sending data from a packet, it will continue to send data from that very same packet until the end of the packet is reached or a bus error is detected by the SIU.
[0092] In a preferred embodiment the first byte of a packet always starts at byte 0 of the first 16 bytes sent of that packet. The first two bytes of the packet specify the size in bytes of the packet (including these first two bytes). These two bytes are always appended by the SIU if the external device has not appended them. If byte k in the 16-byte chunk is a valid byte, bytes 0 . . . k-1 are also valid bytes. This can be guaranteed since the first byte of a packet always starts at byte 0. Note that no valid bits are needed to validate each byte since a packet always starts at byte 0 of the 16-byte chunk, and the size of the packet is known up front (in the first two bytes). The network interface provides, at every core clock, a control bit specifying whether the 16-byte chunk contains, at least, one valid byte.
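The reason that no per-byte valid bits are needed can be illustrated with a minimal sketch: once the two-byte size is known, the number of valid bytes in every 16-byte chunk follows directly. The function names are hypothetical, and the big-endian byte order of the size field is an assumption, since the byte order is not stated in this description.

```python
from typing import List

CHUNK_BYTES = 16  # width of one chunk on the network input interface


def parse_size(first_chunk: bytes) -> int:
    """Read the packet size from the first two bytes of the first chunk.

    Assumption: the size is stored most significant byte first.
    """
    return int.from_bytes(first_chunk[:2], "big")


def valid_bytes_per_chunk(packet_size: int) -> List[int]:
    """Valid byte count of each chunk; bytes 0..k-1 are valid in each."""
    full, rest = divmod(packet_size, CHUNK_BYTES)
    return [CHUNK_BYTES] * full + ([rest] if rest else [])


# A 40-byte packet (size header included) occupies three chunks.
first = bytes([0, 40]) + bytes(14)
print(valid_bytes_per_chunk(parse_size(first)))
```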
[0093] The valid data received from the network input interface is organized in buffer 207. This is an 8-entry buffer, each entry holding the 16-bytes of data plus the control bits associated to each chunk. PMMU 209 looks at the control bits in each entry and determines whether a new packet starts or to which of the (up to) two active packets the data belongs to, and it acts accordingly.
[0094] The 16-byte chunks in each of the entries in IB 207 are stored in LPM 219 or in the EPM (not shown). It is guaranteed by either the LPM controller or the SIU that the bandwidth to write into the packet memory will at least match the bandwidth of the incoming packet data, and that the writing of the incoming packet data into the packet memory will have higher priority over other accesses to the packet memory.
[0095] In some cases IB 207 may get full because PMMU 209 may be stalled, and therefore the LPM will not consume any more data of the IB until the stall is resolved. Whenever the IB gets full, a signal is sent to network input interface 205, which will retransmit the next 16-byte chunk as many times as needed until the IB accepts it. Thus, no packet data is lost due to the IB getting full.
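The back-pressure behavior described above can be modeled, purely as an illustrative sketch, by an 8-entry FIFO whose accept operation reports whether the sender must retransmit. The class and method names are hypothetical.

```python
from collections import deque


class InBuffer:
    """Behavioral sketch of the 8-entry IB; not a cycle-accurate model."""

    DEPTH = 8  # number of 16-byte entries (from the text)

    def __init__(self):
        self.entries = deque()

    def is_full(self):
        return len(self.entries) >= self.DEPTH

    def offer(self, chunk, device_id):
        """Return True if the chunk is accepted.

        False models the full signal: the network input interface
        retransmits the same chunk until it is accepted.
        """
        if self.is_full():
            return False
        self.entries.append((chunk, device_id))
        return True

    def pop_head(self):
        """Consume the head entry (as the PMMU would); None when empty."""
        return self.entries.popleft() if self.entries else None


ib = InBuffer()
accepted = [ib.offer(bytes(16), 0) for _ in range(9)]  # ninth is refused
ib.pop_head()                                          # stall resolved
retry_ok = ib.offer(bytes(16), 0)                      # retransmission accepted
```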
[0096] Out Buffer (OB)
[0097] Network output interface 215 also supports a total aggregated bandwidth of 20 Gbps (10 Gbps per output device), as does the Input Interface. At 300 MHz XCaliber clock frequency, the network output interface accepts on average 8 bytes of data every XCaliber cycle from the OB block, and sends it to one of the two output devices. The network input and output interfaces are completely independent of each other.
[0098] Up to 2 packets (one per output device) can be simultaneously sent. The device to which the packet is sent does not need to correspond to the device that sent the packet in. The packet data to be sent out will come from either LPM 219 or the EPM (not shown).
[0099] For each of the two output devices connected at Network Out interface 215, PMMU 209 can have a packet ready to start being downloaded, a packet being downloaded, or no packet to download. Every cycle PMMU 209 selects the highest-priority packet across both output devices and initiates the download of 16 bytes of data for that packet. Whenever the PMMU is downloading packet data from a packet to an output device, no data from a different packet will be downloaded to the same device until the current packet is completely downloaded.
[0100] The 16-byte chunks of packet data read from LPM 219 (along with some associated control information) are fed into one of the two 8-entry buffers (one per device identifier). The contents of the head of one of these buffers is provided to the network output interface whenever this interface requests it. When the head of both buffers is valid, the OB provides the data in a round robin fashion.
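A minimal behavioral sketch of the two per-device buffers and the round-robin service at their heads follows. The class and method names are hypothetical, and the choice of which device is served first when both heads are valid for the first time is an assumption.

```python
from collections import deque


class OutBuffer:
    """Sketch of the OB: two 8-entry FIFOs, one per device identifier."""

    DEPTH = 8

    def __init__(self):
        self.buffers = [deque(), deque()]
        self.last_served = 1  # assumption: device 0 wins the first tie

    def has_space(self, device):
        return len(self.buffers[device]) < self.DEPTH

    def push(self, device, chunk):
        if not self.has_space(device):
            raise OverflowError("OB buffer full for this device")
        self.buffers[device].append(chunk)

    def next_chunk(self):
        """Return (device, chunk) for the output interface, or None."""
        valid = [d for d in (0, 1) if self.buffers[d]]
        if not valid:
            return None
        # Alternate between devices only when both heads are valid.
        device = 1 - self.last_served if len(valid) == 2 else valid[0]
        self.last_served = device
        return device, self.buffers[device].popleft()


ob = OutBuffer()
ob.push(0, "a")
ob.push(0, "b")
ob.push(1, "x")
order = [ob.next_chunk() for _ in range(4)]
```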
[0101] Unlike for the network input interface, for a 16-byte chunk sent to the network output interface it cannot be guaranteed that if a byte k is valid, then bytes 0 . . . k-1 are valid as well. The reason for this is that when the packet is being sent out, it does not need to start at byte 0 of the 16-byte chunk in memory. Thus, for each 16-byte chunk of data that contains the start of the packet to be sent out, OB 217 needs to notify the network interface where the first valid byte of the chunk resides. Moreover, since the first two bytes of the packet contain the size of the packet in bytes, the network output interface has the information to figure out where the last valid byte of the packet resides within the last 16-byte chunk of data for that packet. Moreover, OB 217 also provides a control bit that informs SIU 107 whether it needs to compute CRC for the packet, and if so, which type of CRC. This control bit is provided by PMMU 209 to OB 217.
[0102] Paging Memory Management Unit (PMMU)
[0103] The packet memory address space is 16 MB. Out of the 16 MB, the XCaliber processor features 256 KB on-chip. The rest (or a fraction) is implemented using external storage.
[0104] The packet memory address space can be mapped in the TLB of SPU 105 as user or kernel space, and as cachable or uncachable. In case it is mapped cachable, the packet memory space is cached (write-through) into an L1 data cache of SPU 105, but not into an L2 cache.
[0105] A goal of PMMU 209 is to store incoming packets (and SPU-generated packets as well) into the packet memory. In case a packet from the network input interface fits into LPM 219, PMMU 209 decides where to store it and generates the necessary write accesses to LPM 219; in case the packet from the network input interface is going to be stored in the EPM, SPU 105 decides where in the EPM the packet needs to be stored and SIU 107 is in charge of storing the packet. In either case, the packet is consecutively stored and a packet identifier is created by PMMU 209 and sent to QS 211.
[0106] SPU 105 can configure LPM 219 so packets larger than a given size will never be stored in the LPM. Such packets, as well as packets that do not fit into the LPM because of lack of space, are sent by PMMU 209 to the EPM through SIU 107. This mechanism is called overflow, and the PMU performs it only if configured by the SPU to do so. If no overflow of packets is allowed, then the packet is dropped. In this case, PMMU 209 interrupts the SPU (again, if configured to do so).
[0107] Uploading a Packet Into Packet Memory
[0108] Whenever there is valid data at the head of IB 207, the corresponding device identifier bit is used to determine to which packet (out of the two possible packets being received) the data belongs. When the network input interface starts sending data of a new packet with device identifier d, all the rest of the data will eventually arrive with that same device identifier d unless an error is notified by the network interface block. The network input interface can interleave data from two different device identifiers, but in a given cycle only data from one device is received by IB 207.
[0109] When a packet needs to be stored into LPM 219, PMMU block 209 generates all the write addresses and write strobes to LPM 219. If the packet needs to be stored into the EPM, SIU 107 generates them.
[0110] FIG. 3 is a diagram illustrating uploading of data into either LPM 219 or the EPM, which is shown in FIG. 3 as element 305, but not shown in FIG. 2. The write strobe to the LPM or EPM will not be generated unless the head of the IB has valid data. Whenever the write strobe is generated, the 16-byte chunk of data at the head of the IB (which corresponds to a LPM line) is deleted from the IB and stored in the LPM or EPM. The device identifier bit of the head of the IB is used to select the correct write address out of the 2 address generators (one per input device).
[0111] In the current embodiment only one incoming packet can be simultaneously stored in the EPM by the SIU (i.e. only one overflow packet can be handled by the SIU at a time). Therefore, if a second packet that needs to be overflowed is sent by the network input interface, the data of this packet will be thrown away (i.e. the packet will be dropped).
[0112] A Two Byte Packet-size Header
[0113] The network input interface always appends two bytes to a packet received from the external device (unless this external device already does so, in which case the SIU will be programmed not to append them). This appended data indicates the size in bytes of the total packet, including the two appended bytes. Thus, the maximum size of a packet that is processed by the XCaliber DMS processor is 65535 bytes including the first two bytes.
[0114] The network output interface expects that, when the packet is returned by the PMU (if not dropped during its processing), the first two bytes also indicate the size of the processed packet. The size of the original packet can change (the packet can increase or shrink) as a result of processing performed by the XCaliber processor. Thus, if the processing results in increasing the size beyond 64 K-1 bytes, it is the responsibility of software to chop the packet into two different smaller packets.
[0115] The PMU is more efficient when the priority of the packet being received is known up front. The third byte of the packet will be used for priority purposes if the external device is capable of providing this information to the PMU. The software programs the PMU to either use the information in this byte or not, which it does through a boot-time configuration register named Log2InQueues.
[0116] Dropping a Packet
[0117] A packet completely stored in either LPM 219 or EPM 305 will be dropped only if SPU 105 sends an explicit command to the PMU to do so. No automatic dropping of packets already stored in the packet memory can occur. In other words, any dropping algorithm of packets received by the XCaliber DMS processor is implemented in software.
[0118] There are, however, several situations wherein the PMU may drop an incoming packet. These are (a) The packet does not fit in the LPM and the overflow of packets is disabled, (b) The total amount of bytes received for the packet is not the same as the number of bytes specified by the ASIC in the first two bytes of the ASIC-specific header, or (c) A transmission error has occurred between the external device and the network input interface block of the SIU. The PMMU block is notified about such an error.
[0119] For each of the cases (a), (b) and (c) above, an interrupt is generated to the SPU. The software can disable the generation of these interrupts using the AutomaticPacketDropIntEnable and PacketErrorIntEnable on-the-fly configuration flags.
[0120] Virtual Pages
[0121] An important process of PMMU 209 is to provide an efficient way to consecutively store packets into LPM 219 with as little memory fragmentation as possible. The architecture in the preferred embodiment provides SPU 105 with a capability of grouping, as much as possible, packets of similar size in the same region of LPM 219. This reduces overall memory fragmentation.
[0122] To implement the low-fragmentation feature, LPM 219 is logically divided into 4 blocks of 64 KB each. Each block is divided into fixed atomic pages of 256 bytes. However, every block has virtual pages that range from 256 bytes up to 64 KB, in power-of-2 increments. Software can enable/disable the different sizes of the virtual pages for each of the 4 blocks using an on-the-fly configuration register named VirtualPageEnable. This allows configuring some blocks to store packets of up to a certain size.
[0123] The organization and features of the PMU assure that a packet of size s will never be stored in a block with a maximum virtual page size less than s. However, a block with a minimum virtual page size of r will accept packets of size smaller than r. This will usually happen, for example, when another block or blocks are configured to store these smaller packets but are full.
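As an illustrative sketch of the two rules just stated, the following enumerates the nine power-of-2 virtual page sizes of a block and finds the smallest enabled size that can hold a packet. Representing the enabled sizes of a block as a Python set is an assumption made for illustration; the layout of the VirtualPageEnable register is not described here, and the function name is hypothetical.

```python
ATOMIC_PAGE_BYTES = 256
BLOCK_BYTES = 64 * 1024

# 256 bytes up to 64 KB in power-of-2 increments: nine sizes in total.
VP_SIZES = [ATOMIC_PAGE_BYTES << i for i in range(9)]


def smallest_enabled_page(packet_size, enabled_sizes):
    """Smallest enabled virtual page size that holds the packet, or None.

    None corresponds to a block whose maximum enabled virtual page
    size is less than the packet size, so the packet is never stored there.
    """
    for size in VP_SIZES:
        if size in enabled_sizes and size >= packet_size:
            return size
    return None


# A block with a minimum enabled page of 1 KB still accepts a 100-byte packet.
small = smallest_enabled_page(100, {1024, 2048})
# A 3000-byte packet exceeds the maximum enabled page of that block.
too_big = smallest_enabled_page(3000, {1024, 2048})
```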
[0124] Software can get ownership of any of the four blocks of the LPM, which implies that the corresponding 64 KB of memory will become software managed. A configuration flag exists per block (SoftwareOwned) for this purpose. The PMMU block will not store any incoming packet from the network input interface into a block in the LPM with the associated SoftwareOwned flag asserted. Similarly, the PMMU will not satisfy a GetSpace operation (described elsewhere) with memory of a block with its SoftwareOwned flag asserted. The PMMU, however, is able to download any packet stored by software in a software-owned block.
[0125] The PMMU logic determines whether an incoming packet fits in any of the blocks of the LPM. If a packet fits, the PMMU decides in which of the four blocks (since the packet may fit in more than one block), and the first and last atomic page that the packet will use in the selected block. The atomic pages are allocated for the incoming packet. When packet data stored in an atomic page has been safely sent out of the XCaliber processor through the network output interface, the corresponding space in the LPM can be de-allocated (i.e. made available for other incoming packets).
[0126] The EPM, like the LPM, is also logically divided into atomic pages of 256 bytes. However, the PMMU does not maintain the allocation status of these pages. The allocation status of these pages is managed by software. Regardless of where the packet is stored, the PMMU generates an offset (in atomic pages) within the packet memory to where the first data of the packet is stored. This offset is named henceforth packetPage. Since the maximum size of the packet memory is 16 MB, the packetPage is a 16-bit value.
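The 16-bit width of the packetPage follows from dividing the 16 MB packet memory into 256-byte atomic pages, as the sketch below illustrates. The function names are hypothetical; the placement of the LPM in the first 1 K atomic pages of the packet memory is taken from the discussion of OverflowAddress elsewhere in this specification.

```python
PACKET_MEMORY_BYTES = 16 * 1024 * 1024  # 16 MB packet memory address space
ATOMIC_PAGE_BYTES = 256
LPM_BYTES = 256 * 1024                  # 256 KB on-chip


def packet_page(byte_offset):
    """Atomic page number of a byte offset within the packet memory."""
    if not 0 <= byte_offset < PACKET_MEMORY_BYTES:
        raise ValueError("offset outside the packet memory")
    return byte_offset >> 8  # divide by the 256-byte atomic page size


def is_in_lpm(page):
    """True if the atomic page lies in the first 1 K pages (the LPM)."""
    return page < LPM_BYTES // ATOMIC_PAGE_BYTES


total_pages = PACKET_MEMORY_BYTES // ATOMIC_PAGE_BYTES  # 65536 = 2**16
highest_page = packet_page(PACKET_MEMORY_BYTES - 1)     # fits in 16 bits
```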
[0127] As soon as the PMMU safely stores the packet in the LPM, or receives acknowledgement from SIU 107 that the last byte of the packet has been safely stored in the EPM, the packetPage created for that packet is sent to the QS. Operations of the QS are described in enabling detail below.
[0128] Generating the PacketPage Offset
[0129] The PMMU always monitors the device identifier (deviceId) associated to the packet data at the head of the IB. If the deviceId is not currently active (i.e. the previous packet sent by that deviceId has been completely received), that indicates that the head of the IB contains the first data of a new packet. In this case, the first two bytes (byte 0 and byte 1 in the 16-byte chunk) specify the size of the packet in bytes. With the information of the size of the new incoming packet, the PMMU determines whether the packet fits into LPM 219 and, if it does, in which of the four blocks it will be stored, plus the starting and ending atomic pages within that block.
[0130] The required throughput in the current embodiment of the PMMU to determine whether a packet fits in LPM 219 and, if so, which atomic pages are needed, is one packet every two cycles. One possible two-cycle implementation is as follows: (a) The determination happens in one cycle, and only one determination happens at a time. (b) In the cycle following the determination, the atomic pages needed to store the packet are allocated and the new state (allocated/de-allocated) of the virtual pages is computed. In this cycle, no determination is allowed.
[0131] FIG. 4a is a diagram illustrating determination and allocation in parallel for local packet memory. The determination logic is performed in parallel for all of the four 64 KB blocks as shown.
[0132] FIG. 4b shows the state that needs to be maintained for each of the four 64 KB blocks. This state, named AllocationMatrix, is recomputed every time one or more atomic pages are allocated or de-allocated, and it is an input for the determination logic. The FitsVector and IndexVector contain information computed from the AllocationMatrix.
[0133] AllocationMatrix[VPSize][VPIndex] indicates whether virtual page number VPIndex of size VPSize in bytes is already allocated or not. FitsVector[VPSize] indicates whether the block has at least one non-allocated virtual page of size VPSize. If FitsVector[VPSize] is asserted, IndexVector[VPSize] vector contains the index of a non-allocated virtual page of size VPSize.
[0134] The SPU programs which virtual page sizes are enabled for each of the blocks. The EnableVector[VPSize] contains this information. This configuration is performed using the VirtualPageEnable on-the-fly configuration register. Note that the AllocationMatrix[ ] [ ], FitsVector[ ], IndexVector[ ] and EnableVector[ ] are don't cares if the corresponding SoftwareOwned flag is asserted.
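One way to picture the per-block state is sketched below: the AllocationMatrix is derived from the allocation status of the atomic pages, and the FitsVector and IndexVector are computed from it. Dictionaries keyed by virtual page size are an illustrative choice and the function name is hypothetical. The example uses a small 2 KB block for brevity, and the choice of the largest free index follows the description given elsewhere in this specification.

```python
ATOMIC_PAGE_BYTES = 256


def build_state(atomic_alloc):
    """Derive (AllocationMatrix, FitsVector, IndexVector) for one block.

    atomic_alloc holds one boolean per atomic page (True = allocated);
    its length is assumed to be a power of two.
    """
    matrix = {ATOMIC_PAGE_BYTES: list(atomic_alloc)}
    size = ATOMIC_PAGE_BYTES
    while len(matrix[size]) > 1:
        prev = matrix[size]
        # A larger virtual page is allocated if either half is allocated.
        matrix[size * 2] = [prev[i] or prev[i + 1]
                            for i in range(0, len(prev), 2)]
        size *= 2
    fits = {s: not all(row) for s, row in matrix.items()}
    # Largest non-allocated index per size; undefined when nothing fits.
    index = {s: max(i for i, used in enumerate(row) if not used)
             for s, row in matrix.items() if fits[s]}
    return matrix, fits, index


# 2 KB example block in which only atomic page 0 is allocated.
matrix, fits, index = build_state([True] + [False] * 7)
```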
[0135] In this example the algorithm for the determination logic (for a packet of size s bytes) is as follows:
[0136] 1) Fits logic: check, for each of the blocks, whether the packet fits in or not. If it fits, remember the virtual page size and the number of the first virtual page of that size.
For All Block j Do (can be done in parallel):
    Fits[j] = (s <= VPSize) AND FitsVector[VPSize] AND Not SoftwareOwned
        where VPSize is the smallest possible page size.
    If (Fits[j])
        VPIndex[j] = IndexVector[VPSize]
        MinVPS[j] = VPSize
    Else
        MinVPS[j] = <Infinity>
[0137] 2) Block selection: the blocks with the smallest virtual page (enabled or not) that is able to fit the packet in are candidates. The block with the smallest enabled virtual page is selected.
If Fits[j] = FALSE for all j Then
    <Packet does not fit in LPM>
    packetPage = OverflowAddress >> 8
Else
    C = set of blocks with smallest MinVPS AND Fits[MinVPS]
    B = block# in C with the smallest enabled virtual page
        (if more than one exists, pick the smallest block number)
    If one or more blocks in C have virtual pages enabled Then
        Index = VPIndex[B]
        VPSize = MinVPS[B]
        NumAPs = ceil(s/256)
        packetPage = (B*64KB + Index*VPSize) >> 8
    Else
        <Packet does not fit in LPM>
        packetPage = OverflowAddress >> 8
[0138] If the packet fits in the LPM, the packetPage created is then the atomic page number within the LPM (there are up to 1 K different atomic pages in the LPM) into which the first data of the packet is stored. If the packet does not fit, then the packetPage is the contents of the configuration register OverflowAddress right-shifted 8 bits. The packet overflow mechanism is described elsewhere in this specification, with a subheader "Packet overflow".
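The determination logic can be sketched in software as follows. This is one possible reading of the pseudocode, not a definitive implementation: it assumes that VPSize is the smallest power-of-2 page size able to hold the packet, that a block is a candidate only if it has an enabled virtual page size of at least VPSize, and that among candidates the block whose smallest such enabled size is lowest is selected, with ties resolved toward the smallest block number. The dictionary layout of a block and the function names are hypothetical.

```python
ATOMIC_PAGE_BYTES = 256
BLOCK_BYTES = 64 * 1024


def min_page_size(s):
    """Smallest power-of-2 virtual page size that holds s bytes."""
    size = ATOMIC_PAGE_BYTES
    while size < s:
        size *= 2
    return size


def determine_packet_page(s, blocks, overflow_address):
    """Return (packetPage, block number), block number None on overflow."""
    vps = min_page_size(s)
    candidates = []
    for j, blk in enumerate(blocks):
        # Fits logic: a free page of size vps exists and block is PMU-owned.
        if blk["software_owned"] or not blk["fits_vector"].get(vps, False):
            continue
        usable = [e for e in blk["enabled"] if e >= vps]
        if usable:
            candidates.append((min(usable), j))
    if not candidates:
        return overflow_address >> 8, None  # packet does not fit in LPM
    _, b = min(candidates)                  # ties go to the lowest block
    idx = blocks[b]["index_vector"][vps]
    return (b * BLOCK_BYTES + idx * vps) >> 8, b


blocks = [
    {"software_owned": False, "enabled": {1024},
     "fits_vector": {512: True}, "index_vector": {512: 3}},
    {"software_owned": False, "enabled": {512},
     "fits_vector": {512: True}, "index_vector": {512: 5}},
]
page, block = determine_packet_page(300, blocks, 0x040000)
```

In this example both blocks can hold the 300-byte packet in a 512-byte virtual page, and block 1 is chosen because its smallest usable enabled page (512 bytes) is smaller than that of block 0 (1 KB).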
[0139] In the cycle following the determination of where the packet will be stored, the new values of the AllocationMatrix, FitsVector and IndexVector must be recomputed for the selected block. If FitsVector[VPSize] is asserted, then IndexVector[VPSize] is the index of the largest non-allocated virtual page possible for the corresponding virtual page size. If FitsVector[VPSize] is de-asserted, then IndexVector[VPSize] is undefined.
[0140] The number of atomic pages needed to store the packet is calculated (NumAPs) and the corresponding atomic pages are allocated. The allocation of the atomic pages for the selected block (B) is done as follows:
[0141] 1. The allocation status of the atomic pages in AllocationMatrix[APsize][j . . . k], j being the first atomic page and k the last one (k-j+1=NumAPs), are set to allocated.
[0142] 2. The allocation status of the virtual pages in AllocationMatrix[r][s] are updated following the mesh structure in FIG. 4b (a 2^(k+1)-byte virtual page will be allocated if any of the two 2^k-byte virtual pages that it is composed of is allocated).
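The two allocation steps above can be sketched as follows for a small example block. The representation of the AllocationMatrix as a dictionary of boolean lists keyed by page size, and the function name, are assumptions for illustration only.

```python
def allocate(matrix, sizes, first_ap, num_aps):
    """Mark atomic pages first_ap..first_ap+num_aps-1 allocated, then
    propagate the status upward through the virtual page sizes.

    sizes lists the page sizes in ascending order; matrix[size] has one
    boolean per virtual page of that size (True = allocated).
    """
    # Step 1: set the atomic pages j..k to allocated.
    for ap in range(first_ap, first_ap + num_aps):
        matrix[sizes[0]][ap] = True
    # Step 2: a 2^(k+1)-byte page is allocated if either 2^k-byte half is.
    for small, large in zip(sizes, sizes[1:]):
        for i in range(len(matrix[large])):
            matrix[large][i] = matrix[small][2 * i] or matrix[small][2 * i + 1]


# 2 KB example block: eight atomic pages, initially all free.
sizes = [256, 512, 1024, 2048]
matrix = {size: [False] * (2048 // size) for size in sizes}
allocate(matrix, sizes, first_ap=6, num_aps=2)  # a 512-byte packet
```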
[0143] When the packetPage has been generated, it is sent to the QS for enqueueing. If the QS is full (very rare), it will not be able to accept the packetPage being provided by the PMMU. In this case, the PMMU will not be able to generate a new packetPage for the next new packet. This puts pressure on the IB, which might get full if the QS remains full for several cycles.
[0144] The PMMU block also sends the queue number into which the QS has to store the packetPage. How the PMMU generates this queue number is described below in sections specifically allocated to the QS.
[0145] Page Allocation Example
[0146] FIGS. 5a and 5b illustrate an example of how atomic pages are allocated. For simplicity, the example assumes 2 blocks (0 and 1) of 2 KB each, with an Atomic page size of 256 bytes, and both blocks have their SoftwareOwned flag de-asserted. Single and double cross-hatched areas represent allocated virtual pages (single cross-hatched pages correspond to the pages being allocated in the current cycle). The example shows how the pages get allocated for a sequence of packet sizes of 256, 512, 1 K and 512 bytes. Note that, after this sequence, a 2 K-byte packet, for example, will not fit in the example LPM.
[0147] Whenever the FitsVector[VPSize] is asserted, the IndexVector[VPSize] contains the largest non-allocated virtual page index for virtual page size VPSize. The reason for choosing the largest index is that the memory space is better utilized. This is shown in FIGS. 6a and 6b, where two 256-byte packets are stored in a block. In scenario A, the 256-byte virtual page is randomly chosen, whereas in scenario B, the largest index is always chosen. As can be seen, the block in scenario A only allows two 512-byte virtual pages, whereas the block in scenario B allows three. Both, however, allow the same number of 256-byte packets since this is the smallest allocation unit. Note that the same effect is obtained by choosing the smallest virtual page index number all the time.
[0148] Packet Overflow
[0149] The only two reasons why a packet cannot be stored in the LPM are (a) that the size of the packet is larger than the maximum virtual page enabled across all 4 blocks; or (b) that the size of the packet is smaller than or equal to the maximum virtual page enabled but no space could be found in the LPM.
[0150] When a packet does not fit into the LPM, the PMMU will overflow the packet through the SIU into the EPM. To do so, the PMMU provides the initial address to the SIU (16-byte offset within the packet memory) to where the packet will be stored. This 20-bit address is obtained as follows: (a) The 16 MSB bits correspond to the 16 MSB bits of the OverflowAddress configuration register (i.e. the atomic page number within the packet memory). (b) The 4 LSB bits correspond to the HeaderGrowthOffset configuration register. The packetPage value (which will be sent to the QS) for this overflowed packet is then the 16 MSB bits of the OverflowAddress configuration register.
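The composition of the 20-bit address can be sketched as below. The sketch assumes that OverflowAddress is a 24-bit byte address within the 16 MB packet memory and that HeaderGrowthOffset holds a count of 16-byte chunks (0 to 15); neither encoding is spelled out in this paragraph, and the function name is hypothetical.

```python
def overflow_start(overflow_address, header_growth_offset):
    """Return (20-bit line address, packetPage) for an overflowed packet.

    overflow_address: assumed 24-bit byte address in the packet memory.
    header_growth_offset: assumed number of 16-byte chunks, 0..15.
    """
    packet_page = (overflow_address >> 8) & 0xFFFF  # 16 MSBs: atomic page
    line_in_page = header_growth_offset & 0xF       # 4 LSBs: line in page
    line_address = (packet_page << 4) | line_in_page
    return line_address, packet_page


line_address, page = overflow_start(0x123400, 3)
```

Because an atomic page of 256 bytes contains sixteen 16-byte lines, concatenating the page number with a 4-bit line number yields a 16-byte offset within the packet memory.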
[0151] If the on-the-fly configuration flag OverflowEnable is asserted, the PMMU will generate an OverflowStartedInt interrupt. When the OverflowStartedInt interrupt is generated, the size in bytes of the packet to overflow is written by the PMMU into the SPU-read-only configuration register SizeOfOverflowedPacket. At this point, the PMMU sets an internal lock flag that will prevent a new packet from overflowing. This lock flag is reset when the software writes into the on-the-fly configuration register OverflowAddress. If a packet needs to be overflowed but the lock flag is set, the packet will be dropped.
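The interaction between the lock flag and the OverflowAddress register can be sketched as a small state machine. The class and method names are hypothetical, and treating a de-asserted OverflowEnable flag as causing the packet to be dropped is an interpretation based on the drop conditions listed elsewhere in this specification.

```python
class OverflowControl:
    """Behavioral sketch of the PMMU overflow lock; names are illustrative."""

    def __init__(self, overflow_enable=True):
        self.overflow_enable = overflow_enable
        self.locked = False
        self.size_of_overflowed_packet = 0
        self.overflow_address = 0

    def packet_needs_overflow(self, size):
        """Return 'overflow' or 'drop' for a packet that misses the LPM."""
        if not self.overflow_enable or self.locked:
            return "drop"
        # OverflowStartedInt would be raised to the SPU at this point.
        self.size_of_overflowed_packet = size
        self.locked = True
        return "overflow"

    def write_overflow_address(self, address):
        """Software write to OverflowAddress; this resets the lock flag."""
        self.overflow_address = address
        self.locked = False


ctl = OverflowControl()
results = [ctl.packet_needs_overflow(1500), ctl.packet_needs_overflow(900)]
ctl.write_overflow_address(0x050000)
results.append(ctl.packet_needs_overflow(900))
```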
[0152] With this mechanism, it is guaranteed that only one interrupt will be generated and serviced per packet that is overflowed. This also creates a platform for software to decide the starting address into which the next packet that will be overflowed will be stored, since the size of the packet being overflowed is visible to the interrupt service routine through the SizeOfOverflowedPacket register. In other words, software manages the EPM.
[0153] If software writes the OverflowAddress multiple times in between two OverflowStartedInt interrupts, the results are undefined. Moreover, if software sets the 16 MSB bits of OverflowAddress to 0 . . . 1023, results are also undefined since the first 1 K atomic pages in the packet memory correspond to the LPM.
[0154] Downloading a Packet From Packet Memory
[0155] Eventually the SPU will complete the processing of a packet and will inform the QS of the fact. At this point the packet may be downloaded from memory, either LPM or EPM, and sent, via the OB to one of the connected devices. FIG. 7 is a top-level schematic of the blocks of the XCaliber DMS processor involved in the downloading of a packet, and the elements in FIG. 7 are numbered the same as in FIG. 2. The downloading process may be followed in FIG. 7 with the aid of the following descriptions.
[0156] When QS 211 is informed that processing of a packet is complete, the QS marks this packet as completed and, a few cycles later (depending on the priority of the packet), the QS provides to PMMU 209 (as long as the PMMU has requested it) the following information regarding the packet:
[0157] (a) the packetPage
[0158] (b) the priority (cluster number from which it was extracted)
[0159] (c) the tail growth/shrink information (described later in spec)
[0160] (d) the outbound device identifier bit
[0161] (e) the CRC type field (described later in spec)
[0162] (f) the KeepSpace bit
[0163] The device identifier sent to PMMU block 209 is a 1-bit value that specifies the external device to which the packet will be sent. This outbound device identifier is provided by software to QS 211 as a 2-bit value.
[0164] If the packet was stored in LPM 219, PMMU 209 generates all of the (16-byte line) read addresses and read strobes to LPM 219. The read strobes are generated as soon as the read address is computed and there is enough space in OB 217 to buffer the line read from LPM 219. Buffer d in the OB is associated to device identifier d. This buffer may become full for either of two reasons: (a) The external device d temporarily does not accept data from XCaliber; or (b) The rate of reading data from the OB is lower than the rate of writing data into it.
[0165] As soon as the packet data within an atomic page has all been downloaded and sent to the OB, that atomic page can be de-allocated. The de-allocation of one or more atomic pages follows the same procedure as described above. However, no de-allocation of atomic pages occurs if the LPM bit is de-asserted. The KeepSpace bit is a don't care if the packet resides in EPM 701.
[0166] If the packet was stored in EPM 701, PMMU 209 provides to SIU 107 the address within the EPM where the first byte of the packet resides. The SIU performs the downloading of the packet from the EPM. The SIU also monitors the buffer space in the corresponding buffer in OB 217 to determine whether it has space to write the 16-byte chunk read from EPM 701. When the packet is fully downloaded, the SIU informs the PMMU of the fact so that the PMMU can download the next packet with the same device identifier.
[0167] When two packets (one per device) are being simultaneously sent, data from the packet with highest priority is read out of the memory first. This preemption can happen at a 16-byte boundary or when the packet finishes its transmission. If both packets have the same priority (provided by the QS), a round-robin method is used to select the packet from which data will be downloaded next. This selection logic also takes into account how full the two buffers in the OB are. If buffer d is full, for example, no packet with a device identifier d will be selected in the PMMU for downloading the next 16-byte chunk of data.
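A minimal sketch of this per-chunk selection follows. The function name and argument layout are hypothetical, and the convention that a larger number denotes a higher priority is an assumption, since the encoding of priority is not given in this paragraph.

```python
def select_device(ready, ob_full, last_served):
    """Pick the device whose packet supplies the next 16-byte chunk.

    ready: dict mapping device id (0 or 1) to the priority of the packet
           being downloaded for it (assumption: larger = higher priority).
    ob_full: two booleans, True if the OB buffer of that device is full.
    last_served: device served on the previous selection (for round robin).
    """
    eligible = {d: p for d, p in ready.items() if not ob_full[d]}
    if not eligible:
        return None
    best = max(eligible.values())
    top = [d for d, p in eligible.items() if p == best]
    if len(top) == 1:
        return top[0]
    return 1 - last_served  # equal priority: alternate between devices


higher_wins = select_device({0: 1, 1: 3}, [False, False], last_served=0)
tie_break = select_device({0: 2, 1: 2}, [False, False], last_served=1)
full_skipped = select_device({0: 1, 1: 3}, [False, True], last_served=0)
```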
[0168] When a packet starts to be downloaded from the packet memory (local or external), the PMMU knows where the first valid byte of the packet resides. However, the packet's size is not known until the first line (or the first two lines in some cases) of packet data is read from the packet memory, since the size of the packet resides in the first two bytes of the packet data. Therefore, the process of downloading a packet first generates the necessary line addresses to determine the size of the packet, and then, if needed, generates the rest of the accesses.
[0169] This logic takes into account that the first two bytes that specify the size of the packet can reside in any position in the 16-byte line of data. A particular case is when the first two bytes span two consecutive lines (which will occur when the first byte is the 16th byte of a line, and the second byte is the 1st byte of the next line).
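The line-spanning case just described can be illustrated with a minimal sketch. The function name and the use of a flat byte address are assumptions made for illustration only; the text does not define the PMMU logic at this level.

```python
LINE_BYTES = 16  # line width of the packet memory, as stated in the text


def lines_needed_for_size(first_byte_addr: int) -> int:
    """Number of 16-byte lines to read before both size bytes are available.

    first_byte_addr is a hypothetical flat byte address of the first valid
    byte of the packet.
    """
    offset_in_line = first_byte_addr % LINE_BYTES
    # The 2-byte size field straddles two lines only when its first byte
    # is the last (16th) byte of a line.
    return 2 if offset_in_line == LINE_BYTES - 1 else 1
```

In this sketch only an offset of 15 within the line forces a second line read before the size is known.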
[0170] As soon as the PMMU finishes downloading a packet (all the data of that packet has been read from packet memory and sent to OB), the PMMU notifies the QS of this event. The QS then invalidates the corresponding packet from its queuing system.
[0171] When a packet starts to be downloaded, it cannot be preempted, i.e. the packet will finish its transmission. Other packets that become ready to be downloaded with the same outbound device identifier while the previous packet is being transmitted cannot be transmitted until the previous packet is fully transmitted.
[0172] Packet Growth/Shrink
[0173] As a result of processing a packet, the size of a network packet can grow, shrink or remain the same size. If the size varies, the SPU has to write the new size of the packet in the same first two bytes of the packet. The phenomenon of packet growth and shrink is illustrated in FIG. 8.
[0174] Both the header and the tail of the packet can grow or shrink. When a packet grows, the added data can overwrite the data of another packet that may have been stored right above the packet experiencing header growth, or that was stored right below in the case of tail growth. To avoid this problem the PMU can be configured so that an empty space is allocated at the front and at the end of every packet when it is stored in the packet memory. These empty spaces are specified with HeaderGrowthOffset and TailGrowthOffset boot-time configuration registers, respectively, and their granularity is 16 bytes. The maximum HeaderGrowthOffset is 240 bytes (15 16-byte chunks), and the maximum TailGrowthOffset is 1008 bytes (63 16-byte chunks). The minimum in both cases is 0 bytes. Note that these growth offsets apply to all incoming packets, that is, there is no mechanism to apply different growth offsets to different packets.
[0175] When the PMMU searches for space in the LPM, it will look for contiguous space of Size(packet)+((HeaderGrowthOffset+TailGrowthOffset)<<4). Thus, the first byte of the packet (first byte of the ASIC-specific header) will really start at offset ((packetPage<<8)+(HeaderGrowthOffset<<4)) within the packet memory.
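As a worked illustration of the two expressions above, the following sketch evaluates them directly. The function names are hypothetical; the shift amounts and the maximum offsets (15 and 63 chunks of 16 bytes) are taken from the text.

```python
ATOMIC_PAGE_SHIFT = 8   # packetPage << 8: atomic pages are 256 bytes
CHUNK_SHIFT = 4         # growth offsets are counted in 16-byte chunks
MAX_HEADER_CHUNKS = 15  # 240 bytes
MAX_TAIL_CHUNKS = 63    # 1008 bytes


def required_space(packet_size: int, header_growth_offset: int,
                   tail_growth_offset: int) -> int:
    """Contiguous bytes searched for: Size(packet) plus both growth areas."""
    if not 0 <= header_growth_offset <= MAX_HEADER_CHUNKS:
        raise ValueError("HeaderGrowthOffset out of range")
    if not 0 <= tail_growth_offset <= MAX_TAIL_CHUNKS:
        raise ValueError("TailGrowthOffset out of range")
    return packet_size + ((header_growth_offset + tail_growth_offset) << CHUNK_SHIFT)


def packet_start_offset(packet_page: int, header_growth_offset: int) -> int:
    """Byte offset in packet memory of the first byte of the packet."""
    return (packet_page << ATOMIC_PAGE_SHIFT) + (header_growth_offset << CHUNK_SHIFT)
```

For example, a 100-byte packet with HeaderGrowthOffset of 2 and TailGrowthOffset of 3 needs 180 contiguous bytes, and if it is given packetPage 3 its first byte sits at offset 800.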
[0176] The software knows what the default offsets are, and, therefore, knows how much the packet can safely grow at both the head and the tail. In case the packet needs to grow more than the maximum offsets, the software has to explicitly move the packet to a new location in the packet memory. The steps to do this are as follows:
[0177] 1) The software requests the PMU for a chunk of contiguous space of the new size. The PMU will return a new packetPage that identifies (points to) this new space.
[0178] 2) The software writes the data into the new memory space.
[0179] 3) The software renames the old packetPage with the new packetPage.
[0180] 4) The software requests the PMU to de-allocate the space associated to the old packetPage.
[0181] In the case of header growth or shrinkage, the packet data will no longer start at ((packetPage<<8)+(HeaderGrowthOffset<<4)). The new starting location is provided to the PMU with a special instruction executed by the SPU when the processing of the packet is completed. This information is provided to the PMMU by the QS block.
[0182] Time Stamp
[0183] The QS block of the PMU (described in detail in a following section) guarantees the order of the incoming packets by keeping the packetPage identifiers of the packets in process in the XCaliber processor in FIFO-like queues. However, software may break this ordering by explicitly extracting identifiers from the QS, and inserting them at the tail of any of the queues.
[0184] To help software in guaranteeing the relative order of packets, the PMU can be configured to time stamp every packet that arrives to the PMMU block using an on-the-fly configuration flag TimeStampEnabled. The time stamp is an 8-byte value, obtained from a 64-bit counter that is incremented every core clock cycle.
[0185] When the time stamp feature is on, the PMMU places the 8-byte time stamp value in front of each packet, and the time stamp is stripped off when the packet is sent to the network output interface. The time stamp value always occupies the 8 MSB bytes of the (k-1)th 16-byte chunk of the packet memory, where k is the 16-byte line offset where the data of the packet starts (k>0). In the case that HeaderGrowthOffset is 0, the time stamp value will not be added, even if TimeStampEnabled is asserted.
[0186] The full 64-bit time counter value is provided to software through a read-only configuration register (TimeCounter).
[0187] Software Operations on the PMMU
[0188] Software has access to the PMMU to request or free a chunk of contiguous space. In particular, there are two operations that software can perform on the PMMU. Firstly the software, through an operation GetSpace(size), may try to find a contiguous space in the LPM for size bytes. The PMU replies with the atomic page number where the contiguous space that has been found starts (i.e. the packetPage), and a success bit. If the PMU was able to find space, the success bit is set to `1`, otherwise it is set to `0`. GetSpace will not be satisfied with memory of a block that has its SoftwareOwned configuration bit asserted. Thus, software explicitly manages the memory space of software-owned LPM blocks.
[0189] The PMMU allocates the atomic pages needed for the requested space. The EnableVector set of bits used in the allocation of atomic pages for incoming packets is a don't care for the GetSpace operation. In other words, as long as sufficient consecutive non-allocated atomic pages exist in a particular block to cover size bytes, the GetSpace(size) operation will succeed even if all the virtual pages in that block are disabled. Moreover, among non-software-owned blocks, a GetSpace operation will be served first using a block that has all its virtual pages disabled. If more than one such block exists, the smallest block number is chosen. If size is 0, GetSpace(size) returns `0`.
[0190] The second operation software can perform on the PMMU is FreeSpace(packetPage). In this operation the PMU de-allocates atomic pages that were previously allocated (starting at packetPage). This space might have been either automatically allocated by the PMMU as a result of an incoming packet, or as a result of a GetSpace command. FreeSpace does not return any result to the software. A FreeSpace operation on a block with its SoftwareOwned bit asserted is disregarded (nothing is done and no result will be provided to the SPU).
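The request/free behavior of GetSpace and FreeSpace can be sketched in software as follows. This is a simplified model under stated assumptions, not the PMMU's actual search: it models a single block, uses a first-fit scan, ignores the SoftwareOwned and EnableVector rules, records allocation lengths in a dictionary, and treats a zero-size request as a failure. The class and method names are hypothetical.

```python
ATOMIC_PAGE_BYTES = 256  # atomic page size given in the text


class SingleBlockAllocator:
    """Hypothetical one-block model of GetSpace/FreeSpace."""

    def __init__(self, num_pages: int):
        self.allocated = [False] * num_pages
        self.run_length = {}  # packetPage -> number of atomic pages held

    def get_space(self, size: int):
        """Return (success_bit, packet_page) for a request of size bytes."""
        if size <= 0:
            return 0, 0  # assumption: a zero-size request reports failure
        pages = -(-size // ATOMIC_PAGE_BYTES)  # ceiling division
        run = 0
        for page, used in enumerate(self.allocated):
            run = 0 if used else run + 1
            if run == pages:
                start = page - pages + 1
                for p in range(start, page + 1):
                    self.allocated[p] = True
                self.run_length[start] = pages
                return 1, start
        return 0, 0

    def free_space(self, packet_page: int) -> None:
        """De-allocate the pages starting at packet_page; nothing is returned."""
        pages = self.run_length.pop(packet_page, 0)
        for p in range(packet_page, packet_page + pages):
            self.allocated[p] = False
```

In this model a request is rounded up to whole atomic pages, so a 300-byte request consumes two pages, and freeing an unknown packetPage is silently ignored.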
[0191] Local Packet Memory
[0192] Local Packet Memory (LPM), illustrated as element 219 in FIGS. 2 and 7, has in the instant embodiment a size of 256 KB, 16-byte line width with byte enables, 2 banks (even/odd), one Read and one Write port per bank, is fully pipelined, and has one cycle latency.
[0193] The LPM in packet processing receives read and write requests from both the PMMU and the SIU. An LPM controller guarantees that requests from the PMMU have the highest priority. The PMMU reads at most one packet while writing another one. The LPM controller guarantees that the PMMU will always have dedicated ports to the LPM.
[0194] Malicious software could read/write the same data that is being written/read by the PMMU. Thus, there is no guarantee that the read and write accesses in the same cycle are performed to different 16-byte line addresses.
[0195] A request to the LPM is defined in this example as a single access (either read or write) of 16-bytes. The SIU generates several requests for a masked load or store, which are new instructions known to the inventors and the subject of at least one separate patent application. Therefore, a masked load/store operation can be stalled in the middle of these multiple requests if the highest priority PMMU access needs the same port.
[0196] When the PMMU reads or writes, the byte enable signals are assumed to be set (i.e. all 16 bytes in the line are either read or written). When the SIU drives the reads or writes, the byte enable signals are meaningful and are provided by the SIU.
[0197] When the SPU reads a single byte/word in the LPM, the SIU reads the corresponding 16-byte line and performs the extraction and right alignment of the desired byte/word. When the SPU writes a single byte/word, the SIU generates a 16-byte line with the byte/word in the correct location, plus the valid bytes signals.
[0198] Prioritization Among Operations
[0199] The PMMU may receive up to three requests from three different sources (IB, QS and software) to perform operations. For example, requests may come from the IB and/or Software: to perform a search for a contiguous chunk of space, to allocate the corresponding atomic page sizes and to provide the generated packetPage. Requests may also come from the QS and/or Software to perform the de-allocation of the atomic pages associated to a given packetPage.
[0200] It is required that the first of these operations takes no more than 2 cycles, and the second no more than one. The PMMU executes only one operation at a time. From highest to lowest, the PMMU block will give priority to requests from: IB, QS and Software.
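The fixed priority among the three request sources can be sketched as a simple selector. The function name and the representation of pending requests as a set of strings are assumptions for illustration.

```python
PRIORITY_ORDER = ("IB", "QS", "Software")  # highest to lowest, per the text


def select_request(pending):
    """Return the source served next, or None if nothing is pending.

    pending is a set of source names with an outstanding request.
    Only one operation is executed at a time.
    """
    for source in PRIORITY_ORDER:
        if source in pending:
            return source
    return None
```

With requests pending from both the QS and software, the QS request is served first; software is served only when neither the IB nor the QS has a request.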
[0201] Early full-PMMU Detection
[0202] The PMU implements a mechanism to aid in flow control between any external device and the XCaliber processor. Part of this mechanism is to detect that the LPM is becoming full and, in this case, a NoMorePagesOfXsizeInt interrupt is generated to the SPU. The EPM is software controlled and, therefore, its state is not maintained by the PMMU hardware.
[0203] The software can enable the NoMorePagesOfXsizeInt interrupt by specifying a virtual page size s. Whenever the PMMU detects that no more virtual pages of that size are available (i.e. FitsVector[s] is de-asserted for all the blocks), the interrupt is generated. The larger the virtual page size selected, the sooner the interrupt will be generated. The size of the virtual page will be indicated with a 4-bit value (0:256 bytes, 1:512 bytes, . . . , 8:64 KB) in an on-the-fly configuration register IntIfNoMoreThanXsizePages. When this value is greater than 8, the interrupt is never generated.
[0204] If the smallest virtual page size is selected (256 bytes), the NoMorePagesOfXsizeInt interrupt is generated when the LPM is completely full (i.e. no more packets are accepted, not even a 1-byte packet).
[0205] In general, if the IntIfNoMoreThanXsizePages is X, the soonest the interrupt will be generated is when the local packet memory is (100/2^X)% full. Note that, because of the atomic pages being 256 bytes, the LPM could become full with only 3 K-bytes of packet data (3 bytes per packet, each packet using an atomic page).
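The arithmetic behind these figures can be checked with a short sketch. The function names are hypothetical; the constants (256 KB LPM, 256-byte atomic pages, size codes 0 through 8) come from the text.

```python
LPM_BYTES = 256 * 1024   # LPM size in the instant embodiment
ATOMIC_PAGE_BYTES = 256  # atomic page size


def virtual_page_bytes(code: int) -> int:
    """Virtual page size for a size code (0: 256 bytes ... 8: 64 KB)."""
    if not 0 <= code <= 8:
        raise ValueError("size codes above 8 never generate the interrupt")
    return 256 << code


def earliest_interrupt_fill_percent(code: int) -> float:
    """Lowest LPM fill level, in percent, at which the interrupt can fire."""
    return 100 / (2 ** code)


def min_data_to_fill_lpm(bytes_per_packet: int) -> int:
    """Packet bytes that fill the LPM when each packet takes one atomic page."""
    return (LPM_BYTES // ATOMIC_PAGE_BYTES) * bytes_per_packet
```

The LPM holds 1024 atomic pages, so 1024 packets of 3 bytes each (3072 bytes, or 3 K-bytes) are enough to occupy every page.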
[0206] Packet Size Mismatch
[0207] The PMMU keeps track of how many bytes are being uploaded into the LPM or EPM. If this size is different from the size specified in the first two bytes, a PacketErrorInt interrupt is generated to the SPU. In this case the packet with the mismatched size is dropped (the already allocated atomic pages will be de-allocated and no packetPage will be created). No AutomaticDropint interrupt is generated in this case. If the actual size is more than the size specified in the first two bytes, the remaining packet data being received from the ASIC is gracefully discarded.
[0208] When a packet size mismatch is detected on an inbound device identifier D (D=0,1), the following packets received from that same device identifier are dropped until software writes (any value) into a ClearErrorD configuration register.
[0209] Bus Error Recovering
[0210] Faulty packet data can arrive to or leave the PMU due to external bus errors. In particular the network input interface may notify that the 16-byte chunk of data sent in has a bus error, or the SIU may notify that the 16-byte chunk of data downloaded from EPM has a bus error. In both cases, the PMMU generates the PacketErrorInt interrupt to notify the SPU about this event. No other information is provided to the SPU.
[0211] Note that if an error is generated within the LPM, it will not be detected since no error detection mechanism is implemented in this on-chip memory. Whenever a bus error arises, no more data of the affected packet will be received by the PMU. This is done by the SIU in both cases. For the first case the PMMU needs to de-allocate the already allocated atomic pages used for the packet data received previous to the error event.
[0212] When a bus error is detected on an inbound device identifier D (D=0,1), the following packets received from that same device identifier are dropped until software writes (any value) into a ClearErrorD (D=0,1) configuration register.
[0213] Queuing System (QS)
[0214] The queueing system (QS) in the PMU of the XCaliber processor has functions of holding packet identifiers and the state of the packets currently in-process in the XCaliber processor, keeping packets sorted by their default or software-provided priority, selecting the packets that need to be pre-loaded (in the background) into one of the available contexts, and selecting those processed packets that are ready to be sent out to an external device.
[0215] FIG. 9 is a block diagram showing the high-level communication between the QS and other blocks in the PMU and SPU. When the PMMU creates a packetPage, it is sent to the QS along with a queue number and the device identifier. The QS enqueues that packetPage in the corresponding queue and associates a number (packetNumber) to that packet. Eventually, the packet is selected and provided to the RTU, which loads the packetPage, packetNumber and selected fields of the packet header into an available context. Eventually the SPU processes that context and communicates to the PMU, among other information, when the processing of the packet is completed or the packet has been dropped. For this communication, the SPU provides the packetNumber as the packet identifier. The QS marks that packet as completed (in the first case) and the packet is eventually selected for downloading from packet memory.
[0216] It is a requirement in the instant embodiment (and highly desirable) that packets of the same flow (same source and destination) need to be sent out to the external device in the same order as they arrived to the XCaliber processor (unless software explicitly breaks this ordering). When the SPU begins to process a packet the flow is not known. Keeping track of the ordering of packets within a flow is a costly task because of the amount of processing needed and because the number of active flows can be very large, depending on the application. Thus, the order within a flow is usually tracked by using aggregated-flow queues. In an aggregated-flow queue, packet identifiers from different flows are treated as from the same flow for ordering purposes.
[0217] The QS offloads the costly task of maintaining aggregated-flow queues by doing it in hardware and in the background. Up to 32 aggregated-flow queues can be maintained in the current embodiment, and each of these queues has an implicit priority. Software can enqueue a packetPage in any of the up to 32 queues, and can move a packetPage identifier from one queue to another (for example, when the priority of that packet is discovered by the software). It is expected that software, if needed, will enqueue all the packetPage identifiers of the packets that belong to the same flow into the same queue. Otherwise, a drop in the performance of the network might occur, since packets will be sent out of order within the same flow. Without software intervention, the QS guarantees the per-flow order of arrival.
[0218] Generic Queue
[0219] The QS implements a set of up to 32 FIFO-like queues, which are numbered, in the case of 32 queues, from 0 to 31. Each queue can have up to 256 entries. The total number of entries across all the queues, however, cannot exceed 256. Thus, queue sizes are dynamic. A queue entry corresponds to a packetPage identifier plus some other information. Up to 256 packets are therefore allowed to be in process at any given time in the XCaliber processor. This maximum number is not visible to software.
[0220] Whenever the QS enqueues a packetPage, a number (packetNumber) from 0 to 255 is assigned to the packetPage. This number is provided to the software along with the packetPage value. When the software wants to perform an operation on the QS, it provides the packetNumber identifier. This identifier is used by the QS to locate the packetPage (and other information associated to the corresponding packet) in and among its queues.
[0221] Software is aware that the maximum number of queues in the XCaliber processor is 32. Queues are disabled unless used. That is, the software does not need to decide how many queues it needs up front. A queue becomes enabled when at least one packet is in residence in that queue.
[0222] Several packet identifiers from different queues can become candidates for a particular operation to be performed. Therefore, some prioritization mechanism must exist to select the packet identifier to which an operation will be applied first. Software can configure the relative priority among the queues using an "on-the-fly" configuration register PriorityClusters. This is a 3-bit value that specifies how the different queues are grouped in clusters. Each cluster has an associated priority (the higher the cluster number, the higher the priority). The six different modes in the instant embodiment into which the QS can be configured are shown in the table of FIG. 10.
[0223] The first column of FIG. 10 is the value in the "on-the-fly" configuration register PriorityClusters. Software controls this number, which defines the QS configuration. For example, for PriorityClusters=2, the QS is configured into four clusters, with eight queues per cluster. The first of the four clusters will have queues 0 through 7, the second cluster will have queues 8 through 15, the third cluster will have queues 16 through 23, and the last of the four clusters will have queues 24 through 31.
[0224] Queues within a cluster are treated fairly in a round robin fashion. Clusters are treated in a strict priority fashion. Thus, the only mode that guarantees no starvation of any queue is when PriorityClusters is 0, meaning one cluster of 32 queues.
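The grouping of queues into clusters can be sketched as below. The sketch assumes that a PriorityClusters value of n yields 2^n equally sized clusters for n from 0 to 5; the text confirms only the cases n=0 (one cluster of 32 queues) and n=2 (four clusters of eight queues), and FIG. 10 is the authoritative table. The function name is hypothetical.

```python
NUM_QUEUES = 32


def cluster_of(queue: int, priority_clusters: int) -> int:
    """Cluster number of a queue (higher cluster number = higher priority).

    Assumes 2**priority_clusters equal clusters, which matches the two
    examples given in the text but is otherwise an assumption.
    """
    if not 0 <= priority_clusters <= 5:
        raise ValueError("only six modes (0..5) are assumed here")
    if not 0 <= queue < NUM_QUEUES:
        raise ValueError("queue number out of range")
    queues_per_cluster = NUM_QUEUES >> priority_clusters
    return queue // queues_per_cluster
```

Under this assumption, with PriorityClusters=2 queue 23 falls in cluster 2 and queue 24 in cluster 3, while with PriorityClusters=0 every queue is in cluster 0 and is served round robin.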
[0225] Inserting a PacketPage/DeviceId Into the QS
[0226] FIG. 11 is a diagram illustrating the generic architecture of QS 211 of FIGS. 2 and 7 in the instant embodiment. Insertion of packetPages and DeviceId information is shown as arrows directed toward the individual queues (in this case 32 queues). The information may be inserted from three possible sources, these being the PMMU, the SPU and re-insertion from the QS. There exists priority logic, illustrated by function element 1101, for the case in which two or more sources have a packetPage ready to be inserted into the QS. In the instant embodiment the priority is, in descending priority order, the PMMU, the QS, and the SPU (software).
[0227] Regarding insertion of packets from the SPU (software), the software can create packets on its own. To do so, it first requests a consecutive chunk of free space of a given size (see the SPU documentation) from the PMU, and the PMU returns a packetPage in case the space is found. The software needs to explicitly insert that packetPage for the packet to be eventually sent out. When the QS inserts this packetPage, the packetNumber created is sent to the SPU. Software requests an insertion through the Command Unit (see FIG. 2).
[0228] In the case of insertion from the QS, an entry residing at the head of a queue may be moved to the tail of another queue. This operation is shown as selection function 1103.
[0229] In the case of insertion from the PMU, when a packet arrives to the XCaliber processor, the PMMU assigns a packetPage to the packet, which is sent to the QS as soon as the corresponding packet is safely stored in packet memory.
[0230] An exemplary entry in a queue is illustrated as element 1105, and has the following fields: Valid (1) validates the entry. PacketPage (16) is the first atomic page number in memory used by the packet. NextQueue (5) may be different from the queue number the entry currently belongs to, and if so, this number indicates the queue into which the packetPage needs to be inserted next when the entry reaches the head of the queue. Delta (10) contains the number of bytes that the header of the packet has either grown or shrunk. This value is coded in 2's complement. Completed (1) is a single bit that indicates whether software has finished the processing of the corresponding packet. DeviceId (2) is the device identifier associated to the packet. Before a Complete operation is performed on the packet (described below) the DeviceId field contains the device identifier of the external device that sent the packet in. After the Complete operation, this field contains the device identifier of the device to which the packet will be sent. Active (1) is a single bit that indicates whether the associated packet is currently being processed by the SPU. CRCtype (2) indicates to the network output interface which type of CRC, if any, needs to be computed for the packet. Before the Complete operation is performed on the packet, this field is 0. KeepSpace (1) specifies whether the atomic pages that the packet occupies in the LPM will be de-allocated (KeepSpace de-asserted) by the PMMU or not (KeepSpace asserted). If the packet resides in EPM this bit is disregarded by the PMMU.
[0231] The QS needs to know the number of the queue to which the packetPage will be inserted. When software inserts the packetPage, the queue number is explicitly provided by an XStream packet instruction, which is a function of the SPU, described elsewhere in this specification. If the packetPage is inserted by the QS itself, the queue number is the value of the NextQueue field of the entry where the packetPage resides.
[0232] When a packetPage is inserted by the PMMU, the queue number depends on how the software has configured (at boot time) the Log2InputQueues configuration register. If Log2InputQueues is set to 0, all the packetPages for the incoming packets will be enqueued in the same queue, which is specified by the on-the-fly configuration register FirstInputQueue. If Log2InputQueues is set to k (1<=k<=5), then the k MSB bits of the 3rd byte of the packet determine the queue number. Thus an external device (or the network input interface block of the SIU) can assign up to 256 priorities for each of the packets sent into the PMU. The QS maps those 256 priorities into 2^k, and uses queue numbers FirstInputQueue to FirstInputQueue+2^k-1 to insert the packetPages and deviceId information of the incoming packets.
[0233] It is expected that an external device will send the same 5 MSB bits in the 3rd byte for all packets in the same flow. Otherwise, a drop in the performance of the network might occur, since packets may be sent back to the external device out-of-order within the same flow. Software is aware of whether or not the external device (or SIU) can provide the information of the priority of the packet in the 3rd byte.
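The mapping from the 3rd byte of a packet to an input queue can be sketched as follows. The function name is hypothetical, and the sketch assumes that the k most significant bits are read as an unsigned value and added to FirstInputQueue, which is consistent with the stated queue range but is not spelled out in the text.

```python
def input_queue(third_byte: int, log2_input_queues: int,
                first_input_queue: int) -> int:
    """Queue number used by the PMMU for an incoming packet.

    third_byte is the value (0..255) of the 3rd byte of the packet.
    """
    if not 0 <= log2_input_queues <= 5:
        raise ValueError("Log2InputQueues must be 0..5")
    if not 0 <= third_byte <= 0xFF:
        raise ValueError("third_byte must fit in one byte")
    k = log2_input_queues
    # Keep only the k most significant bits of the byte; for k == 0 the
    # shift by 8 yields 0, so every packet goes to FirstInputQueue.
    return first_input_queue + (third_byte >> (8 - k))
```

For instance, with Log2InputQueues set to 3 and FirstInputQueue set to 8, a 3rd byte of binary 10100000 selects queue 13 in this sketch.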
[0234] When packetPage p is inserted into queue q, the PacketPage field of the entry to be used is set to p and the Valid field to `1`. The values of the other fields depend on the source of the insertion. If the source is software (SPU), Completed is `0`; NextQueue is provided by SPU; DeviceId is `0`; Active is `1`; CRCtype is 0; KeepSpace is 0, and Probed is 0.
[0235] If the source is the QS, the remaining fields are assigned the value they have in the entry in which the to-be-inserted packetPage currently resides. If the source is the PMMU, Completed is `0`, NextQueue is q, DeviceId is the device identifier of the external device that sent the packet into XCaliber, Active is `0`, CRCtype is 0, KeepSpace is 0, and Probed is 0.
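The entry fields and the per-source initial values described above can be summarized in a minimal sketch. The class and function names are hypothetical, and the initial value of Delta is not stated in the text, so the zero used here is an assumption.

```python
from dataclasses import dataclass, replace


@dataclass
class QueueEntry:
    valid: int = 0        # 1 bit
    packet_page: int = 0  # 16 bits
    next_queue: int = 0   # 5 bits
    delta: int = 0        # 10 bits, two's complement (initial 0 is assumed)
    completed: int = 0    # 1 bit
    device_id: int = 0    # 2 bits
    active: int = 0       # 1 bit
    crc_type: int = 0     # 2 bits
    keep_space: int = 0   # 1 bit
    probed: int = 0       # probe bit


def entry_from_software(packet_page: int, next_queue: int) -> QueueEntry:
    """Entry inserted by the SPU: Active is 1, DeviceId is 0."""
    return QueueEntry(valid=1, packet_page=packet_page,
                      next_queue=next_queue, device_id=0, active=1)


def entry_from_pmmu(packet_page: int, queue: int,
                    inbound_device_id: int) -> QueueEntry:
    """Entry inserted by the PMMU: NextQueue is q, Active is 0."""
    return QueueEntry(valid=1, packet_page=packet_page, next_queue=queue,
                      device_id=inbound_device_id, active=0)


def entry_from_qs(existing: QueueEntry) -> QueueEntry:
    """Re-insertion by the QS keeps every field of the existing entry."""
    return replace(existing)
```

The main difference between the two external sources is the Active bit: a software-created packet is already owned by a stream, while a packet arriving through the PMMU still has to be offered to the RTU.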
[0236] Monitoring Logic
[0237] The QS monitors entries into all of the queues to detect certain conditions and to perform the corresponding operation, such as to re-enqueue an entry, to send a packetPage (plus some other information) to the PMMU for downloading, or to send a packetPage (plus some other information) to the RTU.
[0238] All detections take place in a single cycle and they are done in parallel.
[0239] Re-enqueuing an Entry
[0240] The QS monitors all the head entries of the queues to determine whether a packet needs to be moved to another queue. Candidate entries to be re-enqueued need to be valid, be at the head of a queue, and have the NextQueue field value different from the queue number of the queue in which the packet currently resides.
[0241] If more than one candidate exists for re-enqueueing, the chosen entry will be selected following a priority scheme described later in this specification.
[0242] Sending an Entry to the PMMU for Downloading
[0243] The QS monitors all the head entries of the queues to determine whether a packet needs to be downloaded from the packet memory. This operation is 1102 in FIG. 11. The candidate entries to be sent out of XCaliber need to be valid, be at the head of the queue, have the NextQueue field value the same as the queue number of the queue in which the packet currently resides, and have the Completed flag asserted and the Active flag de-asserted. Moreover the QS needs to guarantee that no pending reads or writes exist from the same context that has issued the download command to the QS.
[0244] If more than one candidate exists for downloading, the chosen entry will be selected following a priority scheme described later in this specification.
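The candidate conditions for re-enqueuing and for downloading can be expressed as two predicates on the entry at the head of a queue. The names are hypothetical, and the sketch leaves out both the check for pending reads or writes from the issuing context and the priority scheme that picks among several candidates.

```python
from dataclasses import dataclass


@dataclass
class HeadEntry:
    """Fields of the entry at the head of a queue that the checks use."""
    valid: bool
    next_queue: int
    completed: bool
    active: bool


def needs_reenqueue(head: HeadEntry, queue_number: int) -> bool:
    """Valid head entry whose NextQueue differs from its current queue."""
    return head.valid and head.next_queue != queue_number


def ready_for_download(head: HeadEntry, queue_number: int) -> bool:
    """Valid head entry that stays in this queue, is Completed, not Active."""
    return (head.valid and head.next_queue == queue_number
            and head.completed and not head.active)
```

The two predicates are mutually exclusive for a given head entry, since one requires NextQueue to differ from the current queue and the other requires it to match.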
[0245] A selected candidate will only be sent to the PMMU if the PMMU requested it. If the candidate was requested, the selected packetPage, along with the cluster number from which it is extracted, the tail growth/shrink, the outbound device identifier bit, the CRCtype and the KeepSpace bits are sent to the PMMU.
[0246] FIG. 12 is a table indicating coding of the DeviceId field. If the DeviceId field is 0, then the Outbound Device Identifier is the same as the Inbound Device Identifier, and so on as per the table.
[0247] When an entry is sent to the PMMU, the entry is marked as "being transmitted" and it is extracted from the queuing system (so that it does not block other packets that are ready to be transmitted and go to a different outbound device identifier). However, the entry is not invalidated until the PMMU notifies that the corresponding packet has been completely downloaded. Thus, probe-type operations on this entry will be treated as valid, i.e. as still residing in the XCaliber processor.
[0248] Reincarnation Effect
[0249] As described above, the QS assigns a packetNumber from 0 to 255 (256 numbers in total) to each packet that comes into XCaliber and is inserted into a queue. This is done by maintaining a table of 256 entries into which packet identifiers are inserted. At this time the Valid bit in the packet identifier is also asserted. Because the overall number of packets dealt with by XCaliber far exceeds 256, packet numbers, of course, have to be reused throughout the running of the XCaliber processor. Therefore, when packets are selected for downloading, at some point the packetNumber is no longer associated with a valid packet in process, and the number may be reused.
[0250] As long as a packet is valid in XCaliber it is associated with the packetNumber originally assigned. The usual way in which a packetNumber becomes available to be reused is that a packet is sent by the QS to the RTU for preloading in a context prior to processing. Then when the packet is fully processed and fully downloaded from memory, the packet identifier in the table associating packetNumbers is marked Invalid by manipulating the Valid bit (see FIG. 11 and the text accompanying).
[0251] In usual operation the system thus far described is perfectly adequate. It has been discovered by the inventors, however, that there are some situations in which the Active and Valid bits are not sufficient to avoid contention between streams. One of these situations has to do with a clean-up process, sometimes termed garbage collection, in which software monitors all packet numbers to determine when packets have remained in the system too long, and discards packets under certain conditions, freeing space in the system for newly-arriving packets.
[0252] In these special operations, like garbage collection, a stream must gain ownership of a packet, and assure that the operation it is to perform on the packet actually gets performed on the correct packet. As software probes packets, however, and before action may be taken, because there are several streams operating, and because the normal operation of the system may also send packets to the RTU, for example, it is perfectly possible in these special operations that a packet probed may be selected and affected by another stream before the special operation is completed. A packet, for example, may be sent to the RTU, processed, and downloaded, and a new packet may then be assigned to the packetNumber, and the new packet may even be stored at exactly the same packetPage as the original packet. There is a danger, then, that the special operations, such as discarding a packet in the garbage collection process, may discard a new and perfectly valid packet, instead of the packet originally selected to be discarded. This, of course, is just one of potentially many such special operations that might lead to trouble.
[0253] Considering the above, the inventors have provided a mechanism for assuring that, given two different absolute points in time, time s and time r, for example, a packetNumber that is valid at time s is still associated to the same packet at time r. A simple probe operation is not enough, because at some time after s and before time r the associated packet may be downloaded, and another (and different) packet may have arrived, been stored in exactly the same memory location as the previous packet, and been assigned the same packetNumber as the downloaded packet.
[0254] The mechanism implemented in XCaliber to ensure packetNumber association with a specific packet at different times includes a probe bit in the packet identifier. When a first stream, performing a process such as garbage collection, probes a packet, a special command, called Probe&Set is used. Probe&Set sets (asserts) the probe bit, and the usual information is returned, such as the value for the Valid bit, the Active bit, the packetPage address, and the old value of the probe bit. The first stream then executes a Conditional Activate instruction, described elsewhere in this specification, to gain ownership of the packet. Also, when the queuing system executes this Conditional Activate instruction it asserts the active bit of the packet. Now, at any time after the probe bit is set by the first stream, when a second stream at a later time probes the same packet, the asserted probe bit indicates that the first stream intends to gain control of this packet. The second stream now knows to leave this packet alone. This probe bit is de-asserted when a packet enters the XCaliber processor and a new (non-valid) number is assigned.
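The effect of the probe bit can be illustrated with a small software model. The class and method names are hypothetical, the Conditional Activate instruction is not modeled, and the model covers only packets that arrive with the Active bit de-asserted.

```python
class PacketTable:
    """Hypothetical model of the 256-entry packetNumber table."""

    def __init__(self, size: int = 256):
        self.valid = [False] * size
        self.active = [False] * size
        self.packet_page = [0] * size
        self.probe = [False] * size

    def assign(self, number: int, packet_page: int) -> None:
        """A newly arrived packet takes this packetNumber.

        Per the text, the probe bit is de-asserted at this point.
        """
        self.valid[number] = True
        self.active[number] = False
        self.packet_page[number] = packet_page
        self.probe[number] = False

    def probe_and_set(self, number: int):
        """Return (valid, active, packetPage, old probe bit), then set probe."""
        old_probe = self.probe[number]
        self.probe[number] = True
        return (self.valid[number], self.active[number],
                self.packet_page[number], old_probe)
```

In this model a second stream that issues Probe&Set on the same packetNumber sees the old probe bit asserted and leaves the packet alone. If the packetNumber has been reassigned to a new packet in the meantime, even at the same packetPage, the old probe bit reads as de-asserted, which distinguishes the new packet from the one originally probed.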
[0255] Sending an Entry to the RTU
[0256] The RTU uploads in the SPU background to the XCaliber processor some fields of the headers of packets that have arrived, and have been completely stored into packet memory. This uploading of the header of a packet in the background may occur multiple times for the same packet. The QS keeps track of which packets need to be sent to the RTU. The selection operation is illustrated in FIG. 11 as 1104.
[0257] Whenever the RTU has chosen a context to pre-load a packet, it notifies the QS that the corresponding packet is no longer an inactive packet. The QS then marks the packet as active.
[0258] Candidate entries to be sent to the RTU need to be valid, to be the oldest entry with the Active and Completed bits de-asserted, to have the NextQueue field value the same as the queue number of the queue in which the packet currently resides, and to conform to a limitation that no more than a certain number of packets in the queue in which the candidate resides are currently being processed in the SPU. More detail regarding this limitation is provided later in this specification. When an entry is sent to the RTU for pre-loading, the corresponding Active bit is asserted.
[0259] A queue can have entries with packet identifiers that already have been presented to the RTU and entries that still have not. Every queue has a pointer (NextPacketForRTU) that points to the oldest entry within that queue that needs to be sent to the RTU. Within a queue, packet identifiers are sent to the RTU in the same order they were inserted in the queue.
[0260] The candidate packet identifiers to be sent to the RTU are those pointed to by the different NextPacketForRTU pointers associated with the queues. However, some of these pointers might point to a non-existent entry (for example, when the queue is empty or when all the entries have already been sent to the RTU). The hardware that keeps track of the state of each of the queues determines these conditions. Besides being a valid entry pointed to by a NextPacketForRTU pointer, the candidate entry needs to have associated with it an RTU priority (described later in this specification) currently not being used by another entry in the RTU. If more than a single candidate exists, the chosen entry is selected following a priority scheme described later in this specification.
[0261] As opposed to the case in which an entry is sent to the PMMU for downloading, an entry sent to the RTU is not extracted from its queue. Instead, the corresponding NextPacketForRTU pointer is updated, and the corresponding Active bit is asserted.
[0262] The QS sends entries to an 8-entry table in the RTU block as long as the entry is a valid candidate and the corresponding slot in the RTU table is empty. The RTU will accept, at most, 8 entries, one per each interrupt that the RTU may generate to the SPU.
[0263] The QS maps the priority of the entry (given by the queue number where it resides) that it wants to send to the RTU into one of the 8 priorities handled by the RTU (RTU priorities). This mapping is shown in the table of FIG. 13, and it depends on the number of clusters into which the different queues are grouped (configuration register PriorityClusters) and the queue number in which the entry resides.
[0264] The RTU has a table of 8 entries, one for each RTU priority. Every entry contains a packet identifier (packetPage, packetNumber, queue#) and a Valid bit that validates it. The RTU always accepts a packet identifier of RTU priority p if the corresponding Valid bit in entry p of that table is de-asserted. When the RTU receives a packet identifier of RTU priority p from the QS, the Valid bit of entry p in the table is asserted, and the packet identifier is stored. At that time the QS can update the corresponding NextPacketForRTU pointer.
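The acceptance rule for the RTU table can be illustrated with a minimal Python sketch. The class and method names are hypothetical, and the mapping from queue number to RTU priority (FIG. 13) is deliberately left out; the caller is assumed to supply an RTU priority between 0 and 7.

```python
class RtuTable:
    """Hypothetical model of the RTU's 8-entry table, one slot per RTU priority."""

    NUM_PRIORITIES = 8

    def __init__(self):
        # None models a de-asserted Valid bit for that slot.
        self.slots = [None] * self.NUM_PRIORITIES

    def offer(self, priority, packet_page, packet_number, queue):
        """Accept the identifier only if slot `priority` is empty."""
        if self.slots[priority] is not None:
            return False  # slot occupied: the QS must keep the entry for later
        self.slots[priority] = (packet_page, packet_number, queue)
        return True  # on acceptance the QS may advance NextPacketForRTU
```

A second offer at the same RTU priority is refused until the slot is emptied, which is why at most one packet identifier per RTU priority is pending in the table at any time.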
[0265] Limiting the Packets Sent Within a Queue
[0266] Software can limit the number of packets that can be active (i.e. being processed by any of the streams in the SPU) on a per-queue basis. This is achieved through a MaxActivePackets on-the-fly configuration register, which specifies, for each queue, a value between 1 and 256 that corresponds to the maximum number of packets, within that queue, that may be in process by the streams at any one time.
[0267] The QS maintains a counter for each queue q which keeps track of the current number of packets active for queue q. This counter is incremented whenever a packet identifier is sent from queue q to the RTU, a Move operation moves a packet into queue q, or an Insert operation inserts a packet identifier into queue q; and decremented when any one of the following operations is performed on any valid entry in queue q: a Complete operation, an Extract operation, a Move operation that moves the entry to a different queue, or a MoveAndReactivate operation that moves the entry to any queue (even to the same queue). Move, MoveAndReactivate, Insert, Complete and Extract are operations described elsewhere in this specification.
[0268] Whenever the value of the counter for queue q is equal to or greater than the corresponding maximum value specified in the MaxActivePackets configuration register, no entry from queue q is allowed to be sent to the RTU. The value of the counter could be greater since software can change the MaxActivePackets configuration register for a queue to a value lower than the counter value at the time of the change, and a queue can receive a burst of moves and inserts.
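The per-queue gating can be sketched as follows. This is a simplified Python model with hypothetical names; it collapses all the incrementing and decrementing operations listed above into two generic methods and is not meant to reflect the hardware implementation.

```python
class QueueActivity:
    """Hypothetical per-queue model of the active-packet counter and its limit."""

    def __init__(self, max_active):
        self.set_max(max_active)
        self.active = 0

    def set_max(self, max_active):
        """Model an on-the-fly write to MaxActivePackets for this queue."""
        if not 1 <= max_active <= 256:
            raise ValueError("MaxActivePackets must be between 1 and 256")
        self.max_active = max_active

    def increment(self):
        """Entry sent to the RTU, moved into the queue, or inserted."""
        self.active += 1

    def decrement(self):
        """Entry completed, extracted, or moved away."""
        self.active -= 1

    def may_send_to_rtu(self):
        """No entry may be sent while the counter is at or above the limit."""
        return self.active < self.max_active
```

Because the comparison is "equal to or greater than", lowering the limit below the current counter value simply keeps the queue closed to the RTU until enough packets have been completed or extracted.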
[0269] Software Operations on the QS
[0270] Software executes several instructions that affect the QS. The following is a list of all operations that can be generated to the QS as a result of the dispatch by the SPU core of an XStream packet instruction:
[0271] Insert(p,q): the packetPage p is inserted into queue q. A `1` will be returned to the SPU if the insertion was successful, and a `0` if not. The insertion will be unsuccessful only when no entries are available (i.e. when all the 256 entries are valid).
[0272] Move(n,q): sets to q the NextQueue field of the entry in which packetNumber n resides.
[0273] MoveAndReactivate(n,q): sets to q the NextQueue field of the entry in which packetNumber n resides; de-asserts the Active bit.
[0274] Complete(n,d,e): asserts the Completed flag and sets the Delta field to d and the deviceId field to e of the entry in which packetNumber n resides. De-asserts the Active bit and de-asserts the KeepSpace bit.
[0275] CompleteAndKeepSpace(n,d,e): same as Complete( ) but it asserts the KeepSpace bit.
[0276] Extract(n): resets the Valid flag of the entry in which packetNumber n resides.
[0277] Replace(n,p): the PacketPage field of the entry in which packetNumber n resides is set to packetPage p.
[0278] Probe(n): the information whether the packetNumber n exists in the QS or not is returned to the software. In case it exists, it returns the PacketPage, Completed, NextQueue, DeviceId, CRCtype, Active, KeepSpace and Probed fields.
[0279] ConditionalActivate(n): returns a `1` if the packetNumber n is valid, Probed is asserted, Active is de-asserted, and the packet is not being transmitted. In this case, the Active bit is asserted.
[0280] The QS queries the RTU to determine whether the packet identifier of the packet to be potentially activated is in the RTU table, waiting to be preloaded, or being preloaded. If the packet identifier is in the table, the RTU invalidates it. If the query happens simultaneously with the start of preloading of that packet, the QS does not activate the packet.
[0281] ProbeAndSet(n): same as Probe( ) but it asserts the Probed bit (the returned Probed bit is the old Probed bit).
[0282] Probe(q): provides the size (i.e. number of valid entries) in queue q.
[0283] A Move( ), MoveAndReactivate( ), Complete( ), CompleteAndKeepSpace( ), Extract( ) or Replace( ) operation on an invalid (i.e. non-existing) packetNumber is disregarded (no interrupt is generated).
[0284] A Move, MoveAndReactivate, Complete, CompleteAndKeepSpace, Extract and Replace on a valid packetNumber with the Active bit de-asserted should not happen (guaranteed by software). If it happens, results are undefined. Only the Insert, Probe, ProbeAndSet and ConditionalActivate operations reply back to the SPU.
[0285] If software issues two move-like operations to the PMU that affect the same packet, results are undefined, since there is no guarantee that the moves will happen as software specified.
[0286] FIG. 14 is a table showing allowed combinations of Active, Completed, and Probed bits for a valid packet.
[0287] Basic Operations
[0288] To support the software operations and the monitoring logic, the QS implements the following basic operations:
[0289] 1. Enqueue an entry at the tail of a queue.
[0290] 2. Dequeue an entry from the queue in which it resides.
[0291] 3. Move an entry from the head of the queue wherein it currently resides to the tail of another queue.
[0292] 4. Provide an entry of a queue to the RTU.
[0293] 5. Provide the size of a queue.
[0294] 6. Update any of the fields associated to packetNumber.
[0295] Operations 1, 2, 4 and 6 above (applied to different packets at the same time) are completed in 4 cycles in a preferred embodiment of the present invention. This implies a throughput of one operation per cycle.
[0296] Some prioritization is necessary when two or more operations could start to be executed at the same time. From highest to lowest priority, these events are inserting from the PMMU, dequeuing an entry, moving an entry from one queue to another queue, sending an entry to the RTU for pre-loading, or a software operation. The prioritization among the software operations is provided by design since software operations are always executed in order.
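The fixed priority order just described can be expressed as a small arbiter. The following Python sketch uses hypothetical event names and only models the selection among events that are ready to start in the same cycle.

```python
# Highest priority first, following the order given in the text.
PRIORITY_ORDER = ["pmmu_insert", "dequeue", "move", "send_to_rtu", "software_op"]


def arbitrate(pending):
    """Return the highest-priority event in `pending`, or None if it is empty."""
    for event in PRIORITY_ORDER:
        if event in pending:
            return event
    return None
```

For example, if a move and a software operation are both ready, arbitrate({"move", "software_op"}) selects the move, and the software operation waits.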
[0297] Early QS Full Detection
[0298] The PMU implements a mechanism to aid in flow control between the ASIC (see element 203 in FIG. 2) and the XCaliber processor. Part of this mechanism is to detect that the QS is becoming full and, in this case, a LessThanXpacketIdEntriesInt interrupt is generated to the SPU. The software can enable this interrupt by specifying (in an IntIfLessThanXpacketIdEntries configuration register) a number z larger than 0. An interrupt is generated when 256-y<z, where y is the total number of packets currently in process in XCaliber. When z=0, the interrupt will never occur.
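The interrupt condition can be checked with a one-line predicate. The Python sketch below is illustrative only; the function name is hypothetical, y is the number of packets currently in process, and z is the configured threshold.

```python
QS_ENTRIES = 256  # total number of packet-identifier entries in the QS


def less_than_x_interrupt(y, z):
    """Return True when the early-full interrupt condition 256 - y < z holds."""
    if z == 0:
        return False  # z = 0 disables the interrupt
    return QS_ENTRIES - y < z
```

With z = 8, for instance, the condition first holds when 249 packets are in process, since 256 - 249 = 7 is less than 8, whereas with 248 packets in process the 8 free entries do not yet trigger it.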
[0299] Register Transfer Unit (RTU)
[0300] A goal of the RTU block is to pre-load an available context with information of packets alive in XCaliber. This information is the packetPage and packetNumber of the packet and some fields of its header. The selected context is owned by the PMU at the time of the pre-loading, and released to the SPU as soon as it has been pre-loaded. Thus, the SPU does not need to perform the costly load operations to load the header information and, therefore, the overall latency of processing packets is reduced.
[0301] The RTU receives from the QS a packet identifier (packetPage, packetNumber, and the number of the queue from which the packet comes). This identifier is created partly by the PMMU as a result of a new packet arriving to XCaliber through the network input interface (packetPage), and partly by the QS when the packetPage and device identifier are enqueued (packetNumber).
[0302] Another function of the RTU is to execute masked load/store instructions dispatched by the SPU core since the logic to execute a masked load/store instruction is similar to the logic to perform a pre-load. Therefore, the hardware can be shared for both operations. For this reason, the RTU performs either a masked load/store or a pre-load, but not both, at a time. The masked load/store instructions arrive to the RTU through the command queue (CU) block.
[0303] Context States
[0304] A context can be in one of two states: PMU-owned or SPU-owned. The ownership of a context changes when the current owner releases the context. The PMU releases a context to the SPU in three cases. Firstly, when the RTU has finished pre-loading the information of the packet into the context. Secondly, when the SPU requests a context from the RTU; in this case, the RTU will release a context if it has one available for releasing. Thirdly, when all eight contexts are PMU-owned. Note that a context being pre-loaded is considered to be a PMU-owned context.
[0305] The SPU releases a context to the RTU when the SPU dispatches an XStream RELEASE instruction.
[0306] Pre-loading a Context