United States Patent Application 20040148382
Kind Code A1
Narad, Charles E.; et al. July 29, 2004

Language for handling network packets
Abstract
The present invention relates to a general-purpose programmable packet-processing platform for accelerating network infrastructure applications which have been structured so as to separate the stages of classification and action. Network packet classification, execution of actions upon those packets, management of buffer flow, encryption services, and management of Network Interface Controllers are accelerated through the use of a multiplicity of specialized modules. A language interface is defined for specifying both stateless and stateful classification of packets and to associate actions with classification results in order to efficiently utilize these specialized modules.

Inventors: Narad; Charles E. (Santa Clara, CA), Fall; Kevin (Berkeley, CA), MacAvoy; Neil (Redwood City, CA), Shankar; Pradip (Fremont, CA), Rand; Leonard M. (San Francisco, CA), Hall; Jerry J. (Santa Clara, CA)
Correspondence Name and Address: BLAKELY SOKOLOFF TAYLOR & ZAFMAN
12400 WILSHIRE BOULEVARD, SEVENTH FLOOR
LOS ANGELES
CA
90025
US
Series Code: 748311
Filed: December 29, 2003
U.S. Current Class: 709/223
U.S. Class at Publication: 709/223
Intern'l Class: G06F 015/173

Claims


What is claimed is:
1. A method of checking cumulative status of a plurality of arithmetic operations, the method comprising: initializing a first condition code to a first value; performing the plurality of arithmetic operations, a result of at least one of the plurality of arithmetic operations being capable of indicating whether a criterion is met; if the result of at least one of the plurality of arithmetic operations indicates the criterion is met, then initializing the first condition code to a second value; keeping the first condition code unchanged for a remainder of the plurality of arithmetic operations once the first condition code is initialized to the second value; and performing a test on the first condition code, wherein a status of the first condition code indicates a cumulative status of the performed plurality of arithmetic operations.

2. The method of claim 1 wherein the criterion is an item selected from a list comprising the result being non-zero, the result being zero, the result being greater than zero, and the result being less than zero.

3. The method of claim 1 wherein the first condition code is initialized by a non-arithmetic operation.

4. The method of claim 1 wherein the plurality of arithmetic operations are selected from a group comprising a comparison operation and a subtract operation.

5. The method of claim 1 wherein the result of at least one of the plurality of arithmetic operations returns an item selected from a list comprising data equal non-zero, data equal zero, data greater than zero, and data less than zero.

6. The method of claim 1 wherein the first value is non-zero.

7. The method of claim 1 wherein the second value is a zero.

8. An apparatus to check cumulative status of a plurality of arithmetic operations, the apparatus comprising: first initializing means to initialize a first condition code to a first value; processing means to perform the plurality of arithmetic operations, a result of at least one of the plurality of arithmetic operations being capable of indicating whether a criterion is met; second initializing means to initialize the first condition code to a second value; and test means to perform a test on the first condition code, wherein the first condition code remains unchanged for a remainder of the plurality of arithmetic operations once the first condition code is initialized to the second value.

9. The apparatus of claim 8 wherein the first condition code is initialized by a non-arithmetic operation.

10. The apparatus of claim 8 wherein the plurality of arithmetic operations are selected from a group comprising a comparison operation and a subtract operation.

11. The apparatus of claim 8 wherein the first value is non-zero.

12. The apparatus of claim 8 wherein the second value is a zero.

13. A system comprising: at least one classification engine to classify a selected portion of a plurality of packets; and an apparatus to check cumulative status of a plurality of arithmetic operations comprising: a first facility to initialize a first condition code to a first value; a second facility to perform the plurality of arithmetic operations, a result of at least one of the plurality of arithmetic operations being capable of indicating whether a criterion is met; and a third facility to initialize the first condition code to a second value if the result of at least one of the plurality of arithmetic operations indicates the criterion is met.

14. The system of claim 13 further including a fourth facility to perform a test on the first condition code.

15. The system of claim 13 wherein once the first condition code is initialized to the second value the first condition code remains unchanged for a remainder of the plurality of arithmetic operations.

16. The system of claim 13 wherein the plurality of arithmetic operations are selected from a group comprising a comparison operation and a subtract operation.

17. The system of claim 13 wherein the first value is non-zero.

18. The system of claim 13 wherein the second value is a zero.

19. The system of claim 13 wherein the classification engine includes a micro-programmed processor.

20. The system of claim 19 wherein the micro-programmed processor selectively processes the selected portion of the plurality of packets by performing thereon at least a subset of packet-based operations including packet header parsing and table lookups.

21. The system of claim 20 wherein the table lookups utilize hash tables.

22. The system of claim 13 wherein a classified packet is returned to the classification engine to be reclassified.

23. The system of claim 13 wherein the classification engine receives a plurality of classification policies to indicate how the classification engine classifies a packet based on select information from a group comprising packet header parsing and table lookups.

24. The system of claim 23 wherein the classification policies are supplied dynamically from an application processor.

25. The system of claim 13 further including an application processor having a host interface.

26. The system of claim 13 further comprising a plurality of data buffers to store data utilized by the system.

27. The system of claim 13 further including an embedded processor to provide processing capabilities to the system.

28. A machine-readable medium that provides instructions which, when executed by a machine, cause the machine to perform operations comprising: initializing a first condition code to a first value; performing a plurality of arithmetic operations, a result of at least one of the plurality of arithmetic operations being capable of indicating whether a criterion is met; if the result of at least one of the plurality of arithmetic operations indicates the criterion is met, then initializing the first condition code to a second value; and keeping the first condition code unchanged for a remainder of the plurality of arithmetic operations once the first condition code is initialized to the second value.

29. The medium of claim 28 further performing a test on the first condition code.

30. The medium of claim 28 wherein a status of the first condition code indicates a cumulative status of the performed plurality of arithmetic operations.

31. The medium of claim 28 wherein the plurality of arithmetic operations are selected from a group comprising a comparison operation and a subtract operation.

32. The medium of claim 28 wherein the first value is non-zero.

33. The medium of claim 28 wherein the second value is a zero.

34. An apparatus to check cumulative status of a plurality of arithmetic operations comprising: a first facility to initialize a first condition code to a first value; a second facility to perform the plurality of arithmetic operations, a result of at least one of the plurality of arithmetic operations being capable of indicating whether a criterion is met; a third facility to initialize the first condition code to a second value if the result of at least one of the plurality of arithmetic operations indicates the criterion is met; and a fourth facility to perform a test on the first condition code, wherein once the first condition code is initialized to the second value the first condition code remains unchanged for a remainder of the plurality of arithmetic operations.

35. The apparatus of claim 34 wherein the plurality of arithmetic operations are selected from a group comprising a comparison operation and a subtract operation.

36. The apparatus of claim 34 wherein the first value is non-zero.

37. The apparatus of claim 34 wherein the second value is a zero.

38. The apparatus of claim 34 further including a micro-programmed processor.

39. The apparatus of claim 34 further including an application processor having a host interface.

40. The apparatus of claim 34 further comprising a plurality of data buffers to store data utilized by the apparatus.

41. The apparatus of claim 34 further including an embedded processor to provide processing capabilities to the apparatus.

42. A method of checking cumulative status of a plurality of arithmetic operations, the method comprising: initializing a first condition code to a first value; performing the plurality of arithmetic operations, a result of at least one of the plurality of arithmetic operations being capable of indicating whether a criterion is met; if the result of at least one of the plurality of arithmetic operations indicates the criterion is met, then initializing the first condition code to a second value; and performing a test on the first condition code, wherein a status of the first condition code indicates a cumulative status of the performed plurality of arithmetic operations.

43. The method of claim 42 wherein once the first condition code is initialized to the second value the first condition code remains unchanged for a remainder of the plurality of arithmetic operations.

44. The method of claim 42 wherein the criterion is an item selected from a list comprising the result being non-zero, the result being zero, the result being greater than zero, and the result being less than zero.

45. The method of claim 42 wherein the first condition code is initialized by a non-arithmetic operation.

46. The method of claim 42 wherein the plurality of arithmetic operations are selected from a group comprising a comparison operation and a subtract operation.

47. The method of claim 42 wherein the result of at least one of the plurality of arithmetic operations returns an item selected from a list comprising data equal non-zero, data equal zero, data greater than zero, and data less than zero.

48. The method of claim 42 wherein the first value is non-zero.

49. The method of claim 42 wherein the second value is a zero.

50. An apparatus to check cumulative status of a plurality of arithmetic operations, the apparatus comprising: a first initializer to initialize a first condition code to a first value; a processor to perform the plurality of arithmetic operations, a result of at least one of the plurality of arithmetic operations being capable of indicating whether a criterion is met; a second initializer to initialize the first condition code to a second value; and a tester to perform a test on the first condition code, wherein the first condition code remains unchanged for a remainder of the plurality of arithmetic operations once the first condition code is initialized to the second value.

51. The apparatus of claim 50 wherein the first condition code is initialized by a non-arithmetic operation.

52. The apparatus of claim 50 wherein the plurality of arithmetic operations are selected from a group comprising a comparison operation and a subtract operation.

53. The apparatus of claim 50 wherein the first value is non-zero.

54. The apparatus of claim 50 wherein the second value is a zero.

55. An apparatus to check cumulative status of a plurality of arithmetic operations comprising: a first initializer to initialize a first condition code to a first value; a first circuit to perform the plurality of arithmetic operations, a result of at least one of the plurality of arithmetic operations being capable of indicating whether a criterion is met; a second initializer to initialize the first condition code to a second value if the result of at least one of the plurality of arithmetic operations indicates the criterion is met; and a second circuit to perform a test on the first condition code, wherein once the first condition code is initialized to the second value the first condition code remains unchanged for a remainder of the plurality of arithmetic operations.

56. The apparatus of claim 55 wherein the plurality of arithmetic operations are selected from a group comprising a comparison operation and a subtract operation.

57. The apparatus of claim 55 wherein the first value is non-zero.

58. The apparatus of claim 55 wherein the second value is a zero.

59. The apparatus of claim 55 further including a micro-programmed processor.

60. The apparatus of claim 59 wherein the first and second circuits utilize the micro-programmed processor to perform their tasks.

61. The apparatus of claim 55 further including an application processor having a host interface.

62. The apparatus of claim 55 further comprising a plurality of data buffers to store data utilized by the apparatus.

63. The apparatus of claim 55 further including an embedded processor to provide processing capabilities to the apparatus.

64. The apparatus of claim 63 wherein the first and second circuits utilize the embedded processor to perform their tasks.
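
The method of claim 1 can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the claimed implementation: the class name StickyConditionCode, its methods, and the helper fields_match are hypothetical; the comparison is modeled as a subtraction (claim 4); the criterion is taken to be a non-zero result (claim 2); and the first and second values are taken to be 1 and 0 (claims 6 and 7).

```python
class StickyConditionCode:
    """Hypothetical model of the 'first condition code' described in claim 1."""

    def __init__(self, first_value=1, second_value=0):
        self.first_value = first_value    # assumed non-zero (claim 6)
        self.second_value = second_value  # assumed zero (claim 7)
        self.value = first_value          # set by plain assignment, not arithmetic

    def compare(self, a, b, criterion=lambda result: result != 0):
        """Model a comparison as a subtraction; update the code if the criterion is met."""
        result = a - b
        if self.value != self.second_value and criterion(result):
            self.value = self.second_value
        # Once the code holds the second value, no later operation changes it.
        return result

    def test(self):
        """Return True if no operation so far has met the criterion."""
        return self.value == self.first_value


def fields_match(fields, expected):
    """Compare several fields, then test the condition code once at the end."""
    cc = StickyConditionCode()
    for got, want in zip(fields, expected):
        cc.compare(got, want)
    return cc.test()
```

With this sketch, fields_match([6, 80, 443], [6, 80, 443]) returns True, while a mismatch in any position sets the code to the second value and the single final test returns False. No branch is needed after each individual comparison.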

Description



FIELD OF THE INVENTION

[0001] The present invention relates to computer networks and, more particularly, to a general purpose programmable platform for acceleration of network infrastructure applications.

BACKGROUND OF THE INVENTION

[0002] Computer networks have become a key part of the corporate infrastructure. Organizations have become increasingly dependent on intranets and the Internet and are demanding much greater levels of performance from their network infrastructure. The network infrastructure is being viewed: (1) as a competitive advantage; (2) as mission critical; (3) as a cost center. The infrastructure itself is transitioning from 10 Mb/s (megabits per second) capability to 100 Mb/s capability. Soon, infrastructure capable of 1 Gb/s (gigabits per second) will start appearing on server connections, trunks and backbones. As more and more computing equipment gets deployed, the number of nodes within an organization has also grown. There has been a doubling of users, and a ten-fold increase in the amount of traffic every year.

[0003] Network infrastructure applications monitor, manage and manipulate network traffic in the fabric of computer networks. The high demand for network bandwidth and connectivity has led to tremendous complexity and performance requirements for this class of application. Traditional methods of dealing with these problems are no longer adequate.

[0004] Several sophisticated software applications that provide solutions to the problems encountered by the network manager have emerged. The main areas for such applications are Security, Quality of Service (QoS)/Class of Service (CoS) and Network Management. Examples are: Firewalls; Intrusion Detection; Encryption; Virtual Private Networks (VPN); enabling services for ISPs (load balancing and such); Accounting; Web billing; Bandwidth Optimization; Service Level Management; Commerce; Application Level Management; Active Network Management.

[0005] There are three conventional ways in which these applications are deployed:

[0006] (1) On general purpose computers.

[0007] (2) Using single function boxes.

[0008] (3) On switches and routers.

[0009] It is instructive to examine the issues related to each of these deployment techniques.

1. General Purpose Computers

[0010] General purpose computers, such as PCs running NT/Windows or workstations running Solaris/HP-UX, etc., are a common method for deploying network infrastructure applications. The typical configuration consists of two or more network interfaces, each providing a connection to a network segment. The application runs on the main processor (Pentium/SPARC etc.) and communicates with the Network Interface Controller (NIC) card either through (typically) the socket interface or (in some cases) a specialized driver "shim" in the operating system (OS). The "shim" approach allows access to "raw" packets, which is necessary for many of the packet oriented applications. Applications that are end-point oriented, such as proxies, can interface to the top of the IP (Internet Protocol) or other protocol stack.

[0011] The advantages of running the application on a general purpose computer include: a full development environment; all the OS services (IPC, file system, memory management, threads, I/O etc); low cost due to ubiquity of the platform; stability of the APIs; and assurance that performance will increase with each new generation of the general purpose computer technology.

[0012] There are, however, many disadvantages of running the application on a general purpose computer. First, the I/O subsystem on a general purpose computer is optimized to provide a standard connection to a variety of peripherals at reasonable cost and, hence, reasonable performance. 32-bit/33 MHz PCI ("Peripheral Component Interconnect", the dominant I/O connection on common general purpose platforms today) has an effective bandwidth in the 50-75 MB/s range. While this is adequate for a few interfaces to high performance networks, it does not scale. Also, there is significant latency involved in accesses to the card. Therefore, any kind of non-pipelined activity results in a significant performance impact.
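
The scaling limit described above can be checked with a short calculation. The sketch below is illustrative only: the function name bus_load_mb_per_s is hypothetical, the bus clock is rounded to 33 MHz, and it is assumed that a forwarding application moves every received packet across the bus twice (NIC to memory, then memory to NIC). Only the 50-75 MB/s effective range comes from the text.

```python
BUS_WIDTH_BITS = 32
BUS_CLOCK_HZ = 33_000_000  # nominal clock, rounded

# Theoretical peak: 4 bytes per clock cycle.
PEAK_MB_PER_S = BUS_WIDTH_BITS / 8 * BUS_CLOCK_HZ / 1e6

# Effective range quoted in the text.
EFFECTIVE_MB_PER_S = (50, 75)


def bus_load_mb_per_s(ports, link_mbit_per_s, crossings=2):
    """Bus traffic in MB/s when every port receives at line rate.

    `crossings` is the assumed number of times each packet crosses the bus.
    """
    return ports * link_mbit_per_s / 8 * crossings


def fits_on_bus(ports, link_mbit_per_s):
    """True if the load stays within the low end of the effective range."""
    return bus_load_mb_per_s(ports, link_mbit_per_s) <= EFFECTIVE_MB_PER_S[0]
```

Under these assumptions, two Fast Ethernet ports receiving at line rate already generate about 50 MB/s of bus traffic, the low end of the effective range, and a single 1 Gb/s port would need about 250 MB/s, which exceeds even the 132 MB/s theoretical peak.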

[0013] Another disadvantage is that general purpose computers do not typically have good interrupt response time and context switch characteristics (as opposed to real-time operating systems used in many embedded applications). While this is not a problem for most computing environments, it is far from ideal for a network infrastructure application. Network infrastructure applications have to deal with network traffic operating at increasingly higher speeds, with less time between packets. Short interrupt response times and short context switch times are essential.

[0014] Another disadvantage is that general purpose platforms do not have any specialized hardware that assists with network infrastructure applications. With rare exception, none of the instruction sets for general purpose computers are optimized for network infrastructure applications.

[0015] Another disadvantage is that, on a general purpose computer, typical network applications are built on top of the TCP/IP stack. This severely limits the packet processing capability of the application.

[0016] Another disadvantage is that packets need to be pulled into the processor cache for processing. Cache fills and write backs become a severe bottleneck for high bandwidth networks.

[0017] Finally, general purpose platforms use general purpose operating systems (OS's). These operating systems are generally not known for having quick reboots on power-cycle or other wiring-closet appliance oriented characteristics important for network infrastructure applications.

2. Fixed-Function Appliances

[0018] There are a couple of different ways to build single function appliances. The first way is to take a single board computer, add in a couple of NIC cards, and run an executive program on the main processor. This approach avoids some of the problems that a general purpose OS brings, but the performance is still limited to that of the base platform architecture (as described above).

[0019] A way to enhance the performance is to build special purpose hardware that performs functions required by the specific application very well. From a performance standpoint, this can be a very good approach.

[0020] There are, however, a couple of key issues with special function appliances. For example, they are not expandable by their very nature. If the network manager needs a new application, he/she will need to procure a new appliance. Contrast this with loading a new application on a desktop PC. In the case of a PC, a new appliance is not needed with every new application. Finally, if the solution is not completely custom, it is unlikely that the solution is scalable. Using a PC or other single board computer as the packet processor for each location at which that application is installed is not cost-effective.

3. Switches and Routers

[0021] Another approach is to deploy a scaled down version of an application on switches and routers which comprise the fabric of the network. The advantages of this approach are that: (1) no additional equipment is required for the deployment of the application; and (2) all of the segments in a network are visible at the switches.

[0022] There are a number of problems with this approach.

[0023] One disadvantage is that the processing power available at a switch or router is limited. Typically, this processing power is dedicated to the primary business of the switch/router--switching or routing. When significant applications have to be run on these switches or routers, their performance drops.

[0024] Another disadvantage is that not all nodes in a network need to be managed in the same way. Putting significant processing power on all the ports of a switch or router is not cost-effective.

[0025] Another disadvantage is that, even if processing power became so cheap as to be deployed freely at every port of a switch or router, a switch or router is optimized to move frames/packets from port to port. It is not optimized to process packets for applications.

[0026] Another disadvantage is that a typical switch or router does not provide the facilities that are necessary for the creation and deployment of sophisticated network infrastructure applications. The services required can be quite extensive and porting an application to run on a switch or router can be very difficult.

[0027] Finally, replacing existing network switching equipment with new versions that support new applications can be difficult. It is much more effective to "add applications" to the network where needed.

[0028] What is needed is an optimized platform for the deployment of sophisticated software applications in a network environment.

SUMMARY

[0029] The present invention relates to a general-purpose programmable packet processing platform for accelerating network infrastructure applications which have been structured so as to separate the stages of classification and action. A wide variety of embodiments of the present invention are possible and will be understood by those skilled in the art based on the present patent application. In certain embodiments, acceleration is achieved by one or more of the following:

[0030] Dividing the steps of packet processing into a multiplicity of pipeline stages and providing different functional units for different stages, thus allowing more processing time per packet and also providing concurrency in the processing of multiple packets,

[0031] Providing custom, specialized Classification Engines which are micro-programmed processors optimized for the various functions common in predicate analysis and table searches for these sorts of applications, and are each used as pipeline stages in different flows,

[0032] Providing a general-purpose microprocessor for executing the arbitrary actions desired by these applications,

[0033] Providing a tightly-coupled encryption coprocessor to accelerate common network encryption functions,

[0034] Reducing or eliminating the need for the applications to examine the actual contents of the packet, thus minimizing the movement of packet data and the effects of that data movement on the processor's cache/bus/memory subsystem, and

[0035] Either eliminating or providing special hardware to accelerate system overheads common to embedded network applications run on general purpose platforms; this includes special support for managing buffer pools, for communication among units and the passing of buffers between them, and for managing the network interface MACs (media access controllers) without the need for heavyweight device driver programs.

[0036] Recognizing a common policy enforcement module for network infrastructure applications
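
The benefit of dividing packet processing into pipeline stages, as listed above, can be made concrete with a short calculation. The sketch below is illustrative and rests on assumptions not stated in the text: standard Ethernet framing (64-byte minimum frame, 8 bytes of preamble and start-of-frame delimiter, 12-byte inter-frame gap) and an idealized pipeline whose stages are balanced and whose hand-off between stages is free. The function names are hypothetical.

```python
MIN_FRAME_BYTES = 64        # minimum Ethernet frame, including FCS
PREAMBLE_BYTES = 8          # preamble plus start-of-frame delimiter
INTERFRAME_GAP_BYTES = 12   # minimum gap between frames


def packet_slot_seconds(link_bits_per_s):
    """Shortest possible time between packet arrivals on one link."""
    wire_bits = (MIN_FRAME_BYTES + PREAMBLE_BYTES + INTERFRAME_GAP_BYTES) * 8
    return wire_bits / link_bits_per_s


def pipeline_budget(link_bits_per_s, stages):
    """Processing time available per packet in an idealized pipeline."""
    slot = packet_slot_seconds(link_bits_per_s)
    return {
        "per_stage_s": slot,                 # each stage must finish within one slot
        "packets_in_flight": stages,         # one packet per stage at a time
        "total_per_packet_s": slot * stages  # work that can be spent on each packet
    }
```

At 100 Mb/s the slot is 6.72 microseconds, or roughly 148,810 minimum-size packets per second. A single processing unit must finish all work within that slot, whereas a four-stage pipeline under these assumptions can spend about 26.9 microseconds on each packet while still accepting a new packet every slot.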

[0037] Certain specific embodiments are implemented with one or more of the following features:

[0038] a policy enforcement module consisting of Classification and associated Action

[0039] both stateless classification and stateful classification which uses sets

[0040] Provision of a high level interface to packet level Classification and Action (Action and Classification Engine--ACE)

[0041] Provision of the high level interface within common operating environments

[0042] Policy can be changed dynamically

[0043] Application partitioned into an AP module running on the AP (Application Processor) and a PE (Policy Engine) module running on the PE

[0044] AP can run operating systems with full services to facilitate application development

[0045] PE functionality embodied as software running on AP as well as hardware and software running on the hardware PE

[0046] A language interface to describe Classification and to associate Actions with the results of the Classification

[0047] Language (NetBoost Classification Language--NCL) for Classification/Action

[0048] Object oriented (extensible)

[0049] Specific to Classification and hence very simple

[0050] Built-in intrinsics such as checksum

[0051] Language constructs make it easy to describe layered protocols and protocol fields

[0052] Rule construct to associate Classification and Actions

[0053] Predicate construct which is a function of packet contents at any layer of any protocol and/or of hash search results

[0054] Set construct to describe hash tables and multiple searches on the same hash table

[0055] Action code

[0056] Written in high level language

[0057] Complex packet processing possible

[0058] Can avail of Application Services Library (ASL) providing services useful for packet processing

[0059] ASL consists of packet management, memory management, time and event management, link level services, packet timestamp service, cryptographic services, communication services to AP module plus extensions

[0060] TCP/IP extensions include services such as Network Address Translation (NAT) for IP, TCP and UDP, Checksums, IP fragment reassembly and TCP segment reassembly

[0061] System components include

[0062] library implementing API (DLL under Windows NT)

[0063] a management process called Resolver

[0064] an incremental compiler for NCL

[0065] linker for NCL code

[0066] dynamic linker for action code; operating-system specific drivers which communicate with both hardware and software PEs

[0067] software Policy Engine that executes Classification and Action code; ASL for Action code

[0068] management services (Resolver and Plumber) for both application developer and the end-user

[0069] development environment for AP and PE code including compilers, and other software development tools familiar to those skilled in the art

[0070] ACE

[0071] C++ object which abstracts the packet processing associated with an application or sub-application

[0072] Provides a context for Classification and Action

[0073] Contains one or more Target objects, including drop and default, which represent packet destinations

[0074] Provides a context for upcalls and downcalls between the AP and the PE modules

[0075] Targets of an ACE are connected to other ACEs or interfaces using the Plumber (graphical and programmatic interfaces) to specify the serialization of ACE processing

[0076] Operating environment for action code

[0077] Invokes actions automatically when associated classification succeeds

[0078] Implements an ACE context

[0079] Low overhead (soft real-time) environment

[0080] Handles communication between AP and PE

[0081] Performs dynamic linking of action code when ACEs are loaded with new Classification code

[0082] Resolver

[0083] Maintains namespace of applications, interfaces and ACEs

[0084] Maps ACEs to PEs automatically

[0085] Contains the compiler for NCL and does dynamic compilation of NCL

[0086] Provides the interfaces for management of applications, ACEs and interfaces

[0087] Compiler for NCL

[0088] Generates code for multiple processors (AP and PE)

[0089] Allows incremental compilation of rules

[0090] Plumber

[0091] Allows interconnection of ACEs

[0092] Allows binding to interfaces

[0093] Supports secure remote access
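
The relationship among the rule, predicate, set, target and ACE features listed above can be sketched in a few lines of Python. This is a hypothetical model for illustration, not NCL syntax and not the ACE C++ interface: every class, function and field name below is an assumption, a Python set stands in for the hash-table-backed set construct, and the connection between the two ACEs stands in for what the Plumber would configure.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

Packet = Dict[str, object]  # hypothetical parsed packet: field name -> value


@dataclass
class Rule:
    """Associates a classification predicate with an action."""
    name: str
    predicate: Callable[[Packet], bool]
    action: Callable[[Packet], str]  # returns the name of a target


@dataclass
class Ace:
    """Context for classification and action, with named targets."""
    name: str
    rules: List[Rule] = field(default_factory=list)
    # A target maps to the next ACE, or to None for an interface or a discard.
    targets: Dict[str, Optional["Ace"]] = field(
        default_factory=lambda: {"drop": None, "default": None})

    def process(self, pkt: Packet) -> str:
        for rule in self.rules:
            if rule.predicate(pkt):      # classification succeeded
                return rule.action(pkt)  # action is invoked automatically
        return "default"


def run(ace: Ace, pkt: Packet):
    """Follow targets through connected ACEs; return (final ACE name, target)."""
    while True:
        target = ace.process(pkt)
        next_ace = ace.targets.get(target)
        if next_ace is None:
            return ace.name, target
        ace = next_ace


# Stateful classification: a set of sources that have attempted telnet.
blocked_sources = set()


def block_source(pkt: Packet) -> str:
    blocked_sources.add(pkt["src"])
    return "drop"


telnet_filter = Ace("telnet_filter", rules=[
    Rule("known_offender", lambda p: p["src"] in blocked_sources, lambda p: "drop"),
    Rule("telnet", lambda p: p["dport"] == 23, block_source),
])
logger = Ace("logger")  # no rules: every packet falls through to "default"
telnet_filter.targets["default"] = logger  # connection the Plumber would make
```

In this sketch the first rule is stateful, because its predicate depends on the contents of a set that an earlier action updated, while the second rule is stateless, because it depends only on the current packet. Packets that match neither rule leave through the default target and are processed by the next ACE in the cascade.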

BRIEF DESCRIPTION OF THE DRAWINGS

[0094] FIG. 1 is a block diagram of a system in accordance with the present invention.

[0095] FIG. 2 is a block diagram showing packet flow according to an embodiment of the present invention.

[0096] FIG. 3 is a Policy Engine ASIC block diagram according to the present invention.

[0097] FIG. 4 is a sample system-level block diagram related to the present invention.

[0098] FIG. 5 shows a ring array in memory related to the present invention.

[0099] FIG. 6 shows an RX Ring Structure related to the present invention.

[0100] FIG. 7 shows a receive buffer format related to the present invention.

[0101] FIG. 8 shows a TX Ring Structure related to the present invention.

[0102] FIG. 9 shows a transmit buffer format related to the present invention.

[0103] FIG. 10 shows a reclassify ring structure related to the present invention.

[0104] FIG. 11 shows a Crypto Ring and COM[4:0] Rings Structure related to the present invention.

[0105] FIG. 12 shows a DMA Ring Structure related to the present invention.

[0106] FIG. 13 is a classification engine block diagram related to the present invention.

[0107] FIG. 14 is a pipeline timing diagram for the classification engine related to the present invention.

[0108] FIG. 15 is an application structure diagram related to the present invention.

[0109] FIG. 16 is a diagram showing an Action Classification Engine (ACE) related to the present invention.

[0110] FIG. 17 shows a cascade of ACEs related to the present invention.

[0111] FIG. 18 shows a system architecture related to the present invention.

[0112] FIG. 19 shows an application deploying six ACEs related to the present invention.

DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS

[0113] Network infrastructure applications generally contain both time-critical and non-time-critical sections. The non-time-critical sections generally deal with setup, configuration, user interface and policy management. The time-critical sections generally deal with policy enforcement. The policy enforcement piece generally has to run at network speeds. The present invention pertains to an efficient architecture for policy enforcement that enables application of complex policy at network rates.

[0114] FIG. 1 shows a Network Infrastructure Application, called Application 2, being deployed on an Application Processor (AP) 4 running a standard operating system. The policy enforcement section of the Application 2, called Wire Speed Policy 3, runs on the Policy Engine (PE) 6. The Policy Engine 6 transforms the inbound Packet Stream 8 into the outbound Packet Stream 10 per the Wire Speed Policy 3. Communication from the Application Processor 4 to the Policy Engine 6, in addition to the Wire Speed Policy 3, consists of control, policy modifications and packet data as desired by the Application 2. Communication from the Policy Engine 6 to the Application Processor 4 consists of status, exception conditions and packet data as desired by the Application 2.

[0115] In a preferred embodiment of a Policy Engine (PE) according to the present invention, the PE provides a highly programmable platform for classifying network packets and implementing policy decisions about those packets at wire speed. Certain embodiments provide two Fast Ethernet ports and implement a pipelined dataflow architecture with store-and-forward. Packets are run through a Classification Engine (CE) which executes a programmed series of hardware assist operations such as chained field comparisons and generation of checksums and hash table pointers, then are handed to a microprocessor ("Policy Processor" or PP) for execution of policy decisions such as Pass, Drop, Enqueue/Delay, (de/en)capsulate, and (de/en)crypt based on the results from the CE. Some packets which require higher level processing may be sent to the host computer system ("Application Processor" or AP). (See FIG. 4.) An optional cryptographic ("Crypto") Processor is provided for accelerating such functions as encryption and key management.

[0116] Third-party applications such as firewalls, rate shaping, QoS/CoS, network management and others can be implemented to take advantage of this three-tiered approach to filtering packets. Support for easy encapsulation without copies combined with encryption support allows for VPNs ("Virtual Private Networks") and other applications that require security services.

[0117] A large parity-protected synchronous DRAM (SDRAM) buffer memory is provided, along with a PCI interface that is used for communication with the host (AP) and potentially for peer-to-peer communication among Policy Engines, e.g. for applications which route and switch.

[0118] In certain embodiments the Policy Engine ASIC can be used on a PCI card both for application software development and for use in a PC or workstation as a two interface product, and can also be used in a multiple-segment appliance with a plurality of PE's along with an embedded Application Processor for a stand-alone product.

[0119] In certain embodiments, when used in an appliance, the PE's reside on PCI segments connected together through a plurality of PCI-to-PCI bridges which connect to the host PCI bus on the Application Processor. The PCI bus is 64-bit for all agents in order to provide sufficient bandwidth for applications which route or switch.

[0120] A sample system level block diagram is shown in FIG. 4.

[0121] FIG. 4 shows an application processor 302 which contains a host interface 304 to a PCI bus 324. Fanout of the PCI bus 324 to a larger number of loads is accomplished with PCI-to-PCI Bridge devices 306, 308, 310, and 312; each of those controls an isolated segment on a "child" PCI bus 326, 328, 330, and 332 respectively. On three of these isolated segments 326, 328, and 330 are a number of Policy Engines 322; each Policy Engine 322 connects to two Ethernet ports 320 which connect the Policy Engine 322 to a network segment.

[0122] One of the PCI-to-PCI Bridges 312 controls child PCI bus 332, which provides the Application Processor 302 with connection to standard I/O devices 314 and optionally to PCI expansion slots 316 into which additional PCI devices can be connected.

[0123] In a smaller configuration of the preferred embodiment of the invention the number of Policy Engines 322 does not exceed the maximum load allowed on a PCI bus 324; in that case the PCI-to-PCI bridges 306, 308, and 310 are eliminated and up to four Policy Engines 322 are connected directly to the host PCI bus 324, each connecting also to two Ethernet ports 320. This smaller configuration may still have the PCI-to-PCI Bridge 312 present to isolate Local I/O 314 and expansion slots 316 from the PCI bus 324, or the Bridge 312 may also be eliminated and the devices 314 and expansion 316 may also be connected directly to the host PCI bus 324.

I. Packet Flow

[0124] In certain embodiments, the PE utilizes two Fast Ethernet MAC's (Media Access Controllers) with IEEE 802.3 standard Media Independent Interface ("MII") connections to external physical media (PHY) devices which attach to Ethernet segments. Each Ethernet MAC receives packets into buffers addressed by buffer pointers obtained from a producer-consumer ring and then passes the buffer (that is, passes the buffer pointer) to a Classification Engine for processing, and from there to the Policy Processor. The "buffer pointer" is a data structure comprising the address of a buffer and a software-assigned "tag" field containing other information about that buffer. The "buffer pointer" is a fundamental unit of communication among the various hardware and software modules comprising a PE. From the PP, there are many paths the packet can take, depending on what the application(s) running on the PP decide is the proper disposition of that packet. It can be transmitted, sent to Crypto, delayed in memory, passed through a Classification Engine again for further processing, or copied from the PE's memory over the PCI bus to the host's memory or to a peer device's memory, using the DMA engine. The PP may also gather statistics on that packet into records in a hash table or in general memory. A pointer to the buffer containing both the packet and data structures describing that packet is passed around among the various modules.

[0125] The PP may choose to drop a packet, to modify the contents of the packet, or to forward the packet to the AP or to a different network segment over the PCI Bus (e.g. for routing). The AP or PP can create packets of its own for transmission. A third-party NIC (Network Interface Card) on the PCI bus can use the PE memory for receiving packets, and the PP and AP can then cooperate to feed those packets into the classification stream, effectively providing acceleration for packets from arbitrary networks. When doing so, adjacent 2 KB buffers can be concatenated to provide buffers of any size needed for a particular protocol.

[0126] FIG. 2 illustrates packet flow according to certain embodiments of the present invention. Each box represents a process which is applied to a packet buffer and/or the contents of a packet buffer 620 as shown in FIG. 7. The buffer management process involves buffer allocation 102 and the recovery of retired buffers 118. When buffer allocation 102 into an RX Ring 402 or 404 occurs, the Policy Processor 244 enqueues a buffer pointer into the RX Ring 402 or 404 and thus allocates the buffer 620 to the receive MAC 216 or 230, respectively. Upon receiving a packet, the RX MAC controller 220 or 228 uses the buffer pointer at the entry in the RX ring structure of FIG. 6 which is pointed to by MFILL 516 to identify a 2 KB section of memory 260 that it can use to store the newly received packet. This process of receiving a packet and placing it into a buffer 620 is represented by physical receive 104 in FIG. 2.

[0127] The RX MAC controller 220 or 228 increments the MFILL pointer 516 modulo ring size to signal that the buffer 620 whose pointer is in the RX Ring 402 or 404 has been filled with a new packet 610 and 612 plus receive status 600 and 602. The Ring Translation Unit 264 detects a difference between MFILL 516 and MCCONS 514 and signals to the classification engine 238 or 242, respectively, for RX Ring 402 or 404, that a newly received packet is ready for processing. The Classification Engine 238 or 242 applies Classification 106 to that packet and creates a description of the packet which is placed in the packet buffer software area 614, then increments MCCONS 514 to indicate that it has completed classification 106 of that packet. The Ring Translation Unit 264 detects a difference between MCCONS 514 and MPCONS 512 and signals to the Policy Processor 244 that a classified packet is ready for action processing 108.

[0128] The Policy Processor 244 obtains the buffer pointer from the ring location pointed to by MPCONS 512 by dequeueing that pointer from the RX Ring 402 or 404, and executes application-specific action code 108 to determine the disposition of the packet. The action code 108 may choose to send the packet to an Ethernet Transmit MAC 218 or 234 by enqueueing the buffer pointer on a TX Ring 406 or 408, respectively; the packet may or may not have been modified by the action code 108 prior to this. Alternatively the action code 108 may choose to send the packet to the attached cryptographic processor (Crypto) 246 for encryption, decryption, compression, decompression, security key management, parsing of IPSEC headers, or other associated functions; this entire bundle of functions is described by Crypto 112. Alternatively the action code 108 may choose to copy the packet to a PCI peer 322 or 314 or 316, or to the host memory 330, both paths being accomplished by the process 114 of creating a DMA descriptor as shown in Table 3 and then enqueuing the pointer to that descriptor into DMA Ring 418 by writing that pointer to DMA_PROD 1116, which triggers the DMA Unit 210 to initiate a transfer. Alternatively the action code 108 can choose to temporarily enqueue the packet for delay 110 in memory 260 that is managed by the action code 108. Finally, the action code 108 can choose to send a packet for further classification 106 on any of the Classification Engines 208, 212, 238, or 242, either because the packet has been modified or because there is additional classification which can be run on the packet which the action code 108 can command the Classification process 106 to execute via flags in the RX Status Word 600, through the buffer's software area 614, or by use of tag bits in the 32-bit buffer pointer reserved for that use.

[0129] Packets can arrive at the classification process 106 from additional sources besides physical receive 104. Classification 106 may receive a packet from the output of the Crypto processing 112, from the Application Processor 302 or from a PCI peer 322 or 314 or 316, or from the action code 108.

[0130] Packets can arrive at the action code 108 from classification 106, from the Application Processor 302, from a PCI peer 322 or 314 or 316, from the output of the Crypto processing 112, and from a delay queue 110. Additionally the action code 108 can create a packet. The disposition options for these packets are the same as those described for the receive path, above.

[0131] The Crypto processing 112 can receive a packet from the Policy Processor 244 as described above. The Application Processor 302 or a PCI peer 322 or 314 or 316 can also enqueue the pointer to a buffer onto the Crypto Ring 420 to schedule that packet for Crypto processing 112.

[0132] The TX MAC 218 or 234 transmits packets whose buffer pointers have been enqueued on the TX Ring 406 or 408, respectively. Those pointers may have been enqueued by the action code 108 running on the Policy Processor 244, by the Crypto processing 112, by the Application Processor 302, or by a PCI peer 322 or 314 or 316. When the TX MAC controller 222 or 232 has retired a buffer, either by successfully transmitting the packet it contains or by abandoning the transmit due to transmit termination conditions, it will optionally write back TX status 806 and TX Timestamp 808 if programmed to do so, then will increment MTCONS 714 to indicate that this buffer 840 has been retired. The Ring Translation Unit 264 detects that there is a difference between MTCONS 714 and MTRECOV 712 and signals to the Policy Processor 244 that the TX Ring 406 or 408 has at least one retired buffer to recover; this triggers the buffer recovery process 118, which will dequeue the buffer pointer from the TX ring 406 or 408 and either send the buffer pointer to Buffer Allocation 102 or add the recovered buffer to a software-managed free list for later use by Buffer Allocation 102.

[0133] It is also possible for a device in the PCI expansion slot 316 to play the role defined for the attached Crypto processor 246 performing crypto processing 112 via DMA 114 in this flow.

[0134] 1. Communication and Buffer Management

[0135] In certain embodiments, the buffer memory consists of 16 to 128 MB of parity-protected SDRAM. It is used for packet buffers, for code and data structures for the microprocessor, as a staging area for Classification Engine microcode loading, and for buffers used in communicating with the AP and other PCI agents. The following uses of memory are defined by the architecture of the Policy Engine:

[0136] Buffer Pointer rings for RX_MAC_A, RX_MAC_B, TX_MAC_A, TX_MAC_B (where "RX" denotes "receive", "TX" denotes "transmit", and "_A" and "_B" indicate which instance of the MAC is being described.)

[0137] A pool of 2 KB-aligned buffers used for holding packets that are being processed in this chip as well as information about those packets; larger buffers can be created by concatenating these 2 KB buffers if needed for processing larger packets from other media.

[0138] "Reclassification" pointer rings for each of the four Classification Engines; these are used to schedule packets for processing on that CE, when the classification of the packet is being scheduled by an agent other than an RX MAC.

[0139] A ring containing pointers to DMA descriptors used to schedule transfers using the DMA engine; data copies between PCI and memory in either direction are scheduled by enqueuing descriptor pointers on this ring.

[0140] A pool of memory allocated for use as DMA descriptors.

[0141] A pointer ring for scheduling packets for processing on the Crypto unit.

[0142] An area that contains instructions for the microprocessor, including the boot sequence.

[0143] An area for staging microcode to be loaded into the control store of the four Classification Engines.

[0144] Page tables for the Policy Processor MMU

[0145] 16 words dedicated to mailbox communications; writes to these words from the PCI bus also set the corresponding mailbox bit in the mailbox status register, which signals to the processor that the indicated mailbox has a new message.

[0146] A pool of 2 KB buffers that belong to the AP and are used for scheduling transmits of packets that have been handed to the AP for processing or that originate at the AP.

[0147] In addition to these uses, parts of the memory may be allocated to the applications running on the PP for storing data such as local variables, counters, hash tables and the data structures they contain, AP to PP and PP to AP application-level communications areas, external coprocessor communication and transmit buffers, etc.

[0148] The Policy Engine takes advantage of the fact that buffers are 2 KB-aligned, and has the hardware ignore the lower 11 bits of each buffer base pointer, thus enabling software to use those pointer bits as tags.
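As an illustration of the tagging scheme, the following minimal Python sketch packs a software tag into the low 11 bits of a 2 KB-aligned buffer address and separates the two again. The function names are hypothetical and chosen only for this example; the description itself specifies only that the hardware ignores the lower 11 bits.

```python
BUFFER_SIZE = 2048               # 2 KB buffers, per the description
TAG_BITS = 11                    # log2(2048): the bits the hardware ignores
TAG_MASK = (1 << TAG_BITS) - 1   # 0x7FF


def make_buffer_pointer(base_address: int, tag: int = 0) -> int:
    """Combine a 2 KB-aligned base address with a software tag (hypothetical helper)."""
    if base_address & TAG_MASK:
        raise ValueError("buffer base must be 2 KB-aligned")
    if not 0 <= tag <= TAG_MASK:
        raise ValueError("tag must fit in 11 bits")
    return base_address | tag


def buffer_base(pointer: int) -> int:
    """What the hardware effectively sees: the pointer with the low 11 bits dropped."""
    return pointer & ~TAG_MASK


def buffer_tag(pointer: int) -> int:
    """What software reads back: the low 11 bits."""
    return pointer & TAG_MASK
```

Because the tag travels inside the pointer, any module that receives the pointer through a ring receives the tag as well, with no extra memory access.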

[0149] A simple and lightweight mechanism for buffer allocation and recovery is provided. Hardware support for atomic enqueue and dequeue of buffers through producer-consumer rings, along with detection of completed (retired) buffers, enables buffer management in only a few instructions. In the realtime executive loop run on the PP, a short section is devoted to reclamation of free buffers into the free list from those rings which indicate to the PP that they have retired buffers available for recovery. The RX pools of allocated, empty buffers maintained in the RX Rings can be replenished from the freelist each time a filled, classified RX buffer is dequeued from that ring, thus maintaining the pool size. A simple linked list of buffers or other method well-known to those versed in the art can be used to implement a software-managed freelist from which to feed the pools.
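The replenishment policy described above can be sketched as follows. This is a minimal Python model under stated assumptions: the free list is a singly linked LIFO whose link word is modeled by a dictionary, a zero pointer is assumed to mean "nothing available", and the `rx_dequeue` and `rx_enqueue` callables are hypothetical stand-ins for the hardware dequeue and enqueue registers.

```python
class FreeList:
    """Software-managed free list of buffer pointers (illustrative model)."""

    def __init__(self):
        self.head = 0          # 0 is assumed to mean "list empty"
        self.next_link = {}    # stands in for a link word kept in each free buffer

    def push(self, buf: int) -> None:
        """Return a retired buffer to the list."""
        self.next_link[buf] = self.head
        self.head = buf

    def pop(self) -> int:
        """Take a free buffer off the list."""
        buf = self.head
        if buf == 0:
            raise MemoryError("free list exhausted")
        self.head = self.next_link.pop(buf)
        return buf


def service_rx(rx_dequeue, rx_enqueue, free: FreeList) -> int:
    """Dequeue one classified buffer and post a fresh empty one in its place,
    so the pool of empty buffers in the RX ring keeps a constant size."""
    filled = rx_dequeue()
    if filled:
        rx_enqueue(free.pop())
    return filled
```

Pairing every dequeue with one enqueue is the design choice that keeps the pool size fixed without any separate accounting.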

[0150] In order to support atomic enqueueing/dequeueing of buffer pointers and of DMA descriptor pointers, a standard memory-based producer/consumer ring structure is supported in hardware for many purposes (as represented by the circle-with-arrow symbols in FIG. 3). In most cases one or more of the consumers is also a producer for the next consumer, so the rings have a series of index pointers which chase each other in sequence; for example the MAC RX Rings have a Produce Pointer for the allocation of empty buffers, a MAC FILL Pointer for the RXMAC to consume empty buffers and produce full buffers, a Classification Engine Consume Pointer for the CE to consume freshly received buffers and to produce classified buffers, and a Policy Processor Consume Pointer for the PP to consume classified packets as shown in FIG. 6. The leading producer accesses the ring through an "enqueue" register, and the end consumer accesses the ring through a "dequeue" register, obviating the need for mutexes (mutual exclusion locks) or (slow) memory accesses in managing shared ring structures. Interim consumer-producers fetch a buffer pointer through a ring index, then increment that index later to signal that they have finished processing the referenced buffer and that it is available for the next consumer.

[0151] This serialized multiple-producer/multiple-consumer ring structure allows for one ring to support a compelled series of steps with much less hardware than would be required to support a separate FIFO between each producer and consumer, and eliminates the need for each consumer-producer to write pointers to the next ring; every cycle saved in a real-time system such as this can be significant.

[0152] Hardware detects when there is a difference between a producer's ring index and the ring index for the next consumer in that communication sequence, and signals to the consumer that there is at least one buffer pointer in its ring for processing; thus the presence of work to do wakes up the associated unit, implementing a dataflow architecture through the use of hardware-managed rings.
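The chasing index pointers can be modeled in a few lines of Python. The sketch below is illustrative only: the stage names follow the MFILL, MCCONS and MPCONS pointers named in the description, while the name `produce`, the class layout, and the keep-one-slot-empty overflow convention are assumptions made for the example.

```python
RING_ENTRIES = 1024  # 1K entries per ring, per the description


class RxRing:
    """One ring shared by four stages; each index chases the one before it."""

    # Stage order: PP allocates -> MAC fills -> CE classifies -> PP consumes.
    STAGES = ("produce", "mfill", "mccons", "mpcons")

    def __init__(self):
        self.entries = [0] * RING_ENTRIES
        self.index = {name: 0 for name in self.STAGES}

    def has_work(self, stage: str) -> bool:
        """Models the hardware comparison: upstream index differs from ours."""
        i = self.STAGES.index(stage)
        if i == 0:
            raise ValueError("the leading producer has no upstream stage")
        return self.index[self.STAGES[i - 1]] != self.index[stage]

    def enqueue(self, buffer_pointer: int) -> None:
        """Leading producer: store a pointer, then advance 'produce'."""
        nxt = (self.index["produce"] + 1) % RING_ENTRIES
        if nxt == self.index["mpcons"]:        # assumed full condition
            raise OverflowError("ring full")
        self.entries[self.index["produce"]] = buffer_pointer
        self.index["produce"] = nxt

    def peek(self, stage: str) -> int:
        """Interim stage fetches the pointer at its own index."""
        return self.entries[self.index[stage]]

    def advance(self, stage: str) -> None:
        """Stage signals completion; the buffer becomes visible downstream."""
        if not self.has_work(stage):
            raise RuntimeError(f"{stage}: nothing to consume")
        self.index[stage] = (self.index[stage] + 1) % RING_ENTRIES
```

Note that the buffer pointer is written once, by the leading producer; every later hand-off is a single index increment, which is the saving the description attributes to the shared ring.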

[0153] Ring overflow, underflow, and threshold conditions are detected and reported to the ring users and the PP as appropriate.

[0154] 2. Memory and Ring Translated Memory

[0155] 2.1 Memory

[0156] Main memory in the preferred embodiment consists of up to 128 MB of synchronous DRAM (SDRAM) in two DIMM's (Dual In-line Memory Modules) or one double-sided DIMM. Detecting the presence of the DIMMs and their attributes uses the standard Serial Presence Detect interface, using the SPD register to manage accesses to the serial PROM. (The same interface is used to access a serial PROM containing MAC addresses, ASIC configuration parameters, and manufacturing information.) Depending on the size of DIMM's installed, memory might not be contiguous; each socket is allocated 64 MB of address space, and will alias within that 64 MB space if a smaller DIMM is used. Alternatively one 128 MB DIMM is supported, in one socket only.
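The reason memory may not be contiguous can be seen from a small address-decoding sketch. This Python fragment is a simplified model, not the controller's actual logic: it assumes power-of-two DIMM sizes, models aliasing as a modulo on the offset within the 64 MB socket window, and does not model the single 128 MB DIMM option. The function name is hypothetical.

```python
MB = 1 << 20
SOCKET_SPAN = 64 * MB   # address space allocated to each DIMM socket


def decode(address: int, dimm_sizes: tuple) -> tuple:
    """Map a memory address to (socket, offset within the installed DIMM).

    dimm_sizes holds the size in bytes of the DIMM in socket 0 and socket 1
    (0 for an empty socket).
    """
    socket, offset = divmod(address, SOCKET_SPAN)
    if socket >= len(dimm_sizes) or dimm_sizes[socket] == 0:
        raise ValueError("address does not map to an installed DIMM")
    # A DIMM smaller than 64 MB repeats (aliases) within its 64 MB window.
    return socket, offset % dimm_sizes[socket]
```

With a 32 MB DIMM in socket 0, for example, addresses 0 and 32 MB reach the same location, and the second DIMM's storage does not begin until address 64 MB, leaving a gap in the usable range.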

[0157] 2.2 Ring Translated Memory

[0158] The pointer rings associated with various units are simply a region of memory which is accessed through a translation unit. The translation unit implements the rings as a base register (which is used to assign an arbitrary memory location to be used for the rings) plus a set of index registers which each point to an array entry relative to the base address. Reads and writes to the address associated with a particular index register actually access memory at the ring entry pointed to by that index register; that is, such accesses are indirect. Some index registers are automatically incremented after an access issued by leading producers or end consumers (for atomic enqueue and dequeue operations), while others are incremented specifically by their owner (generally an interim consumer-producer) to indicate that the referenced buffer has been processed and is now available for the next consumer down the chain. Pairs of pointers have a producer-consumer relationship, and a difference between them indicates to the consumer that there is work to do; that difference is detected in hardware and is signaled to the appropriate unit.

[0159] There are 15 rings in the preferred embodiment, each 4 KB in size (1K entries of 4 bytes each); the 60 KB array of 15 rings resides on a 64 KB boundary in memory. The base of this array is pointed to by the Ring Base Register. The rings themselves are not accessed directly; instead they appear to the users as a set of "registers" which are read or written to access the entries in memory that are pointed to by the associated index register. For addressing purposes each ring is assigned a number, which is used as an index both into the array in memory and into the Ring Translation Unit (RTU) register map.

[0160] Writes to a ring will cause the data (which is generally a buffer pointer, or in the case of the DMA Ring, a pointer to a DMA descriptor) to be stored at the location in memory pointed to by [(RingArray[Ring #])+(RTU index register used)], and then that index register is incremented modulo ring size. Reads from a ring will return the data (buffer pointer or descriptor pointer) pointed to by [(RingArray[Ring #])+(RTU index register used)]; if that register is an auto-increment register then it will increment modulo ring size after the read operation. A read attempted via a consumer index register which matches its corresponding produce pointer (that is, there was no work to do) will return zero and the index pointer will not increment. Registers which are not auto-increment are incremented explicitly by that register's owner when the referenced buffer has been processed; the increment is done via a hardware signal, not by register access.
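The indirect access rules above can be summarized in a short Python model. It is a sketch under assumptions: index registers are identified by arbitrary names supplied by the caller, memory is modeled as a dictionary keyed by byte address, and the class and method names are hypothetical. The ring geometry (15 rings of 1K four-byte entries on a 64 KB boundary) and the empty-read behavior are taken from the description.

```python
RING_COUNT = 15
RING_BYTES = 4096                          # 4 KB per ring
ENTRY_BYTES = 4                            # one 32-bit pointer per entry
RING_ENTRIES = RING_BYTES // ENTRY_BYTES   # 1024


class RingTranslationUnit:
    """Illustrative model of ring-translated reads and writes."""

    def __init__(self, ring_base: int, memory: dict):
        if ring_base % (64 * 1024):
            raise ValueError("ring array must start on a 64 KB boundary")
        self.ring_base = ring_base
        self.memory = memory     # dict: byte address -> 32-bit word
        self.index = {}          # (ring number, register name) -> entry index

    def entry_address(self, ring: int, register: str) -> int:
        """Memory address currently referenced by an index register."""
        if not 0 <= ring < RING_COUNT:
            raise ValueError("no such ring")
        idx = self.index.setdefault((ring, register), 0)
        return self.ring_base + ring * RING_BYTES + idx * ENTRY_BYTES

    def increment(self, ring: int, register: str) -> None:
        """Advance an index register modulo ring size."""
        idx = self.index.setdefault((ring, register), 0)
        self.index[(ring, register)] = (idx + 1) % RING_ENTRIES

    def write(self, ring: int, register: str, value: int) -> None:
        """Enqueue: store at the indexed entry, then auto-increment."""
        self.memory[self.entry_address(ring, register)] = value
        self.increment(ring, register)

    def read(self, ring: int, register: str, producer: str,
             auto_increment: bool = True) -> int:
        """Dequeue: return zero, without moving, when there is no work."""
        address = self.entry_address(ring, register)
        if self.index[(ring, register)] == self.index.get((ring, producer), 0):
            return 0
        value = self.memory.get(address, 0)
        if auto_increment:
            self.increment(ring, register)
        return value
```

Because the store and the index update happen inside a single register access, two masters that enqueue at the same time cannot interleave, which is why no lock is needed around the ring.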

[0161] Ring underflow/overflow and near-empty/near-full threshold status (as appropriate) are reported through the CRISIS register to the PP and the AP.

II. Policy Engine

[0162] FIG. 3 shows a Policy Engine ASIC block diagram according to certain embodiments of the present invention.

[0163] The ASIC 290 contains an interface 206 to an external RISC microprocessor which is known as the Policy Processor 244. Internal to the RISC Processor Interface 206 are registers for all units in the ASIC 290 to signal status to the RISC Processor 244.

[0164] There is an interface 204 to a host PCI Bus 280 which is used for movement of data into and out of the memory 260, and is also used for external access to control registers throughout the ASIC 290. The DMA unit 210 is the Policy Engine 322's agent for master activity on the PCI bus 280. Transactions by DMA 210 are scheduled through the DMA Ring 418. The Memory Controller 240 receives memory access requests from all agents in the ASIC and translates them to transactions sent to the Synchronous DRAM Memory 260. Addresses issued to the Memory Controller 240 will be translated by the Ring Translation Unit 264 if address bit 27 is a `1`, or will be used untranslated by the memory controller 240 to access memory 260 if address bit 27 is a `0`. Untranslated addresses are also examined by the Mailbox Unit 262 and if the address matches the memory address of one of the mailboxes the associated mailbox status bit is set if the transaction is a write, or cleared if the transaction is a read. In addition to the dedicated rings in the Ring Translation Unit 264 which are described here, the Ring Translation Unit also implements 5 general-purpose communications rings COM[4:0] 226 which software can allocate as desired. The memory controller 240 also implements an interface to serial PROMs 270 for obtaining information about memory configuration, MAC addresses, board manufacturing information, Crypto Daughtercard identification and other information.
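The address-bit-27 rule amounts to a one-bit decode, sketched below in Python. The function name and the returned labels are hypothetical. The sketch also makes visible that bit 27 has the value 128 MB, so it is the first address bit above the maximum 128 MB of SDRAM described earlier.

```python
RTU_SELECT_BIT = 27   # per the description: 1 = ring-translated, 0 = direct


def route(address: int) -> str:
    """Decide which path an address takes inside the memory controller."""
    if (address >> RTU_SELECT_BIT) & 1:
        return "ring_translation_unit"
    return "memory"   # untranslated; also snooped by the Mailbox Unit
```

For example, `route(0x07FFFFFC)` selects direct memory access (the last word below 128 MB), while `route(0x08000000)` selects the Ring Translation Unit.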

[0165] The ASIC contains two Fast Ethernet MACs MAC_A and MAC_B. Each contains a receive MAC 216 or 230, respectively, with associated control logic and an interface to the memory unit 220 or 228, respectively; and a transmit MAC 218 or 234 respectively with associated control logic and an interface to the memory unit 222 or 232, respectively. Also associated with each MAC is an RMON counter unit 224 or 236, respectively, which counts certain aspects of all packets received and transmitted in support of providing the Ethernet MIB as defined in Internet Engineering Task Force (IETF) standard RFC 1213 and related RFC's.

[0166] RX_A Ring 402 is used by RX MAC_A controller 220 to obtain empty buffers and to pass filled buffers to Classification Engine 238. Similarly RX_B Ring 404 is used by RX MAC_B controller 228 to obtain empty buffers and to pass filled buffers to Classification Engine 242. TX_A Ring 406 is used to schedule packets for transmission on TX MAC_A 218, and TX_B Ring 408 is used to schedule packets for transmission on TX MAC_B 234.

[0167] There are four Classification Engines 208, 212, 238, and 242 which are microprogrammed processors optimized for the predicate analysis associated with packet filtering. The classification engines are described in FIG. 13. Packets are scheduled for processing by these engines through the use of the Reclassify Rings 412, 416, 410, and 414 respectively, plus the RX MAC controllers MAC_A 220 and MAC_B 228 can schedule packets for processing by Classification Engines 238 and 242, respectively, through use of the RX Rings 402 and 404, respectively.

[0168] There is a Crypto Processor Interface 202 which enables attachment of an encryption processor 246. The RISC Processor 244 can issue reads and writes to the Crypto Processor 246 through this interface, and the Crypto Processor 246 can access SDRAM 260 and control and status registers internal to the interface 202 through use of interface 202.

[0169] A Timestamp counter 214 is driven by a stable oscillator 292 and is used by the RX MAC logic 220 and 228, the TX MAC logic 222 and 232, the Classification Engines 208, 212, 238, and 242, the Crypto Processor 246, and the Policy Processor 244 to obtain timestamps during processing of packets.

[0170] Preferably, the Policy Engine Units have the following characteristics:

[0171] 1. PCI Interface

[0172] 33 MHz operation.

[0173] 32/64-bit data path.

[0174] 32-bit addressing both as a target and as an initiator.

[0175] Initiator and Target interface.

[0176] One interrupt output.

[0177] Up to 32-byte bursts as a master; up to 32-byte bursts to memory (BAR0) as a target (disconnects on 32-byte boundaries), single data-phase operations as a target for Register (BAR1) and Ring Translation Unit (BAR2) spaces.

[0178] Single configuration space for the entire device.

[0179] 2. RISC Processor Interface

[0180] Interface to external SA-110 StrongARM processor, running the bus at ASIC core clock or half core clock as programmed in the Processor Control and Status Register.

[0181] Handles all transaction types for PIO's (reads and writes of I/O registers), cache fills/spills, and non-cached memory accesses.

[0182] Low- and high-priority interrupt signals, driven by enabled bits of PISR and PCSR.

[0183] Boots from main memory; an external agent must initialize memory, download local initialization code etc, and release processor reset to enable operation.

[0184] Support for remap of the trap/reset vector to any location in PE Memory.

[0185] 3. Classification Engine

[0186] Microcoded engine for accelerating comparisons and hash lookups.

[0187] Runs a set of comparisons on fields extracted from 32-bit words within a packet to offload processor.

[0188] Operations can be on fields in the packet, or on pairs of result bits from previous comparisons.

[0189] Produces a result vector of one bit result for each comparison or for each boolean operation on pairs of bits in the vector (selected bits of which are then stored in a data structure in the 2 KB packet buffer).

[0190] Can also execute one or more hash lookups on one or more tables based on keys extracted from the packet. Optimized for linked list chasing through the use of non-blocking loads and speculative fetch of the next record; searches of hash tables implementing conflict resolution by chaining are thus accelerated. The hash lookup results are also stored in the packet buffer in memory.

[0191] Arbitrary fields can be extracted from the packet and returned in the packet's data structure to the PP. Arbitrary computation on extracted fields and result vector bits which yield multi-bit results can also be done in the CE, and the results returned to the PP in the data structure.

[0192] The above computations could also incorporate operands found in hash table records found during the above hash searches.

[0193] The contents of hash table records found using keys extracted from the packet can be updated with results of computations such as those described above.

[0194] Supports fast TCP/IP checksum calculation via use of the "split-add" unit.

[0195] Decisions and branches are supported.

[0196] Comparisons, extractions and computations, and hashing are run speculatively before the packet is handed to the Policy Processor; if the code on the PP (the Action section of the application) needs to run rules against the packet, the comparisons are done and ready for it to use, with single-bit decisions ("predicate analysis results") for each policy to apply. Similarly, if the Action code needs to compute or extract information about the packet, the results of that computation are already available in the packet's data structure.

[0197] Packets are scheduled for classification from both the RX MAC ring and a reclassification ring for the "Inbound" CEs, from a reclassification ring alone for "Outbound" CEs.
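The result-vector idea in the characteristics above (field comparisons followed by boolean operations on pairs of earlier result bits) can be illustrated with a short Python sketch. It is not CE microcode: the particular predicates, the bit assignments, and the assumption that the IPv4 header starts at word 0 of the supplied list are all choices made for this example. The field positions used are those of the standard IPv4 header.

```python
def extract_field(words, word_index: int, shift: int, width: int) -> int:
    """Extract a bit field from one 32-bit word of the packet."""
    return (words[word_index] >> shift) & ((1 << width) - 1)


def classify(words) -> int:
    """Return a result vector with one bit per comparison or boolean operation."""
    vector = 0

    def set_bit(n: int, value) -> None:
        nonlocal vector
        vector |= int(bool(value)) << n

    def bit(n: int) -> int:
        return (vector >> n) & 1

    # Comparisons on fields extracted from the packet.
    set_bit(0, extract_field(words, 0, 28, 4) == 4)   # IP version is 4
    set_bit(1, extract_field(words, 2, 16, 8) == 6)   # protocol is TCP
    set_bit(2, extract_field(words, 0, 24, 4) == 5)   # header has no options
    # Boolean operations on pairs of earlier result bits.
    set_bit(3, bit(0) & bit(1))                       # IPv4 AND TCP
    set_bit(4, bit(3) & bit(2))                       # ...AND no options
    return vector
```

Action code can then test a single bit, such as bit 4 here, instead of re-parsing the header, which is the offload the description attributes to running the comparisons speculatively.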

[0198] 4. Ethernet MACs

[0199] Standard 10/100 Mbit IEEE 802.3u-compliant MAC with MII interface to external PHY.

[0200] Each RX MAC has support for a single unicast address match, multicast hash filter, broadcast packets, and promiscuous mode.

[0201] Serial MII management interface to PHY.

[0202] RX MAC inserts packets along with receive status into 2 KB-aligned buffers, with the packet aligned so that the IP header is on a 32-bit boundary; keeping the receive buffer ring replenished with empty buffers is the only processor interaction with the MAC (i.e. there is no run-time device driver needed for the MAC).

[0203] Transmit MAC follows a ring of buffer pointers; scheduling of transmit buffers from any source is supported through a register which makes enqueuing atomic, thus allowing multiple masters to schedule transmits without mutexes.

[0204] Mode bit for PASS or DROP of bad ethernet packets (CRC errors etc).

[0205] Hardware counters to support RMON ETHER statistics gathering.

[0206] MACs operate on 2.5 MHz/25 MHz RXCLK and TXCLK from the external Fast Ethernet PHY; each has its own clock domain and a synchronizing interface to the ASIC core.

[0207] 5. Memory Controller

[0208] Manages up to two DIMMs of SDRAM.

[0209] Aggressively schedules two banks independently for high performance.

[0210] Arbitrates among many agents; priorities are:

[0211] 1) MAC_A, MAC_B ping-pong (top prio); internal to each MAC, the TX and RX units arbitrate locally for the MAC's memory interface, with ping-pong priority

[0212] 2) Round-robin priority among PP, CE_AI, CE_AO, CE_BI, CE_BO, DMA, PCI_Target, Crypto

[0213] Supports different speed grades of SDRAM, programmable timing.

[0214] Parity generation and checking.

[0215] Serial Presence Detect (SPD) interface.

[0216] Contains the Ring Translation Unit for mapping Ring accesses to Memory addresses.

[0217] Contains the Mailbox address-matching and status unit.
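The two-level arbitration listed above can be sketched as follows. This Python model reflects one plausible reading of "ping-pong" (the two MACs alternate when both request) and of "round-robin" (the search starts after the last winner); the class and method names are hypothetical, and the arbitration between TX and RX inside each MAC is not modeled.

```python
class Arbiter:
    """Illustrative model of the memory controller's request arbitration."""

    MACS = ("MAC_A", "MAC_B")
    OTHERS = ("PP", "CE_AI", "CE_AO", "CE_BI", "CE_BO",
              "DMA", "PCI_Target", "Crypto")

    def __init__(self):
        self.mac_turn = 0   # which MAC is favored next (ping-pong)
        self.rr_next = 0    # where the round-robin search starts

    def grant(self, requests):
        """Pick one requester from a set of agent names, or None if idle."""
        # Top priority: the two MACs, alternating when both are requesting.
        for k in range(len(self.MACS)):
            i = (self.mac_turn + k) % len(self.MACS)
            if self.MACS[i] in requests:
                self.mac_turn = (i + 1) % len(self.MACS)
                return self.MACS[i]
        # Otherwise: round-robin among the remaining agents.
        for k in range(len(self.OTHERS)):
            i = (self.rr_next + k) % len(self.OTHERS)
            if self.OTHERS[i] in requests:
                self.rr_next = (i + 1) % len(self.OTHERS)
                return self.OTHERS[i]
        return None
```

Giving the MACs strict priority reflects that they are bound to wire timing, while the other agents can tolerate a short wait.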

[0218] 6. DMA Engine

[0219] Can be used by PP, Crypto, and also by the host (Application Processor) and PCI peer devices.

[0220] Moves word-aligned bursts of data between SDRAM and the PCI bus.

[0221] Data is transferred between memory and PCI in byte lane order, for endian-neutral transfers of byte streams. See "Endianness" in Section 8.

[0222] Each DMA is controlled by a 16-byte descriptor; the initiator first constructs a descriptor, then enqueues a pointer to that descriptor on the DMA Ring to schedule the transfer.

[0223] Atomic enqueueing is supported to eliminate locks when scheduling DMAs.

[0224] At completion of each DMA, the unit can optionally set one of 8 status bits in the PISR (Processor Interrupt Status Register) or one of 8 status bits in the HISR (Host Interrupt Status Register), as indicated in the descriptor.

[0225] DMA engine ignores lower 11 bits of the SDRAM address, using a separate "buffer offset" instead. This is to support the buffer tag field in the buffer pointer used by software.

[0226] Descriptor is defined in "DMA Command Queue and Descriptors" in Section 6.

[0227] PCI command code is carried in the descriptor for flexibility.

[0228] 7. Crypto Control

[0229] PE ASIC hosts a 32-bit PCI bus for connecting to the Crypto coprocessor(s), with two external request/grant pairs and two interrupt inputs. PP can directly access devices on this bus.

[0230] Four BARs ("Base Address Registers", which are part of the PCI standard) are supported: BAR0 for Memory, BAR1 for access to the ring status bits, BAR2 for access to the rings, and BAR3 for prefetched access to Memory.

[0231] Packets are scheduled for encryption by placing a Crypto descriptor in a data structure in the packet buffer in memory, then enqueueing the pointer to that buffer in the Crypto Ring. (Communication Ring 4 is also available for similar use with a second coprocessor.)

[0232] The Crypto chip will detect queue-not-empty by polling the CSTAT (Crypto Status Register) and will dequeue the buffer pointer at the head of the queue for processing. Two rings are available so that up to two devices can be supported for this function.

[0233] After processing a packet, the Crypto chip will write the results back to memory and then enqueue the buffer pointer on the specified destination ring (for further classification, for examination on the PP, for DMA to a target on the PCI bus, or for transmit.)

[0234] 8. Mailbox Unit

[0235] Monitors 16 word-sized mailboxes in memory space.

[0236] On address match, sets (clears) the status bit in the Mailbox Status Register associated with the word written (read). Selected status bits contribute to a Mailbox Attention status bit in the PISR.
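
By way of illustration only, the set-on-write and clear-on-read behavior of the Mailbox Unit might be modeled in Python as shown below. The class name `MailboxUnit`, the 32-bit word width, and the `attention_mask` parameter are assumptions made for this sketch and are not taken from the specification.

```python
class MailboxUnit:
    """Toy model of 16 mailboxes, each with one status bit."""

    NUM_MAILBOXES = 16

    def __init__(self, attention_mask=0xFFFF):
        self.words = [0] * self.NUM_MAILBOXES
        self.status = 0                        # models the Mailbox Status Register
        self.attention_mask = attention_mask   # hypothetical selection mask

    def write(self, n, value):
        """A write to mailbox n stores the word and sets its status bit."""
        self.words[n] = value & 0xFFFFFFFF     # assumes 32-bit words
        self.status |= 1 << n

    def read(self, n):
        """A read of mailbox n returns the word and clears its status bit."""
        self.status &= ~(1 << n)
        return self.words[n]

    def attention(self):
        """True when any selected status bit is set (Mailbox Attention)."""
        return (self.status & self.attention_mask) != 0
```

In this model a consumer would check `attention()` and then read the flagged mailbox, which clears the corresponding bit.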

[0237] 9. Ring Translation Unit

[0238] Base pointer to a 64 KB region of memory (only the first 60 KB are used, 4 KB remainder is available for other use).

[0239] Maintains 15 rings as memory arrays of 1K 32-bit entries each.

[0240] Reads and writes to rings through the RTU are mapped to locations in these arrays.

[0241] Some index registers auto-increment, others are incremented by their owner.

[0242] Delta between producer-consumer index pairs is detected in hardware. Any delta is signaled to the consumer indicating that there is work to do.

[0243] 10 of the rings have specific assignment as shown in FIG. 3.

[0244] 5 general-purpose rings COM[4:0] are provided for software to allocate as desired; expected use includes a freelist for DMA descriptors and a freelist of buffers for the AP or peers to use, messages-in to the PP, and others. COM4 can optionally be used as a second Crypto ring.

[0245] Overflow/underflow and threshold conditions are detected and reported through the CRISIS register in the Policy Processor interface.

[0246] 10. Global TIMER

[0247] 32-bit up-counter driven from an external, asynchronous clock source.

[0248] Counts at 1 µs in bit 3 (leaving room for finer granularity in future higher speed implementations). Counter rolls over approximately every 536.87 seconds.

[0249] Status bit in PISR/HISR sets on every transition (high-low and low-high) in bit[30] to simplify software extension of the timer value.
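
The timer arithmetic above can be checked with a short Python sketch. It assumes that "1 µs in bit 3" means bit 0 has a weight of 1/8 µs; the helper names `ticks_to_seconds` and `extend`, and the idea of a software-maintained wrap count, are hypothetical and only illustrate one way software might extend the timer value.

```python
COUNTER_BITS = 32
TICKS_PER_US = 1 << 3   # bit 3 has a weight of 1 us, so bit 0 is 1/8 us


def ticks_to_seconds(ticks):
    """Convert raw counter ticks to seconds."""
    return ticks / TICKS_PER_US / 1_000_000


# Time for the full 32-bit counter to wrap.
ROLLOVER_SECONDS = ticks_to_seconds(1 << COUNTER_BITS)

# Time between successive transitions of bit[30].
BIT30_TOGGLE_SECONDS = ticks_to_seconds(1 << 30)


def extend(wrap_count, counter):
    """Combine a software-maintained wrap count with the 32-bit hardware value."""
    return (wrap_count << COUNTER_BITS) | (counter & 0xFFFFFFFF)
```

Under these assumptions the rollover period evaluates to about 536.87 seconds, in agreement with the figure stated above.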

[0250] An Ethernet crystal (buffered copy) is used as the clock source since it is the most stable timebase available. Runs at 25 MHz.

[0251] In multi-PE implementations, all PE's receive the same clock source to avoid relative drift in timestamps. In systems using multiple PCI cards each containing a PE they each receive a local, non-aligned clock.

[0252] Used by MACs, Classification Engines, and PP for marking events; used for monitoring performance and packet arrival order as needed.

[0253] 11. Serial PROM

[0254] Support for a 24C02 256-byte serial PROM at serial address 0x7; the memory DIMMs are at addresses 0x0 and 0x1 for slots 0 and 1 (if supported).

[0255] PROM at 0x7 contains two MAC addresses, full/half-speed control indication for the processor bus, manufacturing information, and other configuration and tracking information.

[0256] Additional devices on the SPD bus include a Crypto Daughtercard IDPROM at address 0x6, and a thermal sensor at address 0x4.

III. Data Structures

[0257] 1. Ring Array in Memory

[0258] The 15 rings are packed into a 60 KB array aligned on a 64 KB boundary in memory. The RING_BASE register points to the start of this array. Each ring is 4 KB in size and can hold up to 1K entries of 32 bits each.

[0259] FIG. 5 illustrates a ring array in memory.

[0260] The Ring Translation Unit (RTU) 264 manages 15 arrays in memory 260 for communication purposes. Each ring actually consists of 1024 32-bit entries in memory for a total of 4 KB per ring, along with index registers and logic for detecting differences between the index register for a producer and the index register for the associated consumer, which is reported to that consumer as an indication that there is work for it to do. Various near-full-threshold, near-empty-threshold, full, and empty conditions are detected as appropriate to each ring and are reported to the ring users and to the Policy Processor 244 as appropriate. The RTU 264 translates Ring accesses into both a memory 260 access at a translated address, and in some cases into commands to increment specific index pointers after completing that memory access. Each ring is assigned a number for mapping purposes, and that number is used to index into the array of memory 260 in which the rings are implemented. The index registers are incremented modulo 4 KB so that FIFO behavior is achieved. Each index register contains one more significant bit than is used for addressing, so that a full ring can be differentiated from an empty ring.
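
The role of the extra significant bit can be illustrated with a minimal Python sketch. For simplicity it counts in entries rather than bytes, so 10 bits address the 1024 entries and an 11th bit records wrap-around; the function names are hypothetical.

```python
RING_ENTRIES = 1024
ADDR_BITS = 10                              # bits needed to address 1024 entries
INDEX_MASK = (1 << (ADDR_BITS + 1)) - 1     # one extra (wrap) bit is kept


def advance(index):
    """Increment an index register, keeping the extra wrap bit."""
    return (index + 1) & INDEX_MASK


def occupancy(producer, consumer):
    """Number of entries produced but not yet consumed."""
    return (producer - consumer) & INDEX_MASK


def is_empty(producer, consumer):
    return occupancy(producer, consumer) == 0


def is_full(producer, consumer):
    return occupancy(producer, consumer) == RING_ENTRIES


def slot(index):
    """Entry actually addressed in the ring array (wrap bit dropped)."""
    return index & (RING_ENTRIES - 1)
```

When the ring is full, the producer and consumer address the same slot and differ only in the wrap bit, which is what lets the full case be told apart from the empty case.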

[0261] A Ring Base Register 400 selects the location in memory 260 of the base of the 64 KB-aligned array 440 represented in FIG. 5. The structure is an array of arrays; there is an array of 15 rings indexed by the ring number, and each of those rings is a 4 KB array of 1024 32-bit entries indexed by various index registers used by different agents.
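
The array-of-arrays addressing described above might be sketched in Python as follows. The function name `ring_entry_address` and the numbering of rings from 0 to 14 are assumptions for illustration; the actual assignment of ring numbers is given by FIG. 5.

```python
RING_BYTES = 4096       # each ring is a 4 KB array
ENTRY_BYTES = 4         # each entry is one 32-bit word
NUM_RINGS = 15
ARRAY_ALIGN = 64 * 1024


def ring_entry_address(ring_base, ring_number, index):
    """Translate (ring number, entry index) into a memory byte address."""
    if ring_base % ARRAY_ALIGN:
        raise ValueError("ring array must be 64 KB-aligned")
    if not 0 <= ring_number < NUM_RINGS:
        raise ValueError("ring number out of range")
    # The entry offset wraps within the 4 KB ring to give FIFO behavior.
    offset = (index * ENTRY_BYTES) % RING_BYTES
    return ring_base + ring_number * RING_BYTES + offset
```

With a base of 0x10000, for example, the last entry of the last ring falls just below the 60 KB mark of the array, leaving the final 4 KB of the 64 KB region unused by the rings.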

[0262] RX_A Ring 402 and RX_B Ring 404 implement the structure described in FIG. 6, and are associated with the receive streams from RX MAC_A 220 and RX MAC_B 228 respectively. TX_A Ring 406 and TX_B Ring 408 implement the structure of FIG. 8, and are associated with the transmit MACs 222 and 232 respectively. The Reclassify Rings 410, 412, 414, and 416 are used to schedule packets for classification on Classification Engines 238, 208, 242, and 212 respectively, and implement the structure shown in FIG. 10.

[0263] DMA Ring 418 is used to schedule descriptor pointers for consumption by DMA Unit 210, and implements the structure shown in FIG. 12. Crypto Ring 420 is used to schedule buffers for processing on the Crypto Processor 246 and implements the structure shown in FIG. 11. The five general purpose communication rings COM[4:0] are available for assignment by software and also implement the structure shown in FIG. 11.

[0264] 2. RX Buffer Pointer Ring and Produce/Consume Pointers

[0265] A ring of buffer pointers resides in the memory for each RX MAC. Associated with this ring are produce and consume index pointers for the various users of these buffers to access specific rings. The Policy Processor allocates free, empty buffers to the MAC by writing them to the associated MPROD address in the Ring Translation Unit (RTU), which writes the buffer address into the ring and increments the MPROD pointer modulo ring size. The RX MAC chases that pointer with the MFILL index which is used to find the next available empty buffer. That pointer is chased by MCCONS which is used by the Classification Engine to identify the next packet to run the classification microcode on. The PP uses a status bit in the PISR to see that there is at least one classified packet to process, then reads the ring through MPCONS in the RTU to identify the next buffer that the PP needs to process.

[0266] FIG. 6 shows an RX Ring Structure related to certain embodiments of the present invention. There are two RX Rings 402 and 404. Each is located in the Ring Array in memory 260. Each has four index registers associated with it. FIG. 6 shows the ring as an array in memory with lower addresses to the top and higher addresses to the bottom of the picture.

[0267] The ring's base address 510 is a combination of the Ring Base Register 400 and the ring number which is used to index into the Ring Array 440 as shown in FIG. 5. Two instances of the set of four index registers MPCONS 512, MCCONS 514, MFILL 516, and MPROD 518 are used to provide an offset from the RX Ring Base 510 of the particular ring 402 or 404, each of which is a 4 KB array 520.

[0268] MPROD 518 is the lead producer index for this ring. The Policy Processor 244 or the Application Processor 302 enqueues buffer pointers into the RX Ring 402 or 404 by writing the buffer pointer to the RTU's enqueue address for the particular ring 402 or 404, which causes the RTU to write the buffer pointer to the location in memory 260 referenced by MPROD 518, and then to increment MPROD 518 modulo the ring size of 4096 bytes. This process allocates an empty buffer to the RX MAC MAC_A or MAC_B associated with ring 402 or 404 respectively.

[0269] MPROD 518 and MFILL 516 have a producer-consumer relationship. Any time there is a difference between the value of MPROD 518 and MFILL 516, the RTU 264 signals to the associated RX MAC MAC_A or MAC_B that it has empty buffers available. The region 506 in the RX Ring 402 or 404 represents one or more valid, empty buffers that have been allocated to the associated RX MAC by enqueueing the pointers to those buffers.

[0270] When the RX MAC MAC_A or MAC_B receives a packet, it obtains the buffer pointer referenced by its associated MFILL pointer 516 by reading from the RTU's MFILL address and then writes the packet and associated RX Status 600 and RX Timestamp 602 into the buffer pointed to by that buffer pointer. When the RX MAC has successfully received a packet and has finished transferring it into the buffer, it increments the index MFILL 516 by a hardware signal to the RTU which causes the RTU to increment MFILL 516 modulo the ring size of 4096 bytes. MFILL 516 and MCCONS 514 have a producer-consumer relationship; when the RTU 264 detects a difference between the value of MFILL 516 and MCCONS 514 it signals to that ring's associated Classification Engine 238 or 242 that it has a freshly received packet to process. The region 504 in the ring array contains the buffer pointers to one or more full, unclassified buffers that the RX MAC has passed to the associated Classification Engine.

[0271] The Classification Engine 238 or 242 receives a signal if the RTU 264 detects full, unclassified packets in RX Ring 402 or 404, respectively. When the dispatch microcode on that CE 238 or 242 tests the ring status and sees this signal from the RTU 264, that CE 238 or 242 obtains the buffer pointer by reading from the RTU's MCCONS address for that ring. When the CE 238 or 242 has finished processing that buffer and has written all results back to memory 260, it signals to the RTU 264 to increment its associated MCCONS index 514. Upon receiving this signal the RTU 264 increments MCCONS 514 modulo the ring size of 4096 bytes. By sending the signal, the CE 238 or 242 has indicated that it is done processing that packet and that the packet is available for the consumer, which is action code 108 running on the Policy Processor 244. The region 502 contains the buffer pointers for one or more full, classified packets that the Classification Engine has passed to the Action Code 108.

[0272] MCCONS 514 and MPCONS 512 have a producer-consumer relationship. When the CE 238 or 242 has produced a full, classified packet then that packet is available for consumption by the action code 108. The RTU detects when there is a difference between the values of MCCONS 514 and MPCONS 512 and signals this to the Policy Processor 244 through a status register in the Processor Interface 206. The Policy Processor 244 monitors this register, and when dispatch code on the Policy Processor 244 determines that it is ready to process a full, classified packet it dequeues the buffer pointer of that packet from the RX Ring 402 or 404, as appropriate, by reading the RTU's dequeue address for that ring. This read causes the RTU to return to the Policy Processor 244 the buffer pointer referenced by that ring's MPCONS index 512, and then to increment MPCONS 512 modulo the ring size of 4096 bytes. The act of dequeueing the buffer pointer means that the pointer no longer has any meaning in the RX ring. The contents of the ring in locations between MPCONS 512 and MPROD 518 have no meaning, and are indicated by the Invalid regions 500 and 508. Since this is a ring structure which wraps, 500 and 508 are actually the same region; in the figure shown, due to the current values of the ring index pointers 512, 514, 516, and 518, the Invalid regions 500 and 508 happen to wrap across the start and end of the array containing this ring, but it should be obvious to one skilled in the art that under normal circumstances these ring index pointers can have different values and any of regions 502, 504, or 506 could also be the region which wraps around the end and beginning of the array 520.
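
The chain of producer-consumer relationships among MPROD, MFILL, MCCONS, and MPCONS can be summarized with a small software model. The Python sketch below is illustrative only: the class and method names are hypothetical, indexes count entries rather than bytes, and the hardware signaling is reduced to checks on index differences.

```python
class RxRing:
    """Toy model of one RX ring and its four chasing index registers."""

    SIZE = 1024  # entries per ring

    def __init__(self):
        self.slots = [None] * self.SIZE
        self.mprod = self.mfill = self.mccons = self.mpcons = 0

    def _delta(self, lead, follow):
        # Indexes carry one extra wrap bit, so deltas are taken modulo 2*SIZE.
        return (lead - follow) % (2 * self.SIZE)

    def _take(self, name, lead):
        """Return the pointer at index `name` and advance it, if it trails `lead`."""
        index = getattr(self, name)
        if self._delta(lead, index) == 0:
            raise RuntimeError("nothing to consume at " + name)
        setattr(self, name, (index + 1) % (2 * self.SIZE))
        return self.slots[index % self.SIZE]

    def enqueue_empty(self, buf_ptr):
        """PP or AP allocates an empty buffer to the RX MAC (advances MPROD)."""
        if self._delta(self.mprod, self.mpcons) == self.SIZE:
            raise OverflowError("ring full")
        self.slots[self.mprod % self.SIZE] = buf_ptr
        self.mprod = (self.mprod + 1) % (2 * self.SIZE)

    def mac_receive(self):
        """RX MAC fills the next empty buffer (advances MFILL)."""
        return self._take("mfill", self.mprod)

    def ce_classify(self):
        """Classification Engine finishes a packet (advances MCCONS)."""
        return self._take("mccons", self.mfill)

    def pp_dequeue(self):
        """Policy Processor dequeues a classified packet (advances MPCONS)."""
        return self._take("mpcons", self.mccons)

    def regions(self):
        """Entry counts for regions 506, 504, and 502 of FIG. 6."""
        return {
            "empty_506": self._delta(self.mprod, self.mfill),
            "unclassified_504": self._delta(self.mfill, self.mccons),
            "classified_502": self._delta(self.mccons, self.mpcons),
        }
```

Each stage can only advance as far as the stage ahead of it, so a buffer pointer passes through the empty, unclassified, and classified regions in order before it is dequeued.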

[0273] 2.1 RX Buffer Structure

[0274] The receive data buffer is a 2 KB structure which contains an Ethernet packet and information about that packet. A substantially similar format is used for transmitting the packet, as indicated in FIG. 8. The packet offset from the base of the buffer is designed so that upon receive the Ether header is offset by two bytes into a word, thus aligning the IP header on a word (32-bit) boundary. Enough space is left before the packet so that encapsulation/encryption headers (e.g., up to 40 bytes for a standard IPv6 header plus AH and ESP) can be inserted for encapsulation of the packet without copying the packet, by just copying the Ethernet header up to make space and then inserting the encapsulation headers. The total pad size is 112 Bytes; if more is needed then the Crypto Coprocessor can realign the packet when writing it back.

[0275] The RX MAC can be programmed to either drop bad packets or receive them normally; if the latter, then error status is also shown in the buffer RX status field.

[0276] FIG. 7 illustrates the receive buffer format.

[0277] A packet is passed around the system by placing it into a packet buffer 620 and then passing the 2 KB-aligned buffer pointer among units via pointer rings implemented by the RTU 264. The RX Status and Transmit Command Word 600 is always located at the word pointed to by the 2 KB-aligned buffer pointer. All hardware in the Policy Engine 322 is designed to assume that a buffer pointer is 2 KB-aligned and to ignore bits [10:0], which allows software to use bits [10:0] of the buffer pointer to carry software tag information associated with that buffer.
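
A minimal Python sketch of how software might pack and unpack a tag in bits [10:0] of a buffer pointer is given below. The function names and the assumption of a 32-bit pointer are illustrative only.

```python
BUFFER_ALIGN = 2048
TAG_MASK = BUFFER_ALIGN - 1      # bits [10:0] of the pointer


def make_pointer(base, tag):
    """Combine a 2 KB-aligned buffer address with a software tag."""
    if base & TAG_MASK:
        raise ValueError("buffer address must be 2 KB-aligned")
    if not 0 <= tag <= TAG_MASK:
        raise ValueError("tag must fit in bits [10:0]")
    return base | tag


def buffer_base(pointer):
    """Address as seen by hardware, which ignores bits [10:0]."""
    return pointer & ~TAG_MASK & 0xFFFFFFFF   # assumes 32-bit pointers


def software_tag(pointer):
    """Tag carried by software in bits [10:0]."""
    return pointer & TAG_MASK
```

Because hardware masks off the low 11 bits, a tagged pointer can be enqueued on any ring without first stripping the tag.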

[0278] Upon receiving a packet the RX MAC 220 or 228 places that packet at an offset of (130) bytes from the beginning of a buffer 620, and writes zero to the bytes at byte offset (128) and (129) from the beginning of that buffer; these two bytes are called the Ethernet Header Pad 618. The packet consists of the (14)-byte Ethernet header 610 and the payload 612 of the Ethernet packet, which are stored contiguously in the buffer 620. The reason for inserting the Ethernet Header Pad is to force protocol headers encapsulated in the Ethernet packet to be word (32-bit) aligned for ease in further processing; encapsulated protocols such as IP, TCP, UDP, etc. have word-oriented formats.
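
The effect of the two-byte Ethernet Header Pad on alignment can be verified with simple arithmetic, sketched here in Python. The constant and function names are chosen for this illustration.

```python
PAD_START = 128          # byte offset of the Ethernet Header Pad
PAD_BYTES = 2
ETH_HEADER_LEN = 14

PKT_OFFSET = PAD_START + PAD_BYTES            # Ethernet header starts here
ip_header_offset = PKT_OFFSET + ETH_HEADER_LEN


def is_word_aligned(offset):
    """True when a byte offset falls on a 32-bit word boundary."""
    return offset % 4 == 0


# Offset the encapsulated header would have without the pad.
unpadded_ip_header_offset = PAD_START + ETH_HEADER_LEN
```

With the pad, the encapsulated header lands on a word boundary; without it, the 14-byte Ethernet header would leave the encapsulated header two bytes into a word.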

[0279] The RX MAC control logic 220 or 228 then writes the RX Status Word 600 into the buffer 620 at an offset of (0) from the start of the buffer, and an RX Timestamp 602 as a 32-bit word at byte offset (4) from the start of the buffer 620. The RX Status Word has the format shown in Table 1. The timestamp is the value obtained from the Timestamp Register 214 at the time the RX status 600 is written to the buffer 620. The TX Status Word 604 and the TX Timestamp 606 are not written at this time, but those locations covering the two 32-bit words at offsets of 8 and 12 bytes, respectively, from the start of the buffer 620 are reserved for later use by the TX MAC controllers 222 and 232.

[0280] The format for the RX Status word in Table 1 is such that it can be used directly as a TX Command Word without modification; the fields LENGTH and PKT_OFFSET have the same meaning in both formats. The RX MAC controller 220 or 228 subtracts (4) bytes from the Ethernet packet's length before storing the LENGTH field in the RX Status Word 600 such that the (4-byte) Ethernet CRC is not counted in LENGTH, so that the buffer can be handed to a TX MAC 222 or 232 without need for the Policy Processor 244 modifying the contents of the buffer.

[0281] Pad Space 608 is left before the start of the packet 610 and 612 in buffer 620 to support the addition of encapsulating protocol headers without copying the entire packet. Up to (112) bytes of encapsulation header(s) can be inserted simply by copying the Ethernet header 610 (and possibly an associated SNAP encapsulation header in the start of payload 612) upwards into the Pad Space 608 by the number of bytes necessary to make room for the inserted headers, which are then written into the location that was opened up for them in areas 608, 610, and 612 as needed. If more than (112) bytes of encapsulation header are being inserted then the entire payload 612 must be copied to a different location in the buffer to make room for the inserted headers.
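
The header-insertion technique described above might be sketched in Python as follows. The function name `insert_encapsulation` is hypothetical, SNAP headers are not handled, and the lower bound of 16 on the packet offset is taken from the range stated for PKT_OFFSET in the description of the transmit buffer.

```python
ETH_HEADER_LEN = 14
MIN_PKT_OFFSET = 16   # lowest legal PKT_OFFSET per the transmit description


def insert_encapsulation(buf, pkt_offset, length, new_header):
    """Insert new_header after the Ethernet header without moving the payload.

    buf is a bytearray holding the packet buffer. Returns the updated
    (PKT_OFFSET, LENGTH) pair for the status/command word.
    """
    n = len(new_header)
    new_offset = pkt_offset - n
    if new_offset < MIN_PKT_OFFSET:
        raise ValueError("not enough pad space; payload must be relocated")
    # Copy the Ethernet header upwards into the pad space.
    eth = bytes(buf[pkt_offset:pkt_offset + ETH_HEADER_LEN])
    buf[new_offset:new_offset + ETH_HEADER_LEN] = eth
    # Write the new header into the gap that was opened up.
    gap = new_offset + ETH_HEADER_LEN
    buf[gap:gap + n] = new_header
    return new_offset, length + n
```

Only the 14-byte Ethernet header and the inserted header are written; the payload stays where the RX MAC placed it.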

[0282] The per-packet software data structure 614 is used by the classification 106, action code 108, encryption processing 112, the host 302 and PCI peers 322, 314, and 316 to carry information about the packet that is carried in the buffer 620. The location of the software data structure 614 and the sizes of the packet header 610 and packet payload 612, as well as the total size of the packet buffer 620 are not hard limits in the preferred embodiment. The 2 KB-alignment of the RX status word 600 and RX Timestamp are enforced by the hardware; but packets from other sources and also from other media besides Ethernet can be injected into the classification flow of FIG. 2 as follows. The SOURCE field of the RX status word 600 as shown in Table 1 has only a few reserved codes; the rest can be assigned by software to identify packets from other sources and also from other media which do not share the packet format or packet size of Ethernet. By software convention larger buffers can be assigned by grouping contiguous 2 KB buffers together and treating them as one buffer; the pointer to this larger buffer 602 will still be 2 KB-aligned and the RX Status Word 600 and RX Timestamp 602 will still reside at that location in the buffer. The packet area 610 and 612 can be made arbitrarily large to accommodate a packet from a different medium. The location of the software data structure 614 can be moved downwards as the larger payload space is allocated. Alternatively the software can choose to allocate buffers so that they have space before the 2 KB-aligned RX Status Word 600, and carry the software data structure 614 above the RX Status Word 600 rather than below the Payload 612 as shown in FIG. 7. The advantage of this second approach is that the location of the software data structure is always known to be at a fixed location relative to the RX Status Word 600, rather than having that location be a variable depending on different media and the resulting variations in the size of the packet payload 612.

[0283] The section marked "Available for software use" contains transient per-packet information such as the result vector and hash pointers output by the Classification Engine, a command descriptor for the Crypto Unit, buffer reference counts, an optional pointer to an extension buffer, and any other data structures that the software defines. "TX Status/TX Timestamp" is optionally written by the transmit MAC if it is programmed to do so; that field contains garbage after an RX.

[0284] The "RX Timestamp" field contains the 32-bit value of the chip's TIMER register at the time that the packet was successfully received (approximately the time of receipt of the end of packet) and the RX_STATUS field was written. The "RX Status" field is one 32-bit word with the following format:

[0285] Note throughout this document that bit [31] is the left (most significant) bit of a 32-bit word, and bit [0] is right (least significant). "MCSR" mentioned in Table 1, below, is the MAC Control and Status Register.

TABLE 1
Ethernet RX Status Word and TX Command Word Format

Bits       Field        Description
[31]       BAD_PKT      Summary error bit; set if any of [30:27, 15:14] is set, which can only happen if the MAC is programmed to receive bad frames.
[30]       CRC_ERR      Ethernet frame had incorrect CRC and (MCSR[RCV_BAD]==1) for this MAC.
[29]       RUNT         Ethernet frame was smaller than legal and (MCSR[RCV_BAD]==1) for this MAC.
[28]       GIANT        Ethernet frame was larger than legal and (MCSR[RCV_BAD]==1) for this MAC.
[27]       PREAMB_ERR   Invalid preamble and (MCSR[RCV_BAD]==1) for this MAC. This error is associated with some previous event, not with the current packet.
[26:16]    LENGTH       For RX, number of bytes in the Ethernet frame including the Ethernet header but not including the Ethernet CRC. For TX, length of packet, including CRC if (MCSR[CRC_EN]==0).
[15]       DRBL_ERR     Odd number of nibbles received (dribble) and (MCSR[RCV_BAD]==1) for this MAC.
[14]       CODE_ERR     4b/5b encoding error and (MCSR[RCV_BAD]==1) for this MAC.
[13]       BCAST        The received packet was a broadcast packet (destination address is all 1's).
[12]       MCAST        The received packet was a multicast packet and was passed by the multicast hash filter.
[11:08]    SOURCE       This indicates the source of the packet or other source as marked later by software. If the packet was generated at a RX MAC then this field is 0x0 for MAC_A or 0x1 for MAC_B.
[07:00]    PKT_OFFSET   This is the byte offset from the beginning of the packet buffer to the first byte of the Ethernet header. Other agents may choose to move this offset in order to encapsulate the IP packet or to strip off encapsulation headers. The CE, PP, and AP all use this offset when accessing the frame in this buffer. The RX MAC will always write a value of 0x82 into this field, indicating that the Ethernet Frame was received into the buffer starting at byte offset 130 from the start of the buffer.
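
As an illustration of the bit layout in Table 1, the following Python sketch unpacks a 32-bit RX Status word into its named fields. The function name `decode_rx_status` is hypothetical; the bit positions follow the table.

```python
def decode_rx_status(word):
    """Split a 32-bit RX Status / TX Command word into the Table 1 fields."""
    return {
        "BAD_PKT":    (word >> 31) & 0x1,
        "CRC_ERR":    (word >> 30) & 0x1,
        "RUNT":       (word >> 29) & 0x1,
        "GIANT":      (word >> 28) & 0x1,
        "PREAMB_ERR": (word >> 27) & 0x1,
        "LENGTH":     (word >> 16) & 0x7FF,   # bits [26:16], 11 bits
        "DRBL_ERR":   (word >> 15) & 0x1,
        "CODE_ERR":   (word >> 14) & 0x1,
        "BCAST":      (word >> 13) & 0x1,
        "MCAST":      (word >> 12) & 0x1,
        "SOURCE":     (word >> 8) & 0xF,      # bits [11:08]
        "PKT_OFFSET": word & 0xFF,            # bits [07:00]
    }
```

For a transmit, only the LENGTH and PKT_OFFSET entries of the result would be of interest, since the TX MAC ignores the remaining fields.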

[0286] The same packet buffer format is used for encryption and transmission; for those uses the only meaningful fields are LENGTH, PKT_OFFSET and the contents of the Ethernet frame found at that offset; plus for encryption the encryption descriptor included in the "Software" area in the buffer.

[0287] 3. TX Buffer Pointer Rings and Producer/Consumer Pointers

[0288] A packet gets scheduled for transmission by enqueueing the address of the buffer onto the pointer queue for that transmit MAC, by writing it to MTPROD in the RTU (MAC A and MAC B each have their own ring and associated registers). Any time the produce pointer is not equal to the consume pointer for that ring, the associated MAC will be notified that there is at least one packet to transmit and will follow the pointer to obtain the next buffer to deal with. When the packet has been retired the TX controller will write back status if configured to do so, then increment the consume pointer and continue to the next buffer (if any.)

[0289] The recover pointer is used to track retired buffers (either successfully transmitted or abandoned due to transmit termination conditions) for return to the buffer pool, or possibly for a retransmit attempt; the PP is signaled by the RTU that there is a delta between MTCONS and MTRECOV, and then reads the Ring through the RTU register MTRECOV to get the pointer to the next buffer to recover. MTPROD, MTCONS, and MTRECOV are duplicated for each instance of a transmit MAC.

[0290] FIG. 8 illustrates the TX Ring Structure according to certain embodiments of the present invention.

[0291] The TX Rings 406 and 408 have substantially the same structure as the RX Rings described previously. The fundamental differences are that there is one fewer interim producer-consumer using this ring, and that this ring is assigned for a different function with different agents using it. Each ring 406 and 408 is a 4096-byte array 720 in memory 260.

[0292] A packet is scheduled for transmit on the TX MACs 222 or 232 by enqueuing a pointer to the buffer containing the packet onto TX Ring 406 or 408, respectively. The buffer pointer is enqueued onto 406 or 408 by any agent, by writing the buffer pointer to the RTU 264 enqueue address for that ring. The RTU 264 writes the buffer pointer to the location in memory 260 referenced by the MTPROD index register 716, and then increments MTPROD 716 modulo the ring size of 4096 bytes. There is a producer-consumer relationship between MTPROD 716 and MTCONS 714; when the RTU detects a difference in the values of MTPROD 716 and MTCONS 714 it signals to the associated TX MAC controller 222 or 232 that there is a packet ready to transmit. The region 706 in the TX Ring 406 or 408 contains one or more buffer pointers for the buffers containing packets scheduled for transmission.

[0293] The TX MAC controller 222 or 232 obtains the buffer pointer for the buffer 206 containing this packet by reading the RTU's MTCONS address for TX Ring 406 or 408, respectively, which causes the RTU to return to the MAC the buffer pointer in memory 260 referenced by MTCONS 714. When the TX MAC 218 or 234 has successfully transmitted this packet or has abandoned transmitting this packet due to transmit termination conditions, its controller 222 or 232 respectively will optionally write back TX Status 806 and TX Timestamp 808 if it has been configured to write status, then retires the buffer by signaling to the RTU 264 to increment MTCONS 714. Upon receiving this signal the RTU 264 will increment MTCONS 714 modulo the ring size of 4096 bytes.

[0294] Index registers MTCONS 714 and MTRECOV 712 have a producer-consumer relationship. When the RTU detects a difference in their values, it signals to the PP that the associated TX ring 406 or 408 has a retired buffer to recover. That information is visible to the Policy Processor 244 in a status register in Processor Interface 206 which the Policy Processor 244 polls on occasion to see what work it needs to dispatch. Upon testing the RECOVER status for the TX Ring 406 or 408 and detecting that there is at least one buffer to recover, the Buffer Recovery code 118 reads the RTU's 264 MTRECOV address for that ring to dequeue the buffer pointer from the TX ring 406 or 408. The read causes the RTU to return the buffer pointer referenced by MTRECOV 712, and then to increment MTRECOV 712 modulo the ring size of 4096 bytes. The region 704 contains the buffer pointers of buffers which have been retired by the TX MAC 222 or 232 but have not yet been recovered by the Buffer Recovery code 118.

[0295] The regions 702 and 708 are the same region, which in the figure shown spans the end and the beginning of the array 720 in memory 260 which contains the TX Ring 406 or 408. This region contains entries which are neither a buffer pointer to a buffer ready for transmit, nor a buffer pointer to a buffer which the TX MAC 222 or 232 has retired but the recovery code 118 has not yet dequeued. For the purposes of a TX Ring 406 or 408 this region consists of space into which more packets may be scheduled for transmit. One skilled in the art will recognize that region 704 or region 706 could just as easily be the region wrapping around the array boundary, depending on the values of MTRECOV 712, MTCONS 714, and MTPROD 716.
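
The three-index TX ring can be modeled in the same spirit as the RX ring. The Python sketch below is illustrative only; the class and method names are hypothetical, indexes count entries rather than bytes, and the two status signals are reduced to comparisons of index values.

```python
class TxRing:
    """Toy model of one TX ring with MTPROD, MTCONS, and MTRECOV."""

    SIZE = 1024  # entries per ring

    def __init__(self):
        self.slots = [None] * self.SIZE
        self.mtprod = self.mtcons = self.mtrecov = 0

    def _next(self, index):
        # One extra wrap bit is kept, so indexes run modulo 2*SIZE.
        return (index + 1) % (2 * self.SIZE)

    def tx_pending(self):
        """Signal to the TX MAC: a packet is ready to transmit (region 706)."""
        return self.mtprod != self.mtcons

    def recover_pending(self):
        """Signal to the PP: a retired buffer awaits recovery (region 704)."""
        return self.mtcons != self.mtrecov

    def schedule(self, buf_ptr):
        """Any agent enqueues a buffer for transmit (advances MTPROD)."""
        if (self.mtprod - self.mtrecov) % (2 * self.SIZE) == self.SIZE:
            raise OverflowError("ring full")
        self.slots[self.mtprod % self.SIZE] = buf_ptr
        self.mtprod = self._next(self.mtprod)

    def mac_retire(self):
        """TX MAC transmits or abandons the next packet (advances MTCONS)."""
        if not self.tx_pending():
            raise RuntimeError("nothing to transmit")
        ptr = self.slots[self.mtcons % self.SIZE]
        self.mtcons = self._next(self.mtcons)
        return ptr

    def recover(self):
        """Buffer Recovery code dequeues a retired buffer (advances MTRECOV)."""
        if not self.recover_pending():
            raise RuntimeError("nothing to recover")
        ptr = self.slots[self.mtrecov % self.SIZE]
        self.mtrecov = self._next(self.mtrecov)
        return ptr
```

Compared with the RX ring model, there is one fewer intermediate stage: a pointer is scheduled, retired by the MAC, and then recovered.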

[0296] Embedded in the buffer is the packet length in bytes (including the Ethernet header, but not including the CRC since the TX MAC will generate that) and also the byte offset within the buffer where the Ethernet header begins. The offset is necessary since the start of packet might have been moved back (if adding encapsulation headers) or forward (if decapsulating a packet). The Ethernet header typically starts at byte offset 0x2 within that word, but the TX MAC supports arbitrary byte alignment. PKT_OFFSET and LENGTH are found in the "RX Status" and "TX Command" word of the buffer as described in Table 1; for transmit purposes those are the only two meaningful fields in that word.

[0297] The area labeled "TX Status/TX Timestamp" is optionally written with one word of transmit status plus the value of TIMER at the time the field is written, if MCSR[TX_STAT] is set; the content of that word is described in Table 2.

[0298] FIG. 9 illustrates the transmit buffer format according to certain embodiments of the present invention.

[0299] When a packet is scheduled through TX Ring 406 or 408 to be transmitted on a TX MAC 218 or 234, respectively, the TX MAC controller 222 or 232, respectively, interprets the contents of the packet buffer 840 in accordance with the format shown in FIG. 9. The RX Status Word and TX Command Word 802 is found at the location pointed to by the 2 KB-aligned buffer pointer obtained from the TX Ring 406 or 408. The RX Status and TX Command Word 802 is in the format specified by Table 1; when this word is interpreted by the TX MAC controller 222 or 232 only the fields LENGTH and PKT_OFFSET have any meaning and the rest of the word is ignored. PKT_OFFSET indicates the byte offset from the start of the 2 KB-aligned buffer at which the first byte of the Ethernet header is to be found, and LENGTH is the number of bytes to be transmitted not including the (4-byte) Ethernet CRC which the TX MAC 222 or 232 will generate and append to the packet as it is being transmitted. The RX Timestamp 804 was used by previous agents processing this buffer, and is not interpreted by the TX MAC controller 222 or 232.

[0300] The PKT_OFFSET field can legitimately have any value between (16) and (255), allowing the agent that scheduled the transmit to manipulate headers and to relocate the start of the packet header 812 as needed. FIG. 9 shows a zero-filled two-byte pad 830 prior to the start of Ether Header 812, but that is not a requirement of the preferred embodiment; the TX MAC 222 or 232 can transmit a packet which starts at any arbitrary byte alignment in the transmit buffer 840. The two-byte pad 830 shown preceding the header 812 is shown to illustrate the common case, wherein a received packet was thus aligned and any movement of the Ethernet header 812 for encapsulation or decapsulation of protocols is in units of words (4 bytes). Pad Space 810 can vary in size from zero bytes to (240) bytes as defined by the value of PKT_OFFSET in the TX Command Word 802.

[0301] The concatenation of Ether Header 812 and Payload 814 comprises the packet that is transmitted, along with the generated Ethernet CRC which the TX MAC 222 or 232 appends during transmit. The Ethernet CRC field 816 is not normally used by the TX MAC 218 or 234, but was written there during receive by the RX MAC 220 or 228. Each TX MAC controller 222 and 232 has a configuration setting which can instruct it to not generate CRC as it transmits; in that case the LENGTH field in the TX Command Word 802 includes the four bytes of Ethernet CRC, and the data in 816 is sent with the packet for use as the packet's CRC. This configuration which uses software-generated Ethernet CRC is provided primarily as a diagnostic tool for sending bad packets to other devices on the network.

[0302] Upon completion or abandonment of a transmit, the TX MAC will write back the TX Status Word 806 and the TX Timestamp 808 if it is so configured. The TX Status Word 806 contains the information and format shown in Table 2. The TX Timestamp 808 is written with the value of the Timestamp Register 214 at the time the write to TX Timestamp 808 is initiated.

[0303] The software data structure 820 which travels in the packet buffer 840 along with the packet is the same one 614 discussed in the description of an RX buffer 620 as shown in FIG. 7, and may be relocated by software convention as described in the discussion of FIG. 7.

[0304] The transmit status word 806 contains a flag indicating if the transmission was successful, and the reason for failure if the transmit was abandoned. This field is written only if MCSR[TX_STAT] is set, otherwise the fields 806 and 808 contain uninitialized data.

TABLE 2
Ethernet TX Status Word

Bits      Field            Description
[31]      TX_OK            Packet was successfully transmitted.
[30]      LATE_COL         Transmit abandoned due to a late collision (only if MCSR[LATE_COL_RTRY] == 0).
[29]      XS_COL           Transmit abandoned due to excessive collisions (16 collisions).
[28]      XS_DEFER         Transmit abandoned due to excessive deferrals.
[27]      UNDERFLOW        Transmit abandoned due to slow memory response times.
[26]      GIANT            Packet length was larger than legal.
[25:22]   COL_CNT[3:0]     Number of collisions experienced (never shows more than 15; if XS_COL this value is `x`).
[21:11]   reserved         MAC writes 0x0 to this field.
[10:0]    TX_SIZE[10:0]    Number of bytes transmitted (includes the 4-byte Ethernet CRC).
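
As an illustration of how software might interpret the TX Status Word 806, the following is a minimal Python sketch that splits a 32-bit value into the fields of Table 2. The function name, the dictionary keys, and the sample value are hypothetical and are not part of the described hardware; the sketch also assumes the word is valid, which per the text requires MCSR[TX_STAT] to be set.

```python
# Bit positions follow Table 2; all names here are illustrative only.
TX_STATUS_FLAGS = {
    "TX_OK": 31,
    "LATE_COL": 30,
    "XS_COL": 29,
    "XS_DEFER": 28,
    "UNDERFLOW": 27,
    "GIANT": 26,
}


def decode_tx_status(word):
    """Split a 32-bit TX Status Word into its Table 2 fields."""
    word &= 0xFFFFFFFF
    fields = {name: bool((word >> bit) & 1)
              for name, bit in TX_STATUS_FLAGS.items()}
    fields["COL_CNT"] = (word >> 22) & 0xF   # bits [25:22]
    fields["TX_SIZE"] = word & 0x7FF         # bits [10:0], includes the 4-byte CRC
    return fields


# Hypothetical sample: successful transmit of 64 bytes after 2 collisions.
status = decode_tx_status((1 << 31) | (2 << 22) | 64)
```

When XS_COL is set, Table 2 marks COL_CNT as a don't-care value, so a caller of such a routine would ignore that field in that case.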

[0305] There are 5 possible transmit packet sources sharing the TX MAC:

[0306] The RISC processor (Policy Processor) generating or forwarding a packet

[0307] Crypto generating a modified packet

[0308] The AP either creating, forwarding, or modifying a packet

[0309] A device in a PCI expansion slot creating, forwarding, or modifying a packet

[0310] A peer PE forwarding a packet to a different network segment (e.g. for routing or switching)

[0311] Atomic enqueueing by multiple sources is supported via writes to RTU[MTPROD] associated with that MAC's Transmit Ring. The RTU can detect high-water-mark conditions and signal the situation to the PP and the AP. The MTCONS index pointer is incremented by the MAC whenever a buffer is retired; that is chased by another consume pointer incremented by reads of RTU[MTRECOV], which is used by the PP for recovery of retired packet buffers to the buffer pool and (optionally) for checking TX status.

[0312] 4. Reclassify Rings

[0313] The Classification Engine receives packets to classify from both the RX MAC (via the RX Ring), and from other sources (PP, AP, Crypto, and potentially other network cards on the PCIbus). A second input ring (Reclassify Ring) is provided for each CE for these other sources to schedule a packet for classification on that CE; each comprises a ring in memory with enqueue and dequeue operations supported through the RTU. The 32-bit entries in the ring are buffer pointers.

[0314] FIG. 10 shows the reclassify ring structure.

[0315] The Reclassify Rings 410, 412, 414, and 416 serve a very similar purpose to the RX Rings 402 and 404, and have substantially the same structure. The substantive differences are that there is one less interim consumer-producer in the Reclassify Rings, and that packets get scheduled through the Reclassify Rings via a different path. Reclassify Rings 410, 412, 414, and 416 are used to schedule packets for processing on CE 238, 208, 242, and 212 respectively.

[0316] In the case of the RX Ring 402 or 404, buffer pointers are enqueued by the Buffer Allocation process 102 running on the Policy Processor 244 using MPROD 518, which allocates the referenced buffers as free and empty for the RX MAC 220 or 228, respectively, to consume using MFILL 516 when receiving a packet and to produce a full, unclassified buffer to the CE 238 or 242, respectively. Packets scheduled for classification via the Reclassify Rings 410, 412, 414, and 416 come from a source other than the RX MACs 220 or 228, as illustrated in FIG. 2. Full, unclassified buffers get scheduled onto one of the Reclassify Rings when an agent enqueues the buffer pointer onto the ring by writing the buffer pointer to the RTU's 264 enqueue address, which causes the RTU 264 to write the buffer pointer to the location in memory 260 referenced by RPROD 916 and then to increment RPROD 916 modulo the ring size of 4096 bytes.

[0317] From that point onward the description is substantially the same as the description of the RX Ring 402 and 404, except that RCCONS 914 is used in place of MCCONS 514, RPCONS 912 is used in place of MPCONS 512, the invalid region 902 and 908 substitutes for 500 and 508, Full and Classified 904 substitutes for 502, and Full Unclassified 906 replaces 504. Since this flow has no allocation of empty buffers there is no equivalent to MFILL 516 nor to Valid Empty 506.

[0318] Note that the "Outbound" classifiers 208 and 212 each have only a Reclassify Ring 412 and 416, respectively, but no RX Ring since they are not associated with an RX MAC.

[0319] 5. Crypto Command Queue and General Purpose Communications Rings

[0320] In order to schedule buffers for processing by the external (and optional) encryption engine another memory-based ring containing buffer pointers is implemented, with enqueue and dequeue operations supported through the RTU for the Crypto unit to get the next buffer to process, plus a status bit indicating to Crypto that there is at least one packet buffer pointer in the ring to process. The information about what operations to perform, keys, etc. are embedded in a Crypto Command Descriptor in the software area of the buffer.

[0321] FIG. 11 shows the Crypto Ring and COM[4:0] Rings Structures.

[0322] The Crypto Ring 420, COM0 Ring 422, COM1 Ring 424, COM2 Ring 426, COM3 Ring 428, and COM4 Ring 430 are identical in structure. Any agent can enqueue a buffer pointer or, in the case of the COM Rings, any 32-bit datum, by writing to the RTU's 264 enqueue address associated with the particular ring. This causes the RTU to store the buffer pointer or 32-bit datum to the location in memory 260 referenced by the specified PRODUCE Pointer 1010 and then to increment PRODUCE 1010 modulo the ring size of 4096 bytes. There is a producer-consumer relationship between a particular ring's PRODUCE pointer 1010 and that ring's CONSUME pointer 1008. When the RTU detects a difference between the values of PRODUCE 1010 and CONSUME 1008 it signals to the consuming unit that there is at least one entry to be consumed.

[0323] The consumer dequeues a 32-bit entry from one of these rings by reading from the RTU's dequeue address associated with that particular ring; this causes the RTU to return the data at the address in memory 260 referenced by that CONSUME pointer 1008 and then to increment CONSUME 1008 modulo the ring size of 4096 bytes. As is illustrated here, the degenerate case of the multiple-producer, multiple-consumer ring structure described in FIGS. 6, 8, and 10 is a single-producer, single-consumer FIFO with fifo-not-empty status presented to the consumer. The COM rings 422, 424, 426, and 428 all report ring-not-empty status and (programmably per ring) either near-full or near-empty threshold status to the Policy Processor 244 through status registers in the processor interface 206. These rings can be assigned for any purpose; anticipated uses include a message-in ring for the Policy Processor 244, a ring for allocating buffers for use by remote agents, and a ring for allocating DMA descriptors for use by remote agents scheduling this Policy Engine's DMA Unit 210.
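
The enqueue and dequeue behavior described for the Crypto and COM rings can be sketched in software as follows. This is a toy Python model under stated assumptions, not the RTU hardware: the class and method names are hypothetical, and the pointers are modeled as byte offsets that advance by one 4-byte entry, whereas the text says only that each pointer is incremented modulo the ring size of 4096 bytes. Ring-full handling and threshold status are not modeled.

```python
RING_BYTES = 4096    # ring size stated in the text
ENTRY_BYTES = 4      # each entry is a 32-bit buffer pointer or datum


class Ring:
    """Toy model of one RTU-managed ring (hypothetical, for illustration)."""

    def __init__(self):
        self.memory = [0] * (RING_BYTES // ENTRY_BYTES)
        self.produce = 0   # byte offset of the next slot to write
        self.consume = 0   # byte offset of the next slot to read

    def not_empty(self):
        # The RTU signals the consumer whenever the two pointers differ.
        return self.produce != self.consume

    def enqueue(self, datum):
        # Models a write to the ring's enqueue address.
        self.memory[self.produce // ENTRY_BYTES] = datum & 0xFFFFFFFF
        self.produce = (self.produce + ENTRY_BYTES) % RING_BYTES

    def dequeue(self):
        # Models a read from the ring's dequeue address.
        datum = self.memory[self.consume // ENTRY_BYTES]
        self.consume = (self.consume + ENTRY_BYTES) % RING_BYTES
        return datum
```

Because both operations go through a single point (the RTU in the described design), any number of agents can enqueue without coordinating among themselves; the model above reflects that only in the sense that all pointer updates happen inside the two methods.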

[0324] The Crypto Ring 420 reports ring-not-empty status to the Crypto Processor 246 through a status register in Crypto Interface 202. COM4 430 also reports ring-not-empty status through a similar location, so that COM4 430 can optionally be used to support scheduling packets for processing by a second Crypto Processor 246. The Crypto Processor Interface 202 has additional support for a second Crypto Processor 246, which might be added to provide either more bandwidth for encryption processing or additional functionality such as compression. Packets would be scheduled for processing on this second processor 246 by enqueueing their buffer pointers onto COM4 430. Alternatively, both the Crypto Ring 420 and COM4 430 can be used to schedule buffers for processing on the one Crypto processor 246.

[0325] The general purpose communication rings COM[4:0] 422, 424, 426, 428, and 430 are identical in structure to the Crypto Ring 420.

[0326] 6. DMA Command Queue and Descriptors

[0327] The DMA engine also uses a ring unit with an Enqueue register for any agent to schedule DMA transfers (DMA_PROD), a Consume register for the DMA engine to get entries from the ring (DMA_CONS), and a Dequeue register for recovering retired descriptors (and the associated buffers) from the ring (DMA_RECOV).

[0328] The DMA engine is used to move data between the memory and the PCIbus; the source/target on PCI can be host (AP) memory or another PCI device. DMA operations are scheduled by creating a 16-byte descriptor in memory and then enqueueing the address of that descriptor in the DMA engine's command ring by writing it to DMA_PROD. The PP, the host, a PCI bus peer, and Crypto can atomically schedule use of this engine.

[0329] DMA is notified by the RTU when the Produce pointer is not equal to the Consume pointer and processes the next descriptor. When that descriptor is retired, DMA increments the Consume pointer; a delta between that and the Recover pointer causes the RTU to signal to the PP that there are DMA descriptors (and the associated buffer pointers) to recover.
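
The three-pointer chase described above (Produce, Consume, Recover) can be sketched as a small Python model. The class and method names are hypothetical, the pointers are modeled as entry indices rather than hardware registers, and the ring depth of 1024 entries is an assumption carried over from the 4096-byte rings of 32-bit entries described for the other rings.

```python
RING_ENTRIES = 1024   # assumed depth: 4096 bytes of 32-bit entries


class DmaRing:
    """Toy model of the DMA command ring (hypothetical, for illustration)."""

    def __init__(self):
        self.slots = [None] * RING_ENTRIES
        self.produce = 0   # advanced by writes to DMA_PROD
        self.consume = 0   # advanced by the DMA engine when a descriptor retires
        self.recover = 0   # advanced by reads of DMA_RECOV

    def schedule(self, descriptor_addr):
        # Any agent enqueues the address of a 16-byte descriptor.
        self.slots[self.produce] = descriptor_addr
        self.produce = (self.produce + 1) % RING_ENTRIES

    def work_pending(self):
        # Produce != Consume: the DMA engine has a descriptor to process.
        return self.produce != self.consume

    def retire_next(self):
        # The DMA engine finishes the next descriptor and advances Consume.
        addr = self.slots[self.consume]
        self.consume = (self.consume + 1) % RING_ENTRIES
        return addr

    def recover_pending(self):
        # Consume != Recover: retired descriptors await recovery by the PP.
        return self.consume != self.recover

    def recover_next(self):
        # The PP reclaims the oldest retired descriptor.
        addr = self.slots[self.recover]
        self.recover = (self.recover + 1) % RING_ENTRIES
        return addr
```

In this model an entry passes through two regions of the ring in turn: scheduled but not yet processed (between Consume and Produce), then retired but not yet recovered (between Recover and Consume).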

TABLE 3
DMA Descriptor Format

PCI_Address[31:00]
Flags[31:0]
S1[31:27] | Buf_Address[26:11] | S2[10:0] (pointer tag field)
S3[15:11] | Buf_Start_Index[10:2] | 0b00 | Word_Count[15:0]

[0330] The areas labeled "S2" and "S3" are available for software use. "S1" is reserved for future expansion of PE memory size.

[0331] Upon completion of a transfer, the DMA engine can optionally set a completion status bit in either the Host Interrupt Register or Processor Interrupt Status Register in case the initiating agent wants completion status of a transfer or group of transfers. 8 bits are provided in each so that transfers can be tagged as desired. This allows both AP and PP software to have up to 8 DMA completion events scheduled at one time for tracking when particular groups of transfers have completed, or for the PP to signal to the AP that information has been pushed up to a mailbox or communication ring in AP memory, or for similar signals from the AP to the PP.

[0332] The Packet Buffer Address field contains the packet buffer pointer in the same format that is used by all other agents in the Policy Engine; this means that bits [10:0] are ignored by hardware and might contain tag information. The actual memory word address is the concatenation of the 2 KB-aligned Packet_Buffer_Address[31:11] with Start_Index[10:2], with 00 in the lower two bits. Note that the Word_Count allows for a maximum DMA transfer of (64K-1 Words, or 256K-4 Bytes), in case there are transfers larger than normal packet buffer movement (e.g. moving down PP code or CE microcode).
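
The address concatenation and the transfer-size limit can be checked with a short Python sketch. The function and constant names are hypothetical; the sketch follows the paragraph's wording (bits [31:11] of the buffer pointer, a 9-bit word index, and two zero low-order bits) and makes no claim about how the hardware forms the address internally.

```python
def dma_memory_address(buffer_pointer, start_index_words):
    """Return the byte address of the first 32-bit word to be transferred.

    buffer_pointer: packet buffer pointer; bits [10:0] may hold tag
        information and are discarded here.
    start_index_words: the 9-bit value of Start_Index[10:2], that is, a
        word index within the 2 KB buffer.
    """
    if not 0 <= start_index_words < 512:
        raise ValueError("Start_Index[10:2] is a 9-bit field")
    base = buffer_pointer & 0xFFFFF800       # keep bits [31:11]
    return base | (start_index_words << 2)   # low two bits remain 00


# Word_Count is 16 bits wide, so the largest transfer is 64K-1 words.
MAX_WORDS = (1 << 16) - 1
MAX_BYTES = MAX_WORDS * 4
```

For example, a pointer of 0x000127FF (tag bits all set) with a word index of 3 resolves to 0x0001200C, and MAX_BYTES evaluates to 262140, which is 256K-4.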

[0333] The Flags word contains the following fields:

TABLE 3a
DMA Descriptor "Flags" Word

Bits      Field            Description
[31:21]   SOFT[10:0]       Available for software use.
[20]      TO_MEM           Direction: 1 == To Memory (From PCI), 0 == From Memory (To PCI).
[19:16]   PCI_CMD[3:0]     This is the PCI command code which is used on the PCI bus for these transactions; the most common codes will be 0x7 (Memory Write) and 0x6 (Memory Read), with some probability of also using 0xC (Memory Read Multiple) and 0xE (Memory Read Line) if the attached host uses them for prefetch directives.
[15:08]   SET_HISR[7:0]    Any bit that is set will set the corresponding status bit in the HISR upon retirement of this descriptor. If no bit is set, no status is sent to HISR.
[07:00]   SET_PISR[7:0]    Any bit that is set will set the corresponding status bit in the PISR upon retirement of this descriptor. If no bit is set, no status is sent to PISR.
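
A minimal Python sketch of packing and unpacking the Flags word according to Table 3a follows. The function names, constant names, and the sample descriptor are hypothetical. Pairing a From Memory (To PCI) transfer with the Memory Write command in the sample is an assumption based on the direction of the data, not something the table states.

```python
PCI_MEMORY_READ = 0x6    # command codes as listed in Table 3a
PCI_MEMORY_WRITE = 0x7


def pack_dma_flags(to_mem, pci_cmd, set_hisr=0, set_pisr=0, soft=0):
    """Build the 32-bit Flags word from its Table 3a fields."""
    if not (0 <= pci_cmd <= 0xF and 0 <= set_hisr <= 0xFF
            and 0 <= set_pisr <= 0xFF and 0 <= soft <= 0x7FF):
        raise ValueError("field out of range")
    return ((soft << 21) | (int(bool(to_mem)) << 20) | (pci_cmd << 16)
            | (set_hisr << 8) | set_pisr)


def unpack_dma_flags(word):
    """Split a 32-bit Flags word back into its fields."""
    return {
        "SOFT": (word >> 21) & 0x7FF,      # bits [31:21]
        "TO_MEM": bool((word >> 20) & 1),  # bit [20]
        "PCI_CMD": (word >> 16) & 0xF,     # bits [19:16]
        "SET_HISR": (word >> 8) & 0xFF,    # bits [15:08]
        "SET_PISR": word & 0xFF,           # bits [07:00]
    }


# Hypothetical sample: push data from PE memory to PCI and set PISR bit 0
# when the descriptor retires.
flags = pack_dma_flags(False, PCI_MEMORY_WRITE, set_pisr=0x01)
```

Since each of SET_HISR and SET_PISR is a bit mask, software can tag up to eight independent completion events per register, matching the eight status bits described earlier.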

[0334] Since DMA descriptors are read from memory by the DMA engine, software must ensure either that the descriptors are non-cacheable by the processor, or that they are flushed from the PP cache prior to writing the descriptor's address to the DMA ring. For descriptors that are generated by the AP or by a PCI peer, see "Endianness" in section 8 for details about descriptor endianness.

[0335] FIG. 12 shows the DMA Ring Structure.

[0336] The DMA Ring 418 is substantially the same as the TX Rings 406 and 408 as described in FIG. 8. There is a single enqueue index DMA_PROD 1116
used to schedule pointers on the ring 418 by any agent, and interim