United States Patent5581778
Chin , ; et al.December 3, 1996

Title

Advanced massively parallel computer using a field of the instruction to selectively enable the profiling counter to increase its value in response to the system clock

Abstract

A parallel computing system comprising N blocks of processors, where N is an integer greater than 1. Each block of the N blocks of processors contains M processors, where M is an integer greater than 1. Each processor includes an arithmetic logic unit (ALU), a local memory and an input/output (I/O) interface. The computing system also contains a control means, connected to each of the M processors, for providing identical instructions to each of the M processors, and a host means, coupled to each of the control means within the N blocks of processors. The host means selectively organizes the control means of each of the N blocks of M processors into at least two groups of P blocks of M processors, P being an integer less than or equal to N. In operation, the host means causes the control means within each group of P blocks of M processors to provide each group of P blocks of M processors respectively different identical processor instructions. To facilitate communications amongst the processors, the parallel computing system includes an interprocessor communications channel that selectively interconnects the processors.


Inventors:Chin; Danny (West Windsor Township, NJ), Peters, Jr.; Joseph E.  (East Brunswick, NJ), Taylor, Jr.; Herbert H.  (Hopewell Township, NJ)
Assignee:David Sarnoff Researach Center (Princeton, NJ)
Appl. No.:416932
Filed:April 4, 1995

Current U.S. Class:712/16 712/245 
Field of Search:395/800,727,375,775 364/DIG.1,DIG.2

U.S. Patent Documents
4344134August 1982Barnes
4382295May 1983Moffitt et al.
4414624November 1983Summer, Jr. et al.
4598400July 1986Hillis
4608631August 1986Stiffler et al.
4644496February 1987Andrews
4667305May 1987Dill et al.
4736363April 1988Aubin et al.
4740954April 1988Cotton et al.
4773038September 1988Hillis et al.
4783738November 1988Li et al.
4809362February 1989Claus et al.
4837676June 1989Rosman
4866668September 1989Edmonds et al.
4893234January 1990Davidson et al.
4965718October 1990George et al.
5018133May 1991Tsukakoshi et al.
5088091February 1992Schroeder et al.
5105424April 1992Flaig et al.
5109379April 1992Kume et al.
5113523May 1992Colley et al.
5117430May 1992Berglund
5121502June 1992Rau et al.
5125076June 1992Faber et al.
5129077July 1992Hillis
5165009November 1992Watanable et al.
5175865December 1992Hillis
5197130March 1993Chen et al.
5202987April 1993Bayer et al.
5257629November 1993Kitney et al.
5291489March 1994Morgan et al.
5329471July 1994Swoboda et al.
5388220February 1995Okabayashi
5452425September 1995Childers et al.
Re34100October 1992Hartness
Foreign Patent Documents
0360527Mar., 1990EP
0460599Dec., 1991EP
Other References
Conpar 88, Manchester, UK, 12-16 Sep. 1988, ISBN 0-521-37177-5, 1989, Cambridge, UK, Cambridge University Press, UK, pp. 10-17, Giloi Wolfgang K., "The Suprenum Architecture". .
Proc. of the Symposium on Frontiers of Massively Parallel Comp. Oct. 8-18, 1990, No. Symp. 3, 8, pp. 196-203, Bridges T, "The GPA Machine: Agenerally Partitionable MSIMD Architecture". .
D. Chin et al "The Princeton Engine: A Real-Time Video System Simulator" IEEE Trans. Cons. Electronics v34, 1-12 (1988). .
R. Alverson et al "The Tera Computer System" 1990 International Conference On Supercomputing Jun. 11-15, 1990 Amsterdam, Netherlands. .
A. Fischer "Scan Line Array Processors For Image Computation" Computer Architecture News vol. 14, No. 2, Jun. 2-5, 1986..~
Primary Examiner: An; Meng-Ai T.
Attorney, Agent or Firm:Burke; W. J.

Government Interests:


This invention was made with Government support under Contract No.
MDA-972-90-C-0022. The Government has certain rights in the invention.

Parent Case Text



CROSS REFERENCE TO RELATED APPLICATION

This application is a continuation of application Ser. No. 08/091,935 entitled ADVANCED MASSIVELY PARALLEL COMPUTER APPARATUS filed on Jul. 14, 1993 now abandoned, which is a continuation-in-part of U.S. application Ser. No. 07/926,265 filed Aug. 5, 1992 now abandoned.

Claims


The invention claimed is:
1. A parallel computing system comprising:
a plurality of processors, wherein each processor comprising:
a source of a system clock signal,
an arithmetic logic unit (ALU),
means for indicating local data conditions in the processor;
a local memory,
an input/output (I/0) interface, and
a profiling counter having a counter value which is incremented in response to the system clock signal, when the profiling counter is enabled;
control means, coupled to said plurality of processors, for providing identical processor instructions to each of the plurality of processors; and
host means, coupled to said control means, for providing control instructions and processor instructions to the control means;
wherein each of said processor instructions and each of said control instructions include a field which is used to selectively enable and disable the profiling counters in the plurality of processors.

2. The system of claim 1 wherein:
the control means is responsive to a first control instruction to load a profiling counter of the control means with either an immediate count value or a count value obtained from a data register coupled to the control means, and to a second control instruction to store the count value into the data register; and
each of said plurality of processors is responsive to a first processor instruction to load the profiling counter of the processor with either an immediate count value or a count value obtained from the local memory of each of said plurality of processors, and to a second processor instruction to store the count value into the local memory.

Description

This invention relates to massively-parallel computer apparatus and, more particularly, to such apparatus capable of providing multiuser time-shared operation thereof.

BACKGROUND OF THE INVENTION

While both large sequentially-operating and parallel-operating supercomputers are known in the art, massively-parallel operation is to be preferred for those computationally-intensive applications which require a vast amount of data computation and data communication to be carried out in real time. Examples of such applications include weather modeling and medical imaging. Real-time analysis of such complex scenarios encountered by such applications operate on very large data sets.

The growth of advanced problem size has been such that the maximum rate of communication and computation of these prior-art massively-parallel supercomputers, including the Princeton Engine (PE), is insufficient to provide real-time solutions therefor. Therefore, there is a need for a larger massively-parallel supercomputer which would meet both the bandwidth and computation requirements (an I/O bandwidth up to 1200 MBytes/sec, and a peak computational rate up to 9.6 Teraops/sec) needed to provide solutions to such computationally-intensive problems. Further, although sequential supercomputers are capable of time-shared multi-user operation, the design of all prior-art massively-parallel supercomputers does not permit them to have this capability.

SUMMARY OF THE INVENTION

A parallel computing system is described which is arranged as N blocks, each containing M processors. Each processor has an arithmetic and logic unit (ALU), a local memory and an input/output (I/O) interface. Each block also includes a controller which is coupled to provide a group of identical instructions to each of the M processors in the block. The parallel computing system also includes a host processor which is coupled to several of the control means of the N blocks. The host processor partitions these blocks into at least first and second groups of blocks, each group including P blocks. For each group of P blocks, a respectively different group of identical processor instructions are provided to each of the P times M processors by the host processor.

BRIEF DESCRIPTION OF THE DRAWING

FIG. 1 illustrates how video frames are distributed over the memories of a prior-art Princeton Engine (PE);

FIG. 2 illustrates resources in the prior-art PE which permit a host computer to have digital access to load or capture data on controller busses for system or algorithmic testing purposes;

FIG. 3 is a high level view of a parallel computing system of the invention (the SE);

FIG. 4 is an expansion of an engine block (EB) showing the interconnection of the hosts, controllers, processors, local memories, and I/O functions of the SE;

FIG. 5 illustrates the physical arrangement of system modules;

FIG. 6 shows the processor organization of the SE;

FIG. 7 shows the use of a Stride Register of the SE;

FIG. 8 shows an example of the Modulo Arithmetic Mode of the SE;

FIG. 9 shows an example of the Bounding Mode of the SE;

FIG. 10 is a resource usage table for the SE processor;

FIG. 11 illustrates a match example of two packed data words;

FIG. 12 illustrates a match sequence and the corresponding templates;

FIG. 13 illustrates matches found between match and data sequences;

FIG. 14 illustrates an example of conditional locking;

FIG. 15 illustrates 4 different modes of a processor instruction word;

FIG. 16 illustrates 4 different examples of IPC operation;

FIG. 17 illustrates input slices (4 slices per chip) of an Input/Output Memory Controller (IOMC) of the SE;

FIG. 18 illustrates output slices (4 slices per chip) of an IOMC;

FIG. 18a is a block diagram of exemplary image vault interface circuitry;

FIG. 19 illustrates data I/O data formats;

FIG. 20 illustrates video data formats;

FIG. 21 illustrates data input captured by input FIFO (first-in-first-out);

FIG. 22 illustrates an input timing sequence example;

FIG. 23 illustrates 2 schemes for processor handling of multiple pixels;

FIG. 24 illustrates the transfer of data from input FIFO to local memory;

FIG. 25 illustrates a FIFO input timing sequence example;

FIG. 26 illustrates the loading of a data output channel with output FIFO data;

FIG. 27 illustrates the transfer of data from local memory to output FIFO;

FIGS. 27a, 27b, 27c, 27d, 27e, 27f, 27g, 27h, and 27i drawings of arrays of memory locations which are useful for describing the operation of the input and output FIFOs;

FIG. 28 illustrates a local OR (LOR) bus;

FIG. 29 illustrates a controller synchronization switch;

FIG. 30 illustrates a conceptual grouping of controllers;

FIG. 31 illustrates a synchronization switch configuration for controllers;

FIG. 32 illustrates a barrier synchronization example;

FIG. 33 illustrates operating system components.

DESCRIPTION OF THE PREFERRED EMBODIMENTS

To facilitate the description of the invention, an alphabetic listing of some acronyms employed herein is appended hereto.

The prior-art Princeton Engine (PE) architecture is a single-instruction multiple data (SIMD) linear array of processors. The linear array can be scaled from 64 to 2048 processors in steps of 64 and, in a full configuration, achieves a computational rate of 28,672 MIPS for an instruction clock of 14 MHz. Each processor has a local memory and can communicate with its neighbors via two bidirectional channels. Input and output data rates of 14 and 1.8 Gbps, respectively, are provided. The PE host is an Apollo/Mentor Graphics workstation, and high-resolution monitors are used for observation of output results.

Each processing element PE0 to PEn-1 of the PE contains seven independent, internal 16-bit data paths, a 16-bit ALU, a 16-bit multiplier, a triple-ported register stack with 64 elements; a 16-bit communications port; and up to 640K Bytes of external SRAM local memory. The register file has one address port for read-only access to that file and a second address port for read or write access to that file. An interprocessor communications bus (IPC) permits exchanges of data between neighboring processors during one instruction cycle. On each instruction cycle, up to six simultaneous operations can take place (input or output via the I/O bus, simultaneous read and write at the register file, one multiplication, one ALU operation, and a local memory access).

Input data is stored as one pixel per processor in each processor's local memory M0 to Mn-1 for each scan line 0 to v-1 of video. Thus, over a frame period, one pixel column of a video frame is stored in each local memory. The local memory is sufficient to store up to 640 columns of 8-bit pixels for a 1024 line frame. Functional diagram 100 of FIG. 1 illustrates how video frames are distributed over the local memories. Each corresponding column of a video frame sequence 0 to z-1 is stored in the same local memory. Thus, temporal algorithms do not require communications among processors; simple memory addressing is sufficient. Neighboring columns of spatial data are stored in the local memory of neighboring processors. Horizontal filters and statistical gathering operations require data communications between the processors via IPC 102.

The IPC can be set in one of four modes: normal, bypassed, broadcast send, and broadcast receive. Normal communication is between neighbors within the linearly connected array. Data is loaded onto the IPC channel in one instruction, and shifted left or right on the next instruction. This mode is very efficient for nearest-neighbor computations.

In some cases it is desirable to perform neighborhood operations on a subgrid of the original array. This decimation can be achieved without compressing array elements into a smaller connected region. Rather, processors are bypassed, providing new neighborhood connections among the desired domain. Left and right shift operations traverse the bypassed interconnection pattern.

In FIG. 2, the PE is interfaced to analog and digital sources and destinations through the controller cabinet 200. The input and output data channels to the parallel array are 48 bits and 64 bits wide, respectively. These channels are clocked at 28 MHz and interface 6 analog-to-digital converters (ADCs) and 7 digital-to-analog converters (DACs). The host computer has digital access to load or capture data on these busses for system or algorithmic testing purposes.

The controller 200 also provides user-selectable clocks for the ADCs and DACs. Up to three independent input clocks and four independent output clocks are possible. This capability permits several different data sources to be simultaneously read, processed, displayed, and compared. The outputs may be taken to a variety of displays: a spectrum analyzer, or even back into the user's existing real-time system hardware for embedded applications.

Output from parallel processor 202 is user-programmable through a special output, multi-port, RAM structure 204 embedded within a bit-slice I/O IC. Local memory accesses are reduced by this unique output architecture. The output data stream may further be routed back to the input of the parallel array for additional processing. This feature provides a real-time transpose capability useful in radar processing (comer turn) and for fast rotation of large 3D data sets.

The SE has 32-bit processors, with 15 independent programmable operations per instruction and double the memory bandwidth (two local memory ports per processor). The total number of processors in a full system is 8192 and each processor is designed to operate with a 100 MHz clock (10 ns instruction cycle), for a computational data rate of 819 MIPS and 9.6.times.1012 operations/sec. Another major improvement is that the SE has Multiple Instruction Multiple Data (MIMD) capabilities; there is a controller for every 64 processors, with each controller able to broadcast a different instruction stream to its processors. This architecture organization provides up to 128 MIMD instruction streams, with hardware support for synchronization between controllers. Data I/O is more robust on the SE. Whereas real-time PE programs operated on a single line of an input image, real-time SE programs operate on an entire image. The SE also can operate in a multiuser mode where the system can be configured to time-share the machine to support several real-time and non-real-time applications without interference between applications. The system can also be reconfigured into several smaller systems to run applications. FIG. 3 provides a high level view of the machine organization.

For a system clock of 100 Mhz, the controller functions must be integrated and contained in close proximity to the processors. Controller 300 is responsible for broadcasting instructions to processing elements 302 and maintaining information on processes and signals. Each controller contains an instruction memory and a microsequencer that dictates program control flow. Information on the active processes is maintained in a process control memory. The use of redundant slices that contain a number of processors 302, local memories 304, I/O functions 306, and controller functions further satisfy the need to serve multiusers and to provide MIMD capability.

The shaded section of FIG. 3 is a redundant slice of the EB of the SE. The EB consists of 64 processors, their respective local memories, I/O functions, and the controller functions including an interface to a host workstation 308. The EB will physically consist of a multichip module containing one Controller IC, a program memory module, 16 processor ICs, 16 local memory modules, and 16 IOMC ICs. The processor and IOMC functions are described below.

FIG. 3 also shows the connection of the processors to the image vault 320. The image vault is a large secondary storage array which may be accessed from the IOMC of each processor. In the exemplary embodiment, the image vault 320 is implemented as a distributed disk storage, having terabyte capacity at the system level and megabyte capacity at the processor level. Data transfer rates of up to 32 gigabytes per second are provided at the system level by providing a data rate of 4 megabytes per second for each processor. The image vault may be used to store relatively long image sequences or large databases.

FIG. 4 is an expansion of the EB showing the interconnection of the hosts, controllers, processors, local memories, and I/O functions for up to 128 slices. When the system is reconfigured into smaller systems, each subsystem has a host workstation 400 assigned to it, and each HIO bus remains local to its subsystem. When the full SE is used, only the leftmost host workstation/VME bus is active, and the HIO buses for each slice are connected together in serial. The global OR (GOR), local OR (LOR), and neighboring OR (NOR) buses are used for synchronizing sets of controllers.

Processors are connected to each other in a linear array via the IPC. This architecture allows for linear increases in processing power with no overhead penalty as more processors are added to the system. Fault tolerance is an important issue at the level of parallelism exploited in the SE and is provided through the approach used in the PE. In this approach, faulty processors or memories can be switched out of the system by bypass operations on the IPC.

All processors within an EB operate in a SIMD mode with a 128-bit instruction word being broadcast to the processors. Different actions can be carried out on these processors through a conditional locking capability. All processor I/O is memory-mapped and it is the responsibility of the IOMC to transfer data between the local memory and I/O sources.

Hardware support for profiling and debugging have been incorporated into the design of the SE. Each processor and controller has a dedicated profiling counter and the controller contains a debug interrupt mechanism.

In FIG. 5 the SE is physically composed of system modules 500 that are shaped as hexagons, 20 inches to a side, and 8 inches thick. Each module 500 contains 16 EBs, associated power supplies, a coolant input, and a coolant exhaust. Each EB comprises 64 processors and their respective local memories and I/O functions, and the EB controller functions. One EB will be packaged using sixteen multichip modules using advanced memory fabrication techniques. Each system module is self contained and can function as a 1024 processor machine or as sixteen 64 processor machines. The system modules can be stacked vertically (8 high)to realize a full 8192 processor machine.

In FIG. 6 the processor for the SE may be implemented on an IC chip preferably containing 4 processors using BiCMOS technology, and has a 10 nanosecond instruction cycle. The processor operates on a 128-bit instruction word received from the controller. The instruction word specifies 15 independently programmable operations. The processor uses 32-bit data paths and registers and some data paths and registers can be paired for transferring and storing 64-bit data. Furthermore, some resources, such as the ALU, register file, and local memory can operate on 64-bit inputs.

Each processor, shown in FIG. 6, has a 64-bit ALU 600, a 32-bit multiplier 602, a 32-bit matcher 604, a 32-bit auxiliary ALU 606, a 128 word register file 608, a dual port local memory that is addressed by two address generators 610-1 and 610-2, IPC ports 612 for communicating with other processors, conditional locking hardware 614, and a dedicated profiling counter 616.

To maximize the number of operations per instruction cycle, the integer and floating point multiplier and ALU units are unified. Many processors have separate integer and floating point ALUs, and parallelism is realized since much of the computation is executed in floating point data format, while the integer ALU is used for memory addressing. Since the SE has two dedicated address generators, and since floating point and integer operations are usually not computed at the same time, the integer and floating point units were grouped together to save IC area for other resources.

The multiplier can multiply two 32-bit values and produce a 64-bit result on each instruction cycle. The multiplier supports IEEE floating point, signed and unsigned integer operations. The result is stored in the 64-bit P register, which is an input to the ALU so that products can be accumulated; this is efficient for DSP operations such as FIR filters. Alternatively, the multiplier can treat the two 32-bit input values as a 64-bit value, and load the P register with the 64-bit word. This is useful for supplying the ALU with 64-bit data.

The matching unit 604 is included in the processor design because it is ideal for data intensive operations. To economize the Processor Instruction Word, the multiplier 602 and matcher 604 share the same instruction field. The matcher 604 is a specialized hardware component that executes matching operations on 32-bit packed data. Data is packed when smaller word sizes are formatted into a single 32-bit word. The packed word size is stored in the B register legal values are 1-8, 16 and 32. The number of matches are stored in the P register, so that they may be accumulated in the ALU if necessary. The packed data matching feature is extremely efficient for comparing long data sequences stored in local memory, without the need to unpack the data.

The matcher 604 and multiplier 602 share the same data inputs, output, and instruction word field. Thus, the use of the multiplier 602 and matcher 604 are mutually exclusive; when one resource is active, the other is inactive. (The matcher 604
has a NOP instruction so the case of two inactive units is possible.)

The ALU has 32-bit and 64-bit inputs, and has two 64-bit accumulators. It supports single cycle integer and IEEE floating point (32-bit and 64-bit) operations. The accumulators are also inputs to the ALU, and can be used for storing intermediate values of a computation. The P register and accumulators serve as 64-bit inputs to the ALU; all other data sources are 32-bit sources.

ALU operations include the regular 32-bit and 64-bit arithmetic and logic operations, shifting operations, and integer/floating point conversion operations. A multiple cycle integer divide operation is also supported. There are conditional operations supported such as conditional subtraction, and Update ACC1 if Zero/NonZero (used to implement a conditional write.) Special purpose operations include a MAXMIN binary operation that stores the larger value in ACC1 and the smaller value in ACC2, find-first-zero-bit and find-first-one-bit unary operations, and absolute value.

An Auxiliary ALU (AuxALU) is used for 32-bit counting operations. The AuxALU is included in the processor design since counting operations are very common for image processing applications. A speedup of a factor of six is achieved for conditional counting operations, since the extra ALU allows the counting operations to be pipelined. The AuxALU can increment a value conditionally based on a result of the ALU (stored in the Processor Status Word). For example, counting the number of values above a threshold value is implemented by comparing values with the threshold value in the ALU, and conditionally incrementing the AuxALU count based on the ALU result.

The AuxALU is located near the RI1 port of the register file. The AuxALU has two registers: the AuxALU Data Register (ADR) and the AuxALU Condition Mask Register (ACMR). The ADR contains the AuxALU operand, and the ACMR contains the Processor Status Word mask for monitoring conditions.

A special function of the AuxALU is to decrement the ADR value and lock the processor on a zero result. This operation can be used for operations whose execution time is data dependent. As each processor finishes its operation, it decrements the value to zero, locks itself (performs NOPs), and asserts its LOR signal to signal the controller it has finished. When all processors have completed the operation and asserted their LOR signals, the controller unlocks all of the processors, and execution continues. This operation is useful for implementing loops that are dependent on local data conditions on a group of SIMD processors. The LOR is a 1-bit wire that connects the processors to the controller; the LOR signal is low until all processors assert a high signal, which then raises the LOR signal high.

A 32-bit dedicated profiling counter 616 (shown in FIG. 6) is on each processor for real-time profiling. In addition, each controller includes a dedicated profiling counter 3301 (shown in FIG. 33) which is used for real-time profiling. Profiling is usually implemented by adding additional instructions to the original program to count occurrences of events. This type of profiling is not possible in the real-time mode since some program segments are critically timed, such as communication through the IPC. The dedicated profiling counters are used to perform profiling without interfering with processor execution.

Each of the processor profiling counters 616 and the controller profiling counters 3301 is controlled by two-bits in the corresponding controller or processor instruction word to perform one of four functions: load counter value, start counter, stop counter and reset counter. The function of reading the count value from the counter is controlled as a write operation to the register which is to receive the result. In the processor instruction word formats shown in FIGS. 15a through 15d, the profiling counter control field is shown as the two-bit field PCC.

In addition, the number of instructions encountered before the profiling counters increments may be modified by setting a two-bit field in the Processor Status Word (PSW). The four states of this two-bit field may be used to cause the profiling counter 616 to increment on every instruction, or on every four, 16 or 64 instructions, respectively. The PSW is described in more detail below.

Each processor has a 128 word (32-bit words) register file (RF) 608 (shown in FIG. 6). On each instruction cycle, up to four reads and two writes can be executed, which provides the bandwidth necessary for keeping the functional units active, thereby increasing the on-chip parallelism, and reducing the memory access bottleneck. The RF 608 has two input ports (RI1, RI2) and four output ports, which are directed to the registers RO1-RO4. On each instruction cycle, two 32-bit words can be written to the RF 608 and four 32-bit words can be read from the RF 608. Register pairs [RO1,RO2] and [RO3,RO4] can also be used as 64-bit register pairs for other processor resources.

Each processor has an 8 megaword, dual port, DRAM memory. The controller refreshes the local memory via a bit in the processor instruction word. Since each processor has its own local memory, there is no contention for memory among processors. On each instruction cycle, two 32-bit word memory accesses can be executed, which doubles the memory bandwidth for processor computation, thereby reducing the memory bottleneck. A 64-bit value can be accessed by reading/writing the upper and lower words simultaneously. The memory size is large enough for a group of 64 processors to store 2 gigabytes, or 64 8K.times.8K images. The organization of data across local memories is the same as in FIG. 1.

Two address generators, one for each memory port of each processor, perform address arithmetic operations so that the main ALU is not used for addressing operations. The address generators have special addressing modes that eliminate the need to check and/or enforce certain conditions on array accesses, thus increasing the efficiency. The address generator performs hardware bound checking on arrays, and can compute stride updates for array accesses. Additionally, some special boundary conditions can be specified for arrays.

The address generators share 6 sets of addressing registers they are the User Base, User Limit, Base, Limit, Offset and Stride registers. The User Base and User Limit register sets delimit the local memory space used by the program for each of the 8 memory banks. All program references (with the exception of system call addresses) are relative to the User Base values, which makes the program data relocatable. An access violation signal is sent to the controller if a generated address is outside the User Base and User Limit bounds. Sixteen Base registers are used to define the starting locations for aggregate data such as arrays, tables, and records (structures), and 16 Limit registers define the data bounds. Bounds checking hardware verifies that an array access is within bounds. The 8 Offset registers and 8 Stride registers provide an efficient means for repeatedly accessing an array in regular strides. After a base-offset pair is used to compute the address of the array element to be accessed, the value of the Stride register is used to update the Offset register value, setting up the next array access.

There are 4 address arithmetic modes for array access: normal mode, modulo arithmetic mode, bounding mode, and butterfly arithmetic mode. In the normal mode, when a access is out of bounds, the Out of Bounds (OOB) bit is set in the processor status word. The modulo arithmetic mode maps an out of bound array reference to an array location using modulo arithmetic. In the bounding mode, when an array access is out of bounds, the offset value is substituted with the offset of a user-specified boundary constant. The butterfly arithmetic mode is used to generate addresses for FFT butterfly pairs, where the FFT is stored as a single dimension array.

The IPC bus interface is integrated into the processor chip for efficient, low latency communication between processors.

Each processor has conditional locking hardware that provides processors with a conditional locking execution mechanism that allows SIMD processors to execute in a MIMD manner by allowing processors to lock themselves by conditionally executing code based on local data conditions. The processor execution state is defined as `locked` when the processor performs NOPs (no-operation) instead of the instructions being sent to it from the instruction sequencer on the controller. The processor continues to execute NOPs until the instruction to unlock is encountered in the processor instruction word. When a processor is locked, the IPC is still active, and certain bookkeeping operations are still executed by the processor to determine when it should unlock.

The instructions that lock and unlock the processor occur within a structured segment, where there is a `begin` and `end` statement. These segments are similar to if-then-else constructs and can be nested. Decisions to lock and unlock always pertain to the most closely nested construct. Conditional locking code involves no change in control flow. The instructions are broadcast serially from the controller, and the processors select which code to execute based on locking instructions and local data conditions. Conditional locking information is stored in the processor status word. Instructions to save and restore the context are supported for servicing interrupts, which require all processors to be unlocked.

The address generator is the processor component that computes addresses for accessing the local memory. It provides all of the basic addressing modes plus additional operations for efficiently computing regular array accesses. There are two address generators per processor; each local memory port has a dedicated address generator.

The address generators use a common set of registers to access memory. There are 8 User Base Registers (UB0-UB7), 8 User Limit Registers (UL0-UL7), 1 Bank Select Register (BSR), 16 Base Registers (BR0-BR15), 16 Limit Registers (LR0-LR15), 8
Offset Registers (OR0-OR7), and 8 Stride Registers (SR0-SR7).

The User Base and User Limit Registers are used to delimit the program data for the eight banks of local memory. Data for a program must be stored contiguously in each bank. The BSR is a three bit register used to determine which memory bank is active. The 16 BSRs and LSRs are used to delimit array data structures. All indexing into an array is relative to the BR, and the LR is used by the address generator to determine if a reference into the array structure is out of bounds. The 8 ORs are used to point at a specific location within an array, and the 8 SRs are used to update the offset value by the contents of the SR.

The address word has the following format (1) Absolute/UB-Relative Addressing, (3) Bank Select and (20) Memory Bank Address.

The address generators operate on a 23-bit address; the most significant 3 bits specify the bank of memory, and the lower 20 bits specify a word in the megaword (32-bit words) of bank memory. Since the addresses are stored in 32-bit locations, there are 9 additional bits that are not used for addressing, some of which carry additional information.

One bit is used to determine whether the address is Absolute or UB Relative to the User Base value. UB-Relative Addressing is used when accessing program data. This implementation makes the program data relocatable. An Access Violation occurs if a UB Relative address is greater than the User Limit value or less than zero. In the Absolute Addressing mode, the address is not added to the User Base value. This mode is used for accessing shared system information, which is stored in low local memory.

The address generators use the BR0-BR15 and the LR0-LR15. The base registers define starting locations for aggregate data such as arrays, tables and structures, and the limit registers define the addressing bounds of the aggregate data. This allows the hardware to perform bounds checking on each memory access at run-time. BR and LR control is constrained so that BRx must be used with LRx. Only the lower eight BR0-BR7 and LR0-LR7 can be used in base-limit-offset-stride (BLOS) operations, described below. The base registers contains a 24 bit value: 20 bits for the address, 1 bit for Absolute/UB-Relative addressing, and a 3 bit field to specify the base register memory bank. The limit registers contain a 20 bit bounding offset for the BRs.

The address generator also contains 8 21-bit offset registers (OR0-OR7) and 8 20-bit stride registers (SR0-SR7). These registers provide an efficient means for repeatedly accessing an array in regular strides. After a base-offset pair (BRx, ORx) is used to compute the address of the array element to be accessed, the value of the SRx is used to update the offset register, thereby setting up the next array access. In addition to the SR0-SR7, the hardwired constants 0, +1 and -1 are available as stride values. The OR value is automatically updated by the stride register value, so if no offset update is needed, a stride of zero is specified. If the new offset value is out-of-bounds, the OOB bit is set in the processor status word (PSW). Only the lower eight BR0-BR7 and LR0-LR7 are used in BLOS operations, and the hardware control is constrained so that BRx must be used with LRx, ORx and SRx. The SRs hold a 21-bit 2s complement value, and the offset registers hold a positive 20-bit value.

As an example of the use of an SR, consider FIG. 7. In this example, the offset is initially 2 and the stride value is 3. Successive array accesses are shaded. An address is generated on every instruction cycle.

There are six addressing modes available: Immediate, Register Direct, Direct, Indirect, Base Relative, and Base Indexed. The first two addressing modes do not use the address generator.

The address generators are not used in the Immediate mode. A value is specified in the immediate field of Processor Instruction Word for use in a processor operation.

The address generators are not used in the Register Direct mode. A value is read from or written to the register file. A register direct read is executed by specifying the register file address in the RO1, RO2, RO3, or RO4 fields of the Processor Instruction Word. The contents of the specified register file location is then loaded into the appropriate ROx register. A register direct write is executed by specifying the register file address in the RI1 or RI2 fields of the Processor Instruction Word. The value at the RIx port is written to the specified register file location.

In the Direct Addressing mode, scalar data stored in the local memory is accessed by specifying the address in local memory. It is more efficient to store scalar data values in the register file, but there are situations where it is necessary to store scalar data into local memory (as in the case of register spills, or indirection, where a scalar value is pointed to by a pointer). An address is specified by specifying a displacement to the User Base Register using the Direct Source (DS). The effective address computation is Effective Address=DS+UBy.

The Indirect Addressing mode is best used for implementing pointers to data in memory. A base register is loaded with the address of the data in local memory. The upper eight BR8-BR15 should be loaded with indirect address values first, since offsets are not needed in this mode. This mode is equivalent to Base Relative Addressing with zero displacement. The effective address computation is Effective Address=BRx+UBy.

Base Relative Addressing mode is best used for structure member accesses, and random accesses into arrays (such as table lookups), where the array is not accessed in a regular pattern. A base register is loaded with the base address of an aggregate data structure, such as a structure or array. A displacement is sent via the Direct Source as the offset (DS). The upper eight BR8-BR15 should be loaded with base relative address values first, since offsets are not needed in this mode. The Address Arithmetic Mode, described below, can be used with base relative addressing. The effective address computation is Effective Address=BRx+DS+UBy.

Base Indexed Addressing mode is best used for arrays that are accessed in a regular pattern. A base register, limit register, offset register and stride register are loaded with initial values. After the effective address is generated, the offset value is updated by adding the stride value. Only the lower eight BR0-BR7 and LR0-LR7 can be used for BLOS operations. The Address Arithmetic Mode can be used with base indexed addressing. The effective address computation is Effective Address=BRx+ORx+UBy and ORx=ORx+SRx

There are four Address Arithmetic Modes (AAM) that are used in conjunction with the Base Relative and Base-Indexed Addressing Modes. These special-purpose modes are used to reduce computation for common forms of array accesses. These modes are implemented in hardware to operate on a one-dimensional array in local memory. They are Modulo Arithmetic Mode, Bounding Mode, Butterfly Arithmetic Mode and Normal Mode.

The Modulo Arithmetic Mode will map an out-of bound array access into the array using modulo arithmetic. The modulo value is provided by the Limit Register Value. The Bounding Mode will provide the address of a user-specified boundary condition value when the array access is out-of bounds. The Butterfly Arithmetic Mode will generate the addresses of all butterflies for a stage of an FFT. The Normal Mode does nothing do modify an out-of-bound access.

In the Base Relative Addressing mode, the modulo arithmetic effective address is computed DS=DS modulo LRx and Effective Address=BRx+DS+UBy.

In the Base Indexed Addressing mode, the modulo arithmetic effective address is computed, and the offset is updated after the effective address is generated Effective Address=BRx+ORx+UBy where, ORx=(ORx+SRx) modulo LRx.

The modulo operation is computed as: ##EQU1##

In an example of the Modulo Arithmetic Mode, see FIG. 8, a two dimensional array is distributed over the processors, one column on each processor. The processor that has the data for column 30 of the array A has generated the offset 150, which is greater than the upper limit of 99 for the 100 element column. Under this mode, the limit value is subtracted from the offset to yield a new offset that is within bounds: element 50. The mode also checks to see if the offset is less than zero, and if it is, it adds the limit value to the offset to yield a new offset that is within the bounds of the array.

In the Bounding Mode, when an array access is out of bounds, the offset value is substituted with the address location of the boundary condition value. This is implemented in the following way: the default boundary-condition is stored in the location immediately following the last array location, so that it is stored at location (BRx+LRx). When an out-of bound address is detected, the address generator returns the address (BRx+LRx), which is the location of the boundary condition value.

In the Base Relative Addressing mode, the bounding mode effective address is computed as Effective Address=BRx+bound(DS)+UBy.

In the Base Indexed Addressing mode, the modulo arithmetic effective address is computed as:

Effective Address=BRx+ORx+UBy where ORx=bound (ORx+SRx).

The bound offset operation bound(X) is computed as: ##EQU2##

In an example of the Bounding Mode, see FIG. 9, a two dimensional array is distributed over the processors, one column on each processor. The processor that has the data for column 30 of the array A has generated the offset 120, which is greater than the upper limit of 99 for the 100 element column. Under this mode, the address [Base+Limit] is generated, which contains the constant zero. Thus, when the array element A[30,120] is accessed, the value zero is returned.

The Butterfly Arithmetic Mode is used to generate addresses for FFT butterfly pairs for an FFT whose real and imaginary components are stored as single dimension arrays.

The address generators have two modes: the address generation mode, and the setup mode for loading and writing address registers. The mode is determined by the Address Generator Mode Bit in the Processor Instruction Word. Both address generators share this bit, and so both address generators are always in the same mode of operation.

In setup mode, the 10-bit address generator field in the Processor Instruction field has the following format: (2) Read/Write Enable, NOP, (2) Direct Source Select, (3) Register Select (3) Register Number.

The 2-bit Read/Write Enable field determines whether a register is read into the address generator register file set, or written to the RAM. When a write is made to the RAM, the corresponding RAM instruction field must also specify a write. When the address generator writes a register value to the RAM, the write overrides the Write Data Select field selection in the RAM field. The 2-bit DS Select determines selects the source for loading data into the address generator register file set.

The 3-bit Register Select chooses the register set to be loaded. The register sets are: 1) UB0-UB7, 2) UL0-UL7, 3) BR0-BR7, 4) BRS-BR15, 5) LR0-LR7, 6) LRS-LR15, 7) OR0-OR7, 8) SR0-SR7.

The 3-bit Register Number field chooses which register within the set of eight is the active register.

In the address generation mode, the Processor Instruction Word has the following format:

______________________________________ (2) Addressing Modes 00 Direct Addressing 01 Indirect Addressing 10 Base-Relative Addressing 11 Base-Indexed Addressing (2) Direct Source Select (valid for Direct, Base-Relative Addressing Modes) (2) AAM Select (valid for Base-Relative, Base-Indexed Addressing Mode) 00 Modulo Arithmetic Mode 01 Bounding Mode 10 Normal Mode 11 Butterfly Arithmetic Mode (2) Stride Select (valid for Base-Indexed Addressing Mode) 00 Constant 0 01 Constant 1 10
Constant -1 11 Use stride register specified in BLOS Register Select (3) BLOS Register Select (valid for Base-Indexed Addressing Mode) (4) Base Register Select (valid for Indirect, Base-Relative Addressing Modes) Bit Allocation for Address Generator Instruction format Direct Addr. 00 xxdd xxxx Indirect Addr. 01 xxxx bbbb Base-Rel. Addr. 10 mmdd bbbb Base-Indexed 11 mmss Obbb bbbb: base register select Obbb: BLOS select dd: direct source select mm: Addr. Arith. Mode select ss: stride select xx: not used ______________________________________

As an example of the utility of the address generators, consider the multiplication of the following two 3.times.3 matrices: ##EQU3## C1=A1.times.B1+A2.times.B4+A3.times.B7 C4=A4.times.B1+A5.times.B4+A6.times.B7

C7=A7.times.B1+A8.times.B4+A9.times.B7

For efficiency, many computing systems will store the data for one matrix in row-major format and the other matrix in column-major format in order to reduce the mount of addressing computation and/or maximize cache use. However, on the SE, the matrix data can be stored in a consistent format, since the address generators have stride update capabilities.

Specifically, the data for both matrices can be stored in row-major format; the first matrix (whose row is being used for the computation) would use a stride of one, while the second matrix (whose column is being used for the computation) would use a stride of three, which is the distance between elements of the same column for a 3.times.3 array stored in row-major format.

Because the address generators free up the ALU from computing address arithmetic, performance is increased. In fact, a very tightly pipelined loop is formed by the processor resources. FIG. 10 is a resource usage table which demonstrates the efficiency of the processor. The columns of the table represent resources, and the rows represent instruction cycles. A shaded entry represents usage of the resource for a specific instruction cycle. The table represents the computation for a column of the result matrix.

A pipelined computation proceeds as follows: on the first instruction cycle, the addresses for the two matrices are generated. On the next instruction cycle, the values are fetched from the local memory. These values can then be multiplied on the next cycle since the memory ports are inputs to the multiplier. The product is then accumulated on the next cycle. Elements of the result matrix are then temporarily stored in the register file. A tight pipelined computation occurs because different resources are for each stage of computation, and each resource operates independently of each other.

Using the pipelined approach above, an N.times.N matrix multiplication would require (N2 +4N+6) instructions. The pipelining actually reduces the computation by an order of magnitude, since a matrix multiplication requires (2N3-N2) total arithmetic operations.

The specialized hardware matcher 604 efficiently counts the number of matches between arbitrarily long data sequences. The matcher is positioned in front of the ALU so that match counts can be accumulated. The matcher shares the instruction field with the multiplier in the Processor Instruction Word. Like the multiplier, the matcher uses the X and Y data sources and the P register for storing the result. This design decision was made since multiplication operations are orthogonal to matching operations (i.e., multiplication is not needed during matching, and matching is not needed during multiplication).

The matcher operates on data sequences of packed data words. Data is packed when two or more smaller sized data words are located into a single 32-bit word. The matcher can match two 32-bit values on each instruction cycle, storing the number of packed word matches in the P register. The values stored in the P register can then be accumulated by using the P register as an input to the ALU.

The 32-bit input words are interpreted by the matcher as packed data; each 32-bit word can represent multiple words of smaller sizes. Possible match formats for the input include:

______________________________________ Size of Word (bits) Number of Words ______________________________________ 1 32 2 16 3 10 4 8 5 6 6 5 7 4 8 4 16 2 32 1 ______________________________________

The 3, 5, 6, and 7 bit word formats have unused bits which are ignored by the matcher. The match format is defined in a setup instruction, which loads the B register (located in the matcher) with the match format. As an example, consider the two 32-bit data words presented in FIG. 11. If the B register is initialized to recognize packed words of size four, then five matches would be recorded.

The following extended example demonstrates how two sequences of packed data can be compared, and how match sequences that are not aligned on 32-bit word boundaries are handled. In this example, a short match sequence of 7 eight-bit words are used to match against a much larger sequence of data, represented by a one-dimensional array named D. The eight-bit words represent ASCII characters, and four ASCII characters can be packed into a 32-bit word. The match sequence must be compared to each
7 consecutive character subsequence of the larger D sequence. Thus, the match sequence must first be compared to characters 1:7 of D, then 2:8, 3:9, etc.

A complication arises with matching packed data sequences since the match does not necessarily begin on a 32-bit word boundary. A set of templates must be defined which represent all possible starting packed word positions within the 32-bit word. For the example presented, there are four packed words to the 32-bit word, and so the match can begin on the first, second, third, or fourth packed word within a 32-bit word of the D sequence. Thus, four match templates must be defined to cover these cases. FIG. 12 shows the set of match templates used for the 7 ASCII character sequence. The untilled portion of the templates are initialized to a character that is not used in the D sequence to ensure that no false matches occur.

To find an exact match, each template is matched with a subsequence of D of the same size. When all of the templates have been compared to the subsequence, the match sequence comparison is shifted relative to the D sequence by a 32-bit word. This is illustrated in FIG. 13. This ensures that all character subsequences of D are matched against the match sequence. The number of matches are stored in the P register, and then accumulated by the ALU.

An exact match is found by comparing the contents of the P register to the match sequence length. This comparison is executed in the ALU, by comparing the P register contents with the match sequence length. In the example, the match result should be compared to the number 7, the number of characters in the match sequence.

The following is an example of the use of a pseudocode program for obtaining matching information on a data sequence:

______________________________________ define N 1024 /* Number of 32-bit sequence words */ define PW 4 /* Number of packed words per 32-bit word */ define TL 3 /* Template Length in 32-bit words */ define LAST 2 /* ceiling(packed word sequence length in 32 bit words) */ int D[N] is the number of 32-bit sequence words; int M[PW*TL] holds the matching templates; int Result[N*4] is the match result array; int j;/* index variable to the array D */ int k;/* mod (TL*WPSW) index variable to M */ int r;/* index variable to the array Result */ j = k = r = 0; loop(N - (TL-1) /* number of subsequences in D */ loop(PW) /* number of match templates */ { mtotal = 0; loop(TL) /* length of template (32-bit words) */ {mtotal = mtotal + match(D[j++], M[k++]); } Result[r++]= mtotal; j = j - TL; } j++; } loop(LAST) /* last few match templates */ { mtotal = 0; loop(TL) /* length of template (32-bit words) */ { mtotal = mtotal + match( D[j++], M[k++]); } Result[r++] = mtotal; j = j - TL; } ______________________________________

Here, the data sequence D is defined as a 1024 element integer array D, the set of match templates are stored in the 12 element integer array M, and the match results for each subsequence is stored in the 4096 element integer array Result. There are 4090 7-character subsequences within the data sequence D.

The matching code is very efficient. The innermost loop computation can be pipelined between the match unit and the ALU. The index variable computations for the three arrays can all be handled by the processor Address Generators. The index variable k uses modulo 12 addressing arithmetic (providing appropriate bounding modes, as described above, for the Address Generator), resetting to zero after accessing the eleventh element of M. Finally, all of the loops use efficient loop hardware on the controller.

The execution mechanism that provides processors with the capability of conditionally executing code based on local data conditions is now described. An overview on conditional locking will be provided, followed by a description of the processor operations, the hardware requirements for implementing conditional locking will be defined and several pseudocode examples will be presented.

The processor state is `locked` when the processor performs NOPs (no-operation) instead of the instructions being sent to it from the instruction sequencer on the controller. Conversely, a processor is `unlocked` when it is executing the instructions sent by the controller. When a processor locks, it executes NOPs until a command is given by the controller to unlock. When a processor is locked, the IPC is still active, and certain bookkeeping operations are still executed by the processor to determine when it should unlock.

The conditional locking mechanism is efficient for implementing conditional code in a SIMD environment. Conditional code can be executed without a change in control flow, which incurs additional instruction overhead. The decision to change the processor state is made inside a Conditional Locking Code Segment (CLCS). A CLCS is an assembly language level construct that is delimited by begin and end statements. Each CLCS is associated with a Lock ID Number (LIN). Instructions within the CLCS lock and unlock the processor based on information from the Processor Status Word (PSW), described below.

A CLCS has a form similar to the `if-then-else` construct supported in most high level languages. There is a mutually exclusive execution condition between the `then` statement body and the `else` statement body either statement body can be executed by a processor, but not both. CLCSs can be nested, but cannot overlap. (If CLCS1 begins and then CLCS2 begins, CLCS2 must end before CLCS1 ends.)

The following operations are used to conditionally lock and unlock the processors:

1) Begin CLCS

2) End CLCS

3) Conditional Lock (on condition)

4) Conditional Unlock

5) Conditional Else

6) Interrupt Unlock

7) Interrupt Restore

8) NOP

The Begin CLCS and End CLCS are used to delimit the CLCS. The Conditional Lock instruction locks the processor if the condition given in the instruction is satisfied. The Conditional Unlock instruction unlocks all processors that are locked on the current (most closely nested) CLCS. The Conditional Else instruction unlocks all processors that have not executed code within the current CLCS and locks all processors that have executed code within the current CLCS. The Interrupt Unlock instruction is used when an interrupt occurs, or during a context switch to unlock all processors. The Interrupt Restore is used to restore the state of the processors before the Interrupt Unlock instruction was executed.

An example of CLCS is presented to demonstrate how a CLCS is similar to an if-then-else construct:

______________________________________ if (condition1) Begin CLCS then Conditional Lock (not condition1) statement 1 statement 1 else if (condition2) Conditional Else Conditional Lock (not condition2) statement 2 statement 2 else Conditional Else statement 3 statement 3 End CLCS ______________________________________

The following hardware support is used by the processor to support conditional locking:

ALIN counter: The active LIN number (ALIN) is the LIN number of the current CLCS that is being executed.

LIN register: The LIN value is the number of the CLCS on which the processor is locked. If the processor is unlocked, the ALIN and LIN are identical.

Cond register: The Cond register contains the PSW condition on which the processor locked.

C status bit: The C (Context) status bit, located in the PSW, determines the state of the processor. When the bit is set, the processor is locked, and when the bit is not set, the processor is unlocked.

X status bit: The X (Executed) status bit, located in the PSW, enforces the mutually exclusive execution of statement bodies within the CLCS. Suppressing the X bit suppresses the mutually exclusive property.

PLIN register: The Previous LIN register (PLIN) is where the LIN is stored when an interrupt occurs.

PCond register: The Previous Condition register (PCond) is where the Condition register is stored when an interrupt occurs.

PC status bit: The Previous Context (PC) status bit is where the C status bit is stored when an interrupt occurs.

PX status bit: The Previous Executed (PX) status bit is where the X status bit is stored when an interrupt occurs.

The general scheme works as follows. The value of the ALIN counter is incremented whenever a CLCS is entered, and decremented whenever a CLCS is exited. Thus, the ALIN value is equivalent to the CLCS nesting level. The ALIN value is the same on all processors, and is incremented and decremented even when the processor is locked.

A second value, called the LIN value, records which CLCS caused the processor to lock. This information is needed for the situation where there are nested CLCSs, and it must be determined if a processor locked on an outer CLCS or an inner CLCS. If a processor is unlocked, then the LIN value is the same as the ALIN value. When a processor is locked, the LIN number is less than or equal to the ALIN.

When a processor conditionally locks, the PSW is stored into the Cond register, and the LIN value no longer changes with the ALIN value. The C bit is then set in the PSW, which locks the processor.

A processor conditionally unlocks when a conditional unlock instruction is encountered in the code and the LIN value is the same as the ALIN value. Thus, the unlocking instruction always applies to the most closely nested CLCS. The processor is unlocked by one of four instructions: a Conditional Unlock, Conditional Else, Interrupt Unlock, or an End.sub.-- CLCS instruction (which signals the end of the conditional code segment).

The X bit is used to enforce a mutually exclusive execution property within the CLCS. When code is executed within a CLCS, the X bit is set on all unlocked processors. When the `else` clause of a CLCS is executed, the X bit is used to determine which processors have not yet executed. If a processor's X bit is still not set, it has not yet executed a statement body within the CLCS.

The manner in which hardware support used to implement the above-described Conditional Locking Operations is now described. Each operation is shown with pseudocode describing the operation of the conditional locking hardware, followed by an English description of how the instruction executes.

______________________________________ Begin CLCS ALIN = ALIN + 1 IF (C == 0) LIN = LIN + 1 X = 0 ENDIF ______________________________________

The ALIN is incremented (even if the processor is locked). If the processor is unlocked, the LIN value is also incremented, and the X bit is reset, since no code for the CLCS has been executed for the new CLCS.

______________________________________ End CLCS IF (LIN==ALIN) C = 0 LIN = LIN - 1 X = 1 ENDF ALIN = ALIN - 1 ______________________________________

If LIN==ALIN, then the processor's C bit can automatically be reset (to unlock the processors), since the CLCS is being exited. The LIN value is then decremented. The X bit is set because if the CLCS is nested, then the processor is executing a statement body of the next innermost nested CLCS. This relationship collapses the X bit information into 1 bit, instead of a number of X bits equal to the depth of the CLCS nesting. The ALIN must be decremented unconditionally, and so it appears outside of the following IF statement. "Conditional Lock (on condition)":

______________________________________ IF ( (C == 0) AND [(condition TRUE) OR (X==1)]) C = 1 Cond = PSW ELSE X = 1 ENDIF ______________________________________

If the processor is unlocked, and the condition is true or the X bit is set, then the processor is locked, and the PSW is stored in the Cond register. If the condition is false, then the processor remains unlocked. The X bit is then set, as a CLCS statement body will be executed.

Consider now the following IF statement. "Conditional Unlock":

______________________________________ IF ( (C==1) AND (LIN==ALIN) AND (X==0) ) C=0 ENDIF ______________________________________

If the processor is locked and the processor is locked on the most closely nested CLCS, and the processor has not executed a statement body for the CLCS yet, the processor is unlocked. If the processor was already unlocked, this instruction has no effect.

Consider now the following IF statement. "Conditional Else":

______________________________________ IF ( (C!=X) AND (LIN==ALIN) ) C=X ENDIF ______________________________________

This instruction will unlock all processors that have not executed on the current CLCS, and lock all processors that have executed on the current CLCS, providing the functionality of an `else` statement in an `if-then-else` statement.

The instruction for an "Interrupt Unlock" is:

PLIN=LIN

PC=C

PX=X

PCond=Cond

LIN=0

C=0.

This instruction saves the status of all the registers so that an interrupt can use LIN numbers without affecting the status of the program. All processors are unlocked so that they can respond to the interrupt.

The instruction for an "Interrupt Restore" is:

LIN=PLIN

C=PC

X=PX

Cond=PCond

This instruction restores the status of all the registers after an interrupt routine has finished.

FIG. 14 is an example of the translation from high level language pseudocode to low level code. Note that the translation is virtually one-to-one, with very little execution overhead. FIG. 14 also demonstrates how processors with different data condition execute different statements. Each processor executes a single statement Sx within the nested if statements; due in part to the Cond.sub.-- Else statement, which enforces the mutual exclusion property of the conditional statement.

The processors sometimes execute code that is dependent on the data, and so they may repeat execution on an operation until a condition is satisfied. When a processor is finished executing such an operation, to provide LOR (LOR) synchronization, it will set the LOR bit in its PSW to signal the controller that it has finished the computation. When all of the processors have signalled the controller and locked, the controller will send the signal to unlock and reset the LOR bit. Execution can then proceed.

As an example, consider the case of processors computing the value XY, where the values of X and Y are different on each processor. (Y-1) multiplications are required to compute the result, but this amount varies on each processor. The controller will send code to its processors to continually multiply a partial product by X until it has received the LOR signal to continue execution. The pseudocode program for this operation is:

______________________________________ P = 1 Count = Y+1 Begin.sub.-- CLCS Repeat until LOR signal received by all processors Decrement-and-Lock-On-Zero( Count ) P = P * X } End.sub.-- CLCS Reset LOR bit ______________________________________

In the pseudocode for Conditional Locking/LOR synchronization operation, the Decrement-and-Lock-On-Zero is a special instruction provided by the Auxiliary ALU. This instruction decrements the value in the ADR register and locks the processor if the result is zero.

In FIG. 15, the Processor Instruction Word (PIW), which is defined to be 128 bits long, is broadcast from the controller to the processors under its control. The instruction is sent as two 64-bit words which are time-multiplexed. The PIW, that has the instruction word format shown in FIG. 15, comprises a plurality of instruction fields, each of which is described below.

The P Instruction Field (1 bit) is a parity bit used for error checking. Even parity checking is implemented the total number of 1s in the instruction (including the parity bit) is always even. An error in the instruction can occur during transmission from the IO controller to the processor. Note that 1-bit errors will be detected, but 2-bit errors will go undetected.

The Mode Instruction field (2 bits) selects one of four the instruction word formats Mode 0, Mode 1, Mode 2, and Mode 3 shown in FIG. 15. The only difference between these instruction modes is the size of the immediate data field in the instruction word. The space for the immediate field overlaps the instruction fields specifying the RI2 and RO2-RO4 instruction fields. Thus, specifying immediate values limits the number of data transfers with the RF 608. Several sizes of immediate fields are defined to minimize the conflict with RF 608 access.

______________________________________ Mode Immediate Field RIx avail. ROx avail. ______________________________________ 0 None 1,2 1,2,3,4 1 32 bit 1 1 2 16 bit 1,2 1,4 3 8 bit 1,2 1,3,4 ______________________________________

The R Field (1 bit) sends the signal to refresh the (DRAM) local memory. A description of the IPC Instruction Field (8 bits) is provided later.

The Context Instruction Field (3 bits) controls the locking and unlocking of processors. Full detail of this control mechanism is described above. There are 8 instructions that are supported for conditionally locking and unlocking processors:

1) Begin CLCS

2) End CLCS

3) Conditional Lock (on condition)

4) Conditional Unlock

5) Conditional Else

6) Interrupt Unlock

7) Interrupt Restore

8) NOP

The ALU Instruction Field format (19 bits) is:

(1) I/F Select (Common with Multiplier Instruction Field)

(8) ALU operations

(2) Source A Select

(2) Source B Select

(1) ACC1 Enable

(1) ACC2 Enable

(1) ACC1 H/L Select

(1) ACC2 H/L Select

(2) Output Shift

The ALU Instruction Field specifies the operations and data sources for the ALU. A 1 bit I/F Select specifies whether the ALU operates in integer or floating point mode. The 8 bit ALU Operation field specifies what ALU function is executed the operations are listed below. The 2 bit Source A Select specifies one of four data sources, and the 2 bit Source B Select specifies one of four data sources as ALU operands. Two 1 bit fields determine whether the ACC1 and ACC2 registers should be updated. When the ALU is not being used by an instruction, the accumulator values are preserved. Two 1 bit fields determine whether the High or Low 32-bit word of ACC1 and ACC2 are the input to some other data source. A 2 bit Output Shift field specifies a normalizing shift for the output of the ALU.

Data Sources for Source A and Source B:

______________________________________ Source A: P Source B: ACC2 ACC1 R02 R01 IMD IPC MR1 ______________________________________

ALU Arithmetic operations:

A+B

A+B+1

A+B+Carry

A-B

B-A

COND (A-B) (if A>=B then A-B, else A)

COND (B-A)

ADDSUB (A, B) (A+B->ACC1, A-B->ACC2)

ADDSUB (B, A)

ALU Logic operations:

A AND B

A OR B

A XOR B

ALU Unary operations:

ABS (A)

ABS (S)

1s COMP (A)

1s COMP (B)

2s COMP (A)

2s COMP (B)

PASS (A)

PASS (B)

WORD SWAP (A)

WORD SWAP (B)

FF0 (A) (Find first zero)

FF0 (B)

FF1 (A) (Find first one)

FF1 (B)

ALU Shift operations:

Shift Right (A>>B)

Shift Left (A<<B)

Shift (A<<B), (B is signed)

ALU Special operations:

MAXMIN (A, B)

BITR (A , B) (REVERSE(A)>>B)

SEXTEND (A, B) (sign extend A from bit B)

PACK (A, B)

BEGIN.sub.-- DIVIDE

CONT.sub.-- DIVIDE

END.sub.-- DWDE

ALU Conversion operations:

INT.sub.-- TO.sub.-- FLOAT (A)

INT.sub.-- TO.sub.-- FLOAT (B)

FLOAT.sub.-- TO.sub.-- INT (A)

FLOAT.sub.-- TO.sub.-- INT (B)

ALU Setup Modes:

ALU CLIP/NO ALU CLIP

Roundoff/Truncate

The Auxiliary ALU (AuxALU) Instruction Field (4 bits) specifies the operation that are executed using the AuxALU, located near the RI1 port of the register file. The AuxALU is used for (conditionally) incrementing or decrementing the data in the ADR. There is a 4-bit Operation field the AuxALU Operations are:

1) Increment Value on Condition

2) Increment Value on Inverted Condition

3) Increment Value Unconditionally

4) Decrement Value on Condition

5) Decrement Value on Inverted Condition

6) Decrement Value Unconditionally

7) Decrement Value on Condition and Lock if Value is Zero

8) Decrement Value on Inverted Condition and Lock if Value is Zero

9) Decrement Value Unconditionally and Lock if Value is Zero

10) Load ACMR

11) Load ADR

13) Write ACMR

14) Write ADR

15) NOP

In order to increment or decrement on a condition, a PSW mask must be loaded into the ACMR. If the condition specified by the mask is satisfied, the operation is executed on the value stored in the ADR. The operations to load the ACMR and ADR read the data from the RI1 port. There are operations which use the inverted condition specified by the mask, as not all the conditions are explicit in the PSW. (Many conditions are mutually exclusive, such as zero and non-zero.) The operations listed above that decrement and lock on a zero value are used for executing data dependent operations such as power (x,y) (x to the yth power). The value of y would be decremented through the AuxALU while the partial product of multiplying the x values is being computed. When the y value decrements to zero, the sets the LOR bit in the PSW and locks. When the controller receives the LOR signal, the controller sends the instruction to unlock the processors.

The Multiplier/Match Select Instruction Field (1 bit) determines whether the multiplier or the matcher is active. Both resources cannot be active at the same time, and so the instruction fields for the two resources overlap. When the instruction field specifies one resource, the other resource performs a NOP for that instruction cycle. The matcher has a NOP instruction that is specified when both resources must execute a NOP.

The Multiplier Instruction Field (6 bits) format is:

(1) Operation

(2) Source X Select

(2) Source Y Select

(1) 1S/2S Select

A 1-bit Operation field selects the operation for the multiplier. The 2 bit fields Source X Select and Source Y Select choose one of four data sources for the X and Y source inputs into the multiplier. The 1-bit 1S/2S Select field determines whether the multiplier will operate in one's complement or two's complement format. The 1-bit I/F Select specifies whether the multiplier operates in integer or floating point mode. This bit is located in the ALU Instruction Field; the ALU and multiplier both operate in the same mode.

The multiplier can perform two operations; (as specified by the 1-bit Operation field) multiplication, or load P register direct with a 64-bit value. The Source X Select and Source Y Select fields specify the location of the upper and lower
32-bit words, respectively, that is to be loaded into the P register.

Data Sources for Source X and Source Y:

______________________________________ Source X: IMD Source Y: IPC MR1 MR2 RO3 RO4 ACC1 ACC2 ______________________________________

The Matcher Instruction Field (5 bits) format is:

(1) Operation

(2) Source X Select

(2) Source Y Select

(4) B Select (field is mutually exclusive with Source X, Y fields)

A 1-bit Operation field selects the operation for the matcher. The 2-bit fields Source X Select and Source Y Select choose the data sources for the X and Y source inputs. The 4-bit B Select field is mutually exclusive with the Source X and Y Select fields, and is used in the match setup instruction.

There are two operations that the matcher performs: matching, and match setup. When the matcher performs a match operation, the Source X Select and Source Y Select specify the data sources for the matcher's X and Y inputs. It can match two
32-bit values on each instruction cycle. The number of recorded matches is stored in the P register.

The matcher interprets the two 32-bit input values as packed data. Data is packed when two or more smaller sized data words are located in a single 32-bit word. The B register within the matcher selects the packed data format used by the matcher. For more information on how the matcher records matches for packed data, see the above description of an extended example on how two sequences of packed data can be compared, and how match sequences that are not aligned on 32-bit word boundaries are handled.

When a match setup operation is specified, the 4-bit B Select field specifies a value to be loaded into the matcher B register. Legal B values include 1-8, 16, 32, and B (no change). Note that if no change is specified for the B value, this means both the multiplier and the matcher are performing a NOP.

Data Sources for Source X and Source Y:

______________________________________ Source X: IMD Source Y: IPC MR1 MR2 RO3 RO4 ACC1 ACC2 ______________________________________

The RI1 Instruction Field (11 bits) format is:

(7) Register File Address

(1) Write Enable

(3) Write Data Source

The RI1 port is used to write values to the 128 word RF 608. The 7 bit Register File Address field specifies the destination RF word. The 1-bit Write Enable field determines whether or not the specified RF word is to be updated. The 3-bit Write Data Source field specifies the source of the data transfer. The following are registers/fields are sources for the RI1 port:

______________________________________ RI1: ACC1 IMD P(H) IPC MR1 CR RO1 PSW ______________________________________

The RI2 Instruction Field format (10 bits) is:

(7) Register File Address

(1) Write Enable

(2) Write Data Source

The RI2 port operation is identical to that of the RI1 port, but uses an 2-bit Write Data Source field instead. The following are registers are sources for the RI2 port:

______________________________________ RI2: ACC2 P(L) MR2 RO2 ______________________________________

The ROx Instruction Field (8 bits for each field) format is:

(7) Register File Address

(1) Read Enable

The RO1-RO4 registers are used to temporarily hold values read from the 128 word RF 608. The 7-bit Register File Address field determines which word from the RF 608 is to be read into the register. The 1-bit Read Enable field determines whether or not the register is to be updated. Each of the registers RO1-RO4 are data sources for other processor components:

______________________________________ RO1: ALU(A) RO3: MPY(X) MW1 MA1 RI1 IPCDR CR RO2: ALU(B) RO4: MPY(Y) MW2 MA2 RI2 IPCOR/CID PSW ______________________________________

The Immediate (IMD) Field (32 bits, 16 bits, or 8 bits) exists when the Mode field is nonzero, as described above. The size of the field varies with the Mode value, and the field overlaps RIx and ROx fields. The IMD field is used as input to the following sources:

______________________________________ IMD: MPY(X) MA1 ALU(B) MW2 RI1 IPCOR/CID ______________________________________

The Address Generator Mode Bit (1 bit) determines whether the address generator is operating in address generation mode, or in setup mode. The setup mode is responsible for loading and storing address generator register sets.

Address Generator 1, 2 Instruction Fields (10 bits for each field) have two modes, setup mode and address generation mode; the mode is determined by the Address Generator Mode Bit described above

In AG mode, the Instruction Field has the following format:

(2) Addressing Modes

(2) DS Select (mutually exclusive with Stride Select field)

(2) Address Arithmetic Mode Select

(2) Stride Select (mutually exclusive with DS Select field)

(4) Base Register Select (overlaps BLOS Register Select field)

(3) BLOS Register Select

A complete description of the Address Generator Instruction Field is described above in connection with the Address Generator. The DS Select for the two address generators are:

______________________________________ AG1 DS: IMD MR1 ACC2 RO3 AG2 DS: IPC MR2 ACC1 RO4 ______________________________________

In setup mode, the Instruction Field has the following format:

(2) Read/Write Enable, NOP

(2) Direct Source Select

(3) Register Select

(3) Register Number

The 2-bit Read/Write Enable field determines whether a register is read into the address generator register file set, or written to the RAM. When a write is made to the RAM, the corresponding RAM instruction field must also specify a write. When the address generator writes a register value to the RAM, the write overrides the Write Data Select field selection in the RAM field. The 2-bit DS Select selects the source for loading data into the address generator register file set.

The 3-bit Register Select chooses the register set to be loaded. The register sets are: 1) UB0-UB7, 2) User Limit Registers (UL0-UL7, 3) BR0-BR7, 4) Base Registers (BR8-BR15, 5) Limit Registers (LR0-LR7, 6) Limit Registers (LR8-LR15, 7) OR0-OR7,
8) SR0-SR7.

The 3-bit Register Number field chooses which register within the set of eight is the active register. More detail on this can be found above.

The RAM Instruction Field (3 bits for each field) format is: (1) Read/Write and (2) Write Data Select

There are two independent read/write ports to the local memory, and there are two 3 bit instruction fields to independently control access to the memory. Each Random Access Memory (RAM) Instruction field controls access to the memory local to the processor. The 1 bit Read/Write field determines whether a data value is to be read from the memory, or written to the memory. If the data is being written to memory, the 2-bit Write Data Select field determines the data source whose contents are written to memory. The exception to this is when the Address Generator is in S mode and is writing to RAM.

______________________________________ RAM 1 Data Sources: IPC MR1 ACC1 RO1 RAM 2 Data Sources: IMD MR2 ACC2 RO2 ______________________________________

The Processor Status Word (PSW) is a 32-bit register in each processor that contains information on that processor state after the execution of the last operation. Information on the result of an ALU operation, address generator, and processor state is found in the PSW.

The following ALU Status Bits (8 bits) are retained for compatibility with the PE. The two groups of eight status bits are complementary.

False (F) The bit is a constant zero.

Carry (C) The bit is set when the ALU generates a carry.

>0 (GT) The bit is set when the ALU result is greater than zero.

0(GE) The bit is set when the ALU result is >=0.

Valid (VAL) The bit is set when the ALU result is valid.

Underflow (UF) The bit is set when the ALU result underflows.

Overflow (OF) The bit is set when the ALU result overflows.

Zero (Z) The bit is set when the ALU result is zero.

The additional two bits are used for floating point arithmetic. Inexact (INE) The bit is set when a floating point result has been rounded or truncated. NotANumber (NaN) The bit is set when an a word is not a number.

The following Address Generator Status Bits (2 bits) are from the address generators. The bits are set if an array offset is outside the bounds of the array. These bits are set when the next offset is computed to be out of bounds for a BLOS addressing operation, or on the present offset for any other addressing operation. (See the above description of the Address Generators for more detail.)

OutOfBound 1 (OOB1) An array offset is outside the bounds of an array. (from Address Generator 1)

OutOfBound2 (OOB2) An array offset is outside the bounds of an array. (from Address Generator 2)

Processor Conditional Locking Status Bits The following Processor Conditional Locking Status Bits (4 bits) determine the execution state of the processor, and are used in operations that conditionally lock and unlock the processor. See above for more detail.

Context (C) This status bit locks and unlocks the processor.

PrevContext (PC) The context bit is stored when an interrupt occurs, so that the context can be restored afterward.

Executed (X) This status bit is used to determine whether the processor has executed on the current LIN number. This bit is used to enforce the mutual exclusion property of conditional execution.

PrevExecuted (PX) The executed bit is stored when an interrupt occurs, so that it can be restored afterward.

The following two bits are used to signal the sequencer.

LOR (LOR) This status bit is sent to the processor's controller to signal that an event has occurred on the processor. An example event is when a processor has signalled the controller that a data dependent operation has been completed. Alternatively, the LOR could be used as a 1-bit communication mechanism.

The following IPC Status Bits (2 bits) display status information for the processors IPC Operations (described fully later in detail).

IPC Parity Error (IPCP) The bit is set when there is a parity error in the IPC data when IPC Mode 1 operations are being executed.

IPC Reduction (IPCR) The bit is set when a reduction operation is needed to handle the incoming data, and an operation does not occur.

The following Image Vault Status Bit (1 bit) is used by the Image Vault (IV) to signal that it has completed loading data into the local memory.

The IV Finished (IVF) bit is set when the IV data is loaded.

Of the 32 bits, 12 bits of the status word are currently undefined.

The IPC is the primary channel through which data is transferred between processors. The IPC has a linear array network connectivity. Data can be moved in regular communication patterns, such as data shifts, bypasses, and broadcasts, or in arbitrary one-to-one or one-to-many communication patterns. The IPC Logic on each processor also has the capability of performing reduction operations on the IPC data such as sum, min, max, and, or, or xor operations.

Since the IPC is incorporated into the processor design, there is low latency communication. Processors that are up to four processors away from each other can transfer data once per processor instruction cycle. The linear array connectivity of the IPC reduces communication to one dimension, which simplifies routing and fabrication. The IPC reduction operations provide additional functionality to the processors, increasing the on-chip parallelism. Also, there is a mode of operation (called the IPC Tagged Mode) that supports a random access read and write capability, therefore providing a virtual crossbar communication capability in the SE.

The IPC is 64 bits wide with two parity bits and can operate at 400 MHz for a throughput of 3.2 G Bytes/sec. It is implemented as dual 33-bit channels and can operate at one or four times the instruction clock speed of the processor.

The IPC operates off of two instruction sources. An 8-bit field from the PIW specifies whether or not the IPC is active, and controls the loading and storing of IPC registers. The other instruction source is the 64-bit IPC Operation Register (IPCOR), which determines the specific IPC operations to be executed by the processor. This implementation means each processor can specify a unique IPC operation. IPC operations are MIMD.

The IPC operates in one of two basic modes: IPC Mode, and IPC Tagged Mode. In the Channel mode, the 33-bit IPCs are independently programmable. Each IPC can shift the data left or right on the channel, bypass the data left or right on the channel, or broadcast data to other processors. FIG. 16a shows a right shift of the IPC. The bypass operation allows processors to be excluded from a shift operation. In FIG. 16b, processors 5, 6, and 7 are bypassed, and so processor 8 receives the data from processor 4. In the broadcast operation, the processor that is the source of the communication sends the value to the neighboring processors. These processors in turn shift the data along the channel. (FIG. 16c) Processors defined as the sink of a broadcast, such as processors 6 and 7 in FIG. 16d, do not continue to pass the data when it is received. Processor 7 is both a source and a sink of its local broadcast.

In the IPC Tagged Mode, the IPC operates as a single 66-bit channel. This mode is used to provide arbitrary one-to-one and one-to-many communication. In this mode, a tag called the Communication ID (CID) field is associated with the data. Every processor that is to be the recipient of the data loads the same CID value in its CID register (CIDR). The IPC is then shifted at the maximum speed (4 shifts/cycle), and the matching hardware in the IPC Logic loads the IPC Data Register (IPCDR) with the tagged data when its CID value matches the value in the CIDR field.

Before IPC operation, processors 0, 1, and 4 load tagged data onto the IPC, and all processors specify the tagged data to be received, as indicated in table (a) below. After the IPC operation, all processors have received the data associated with the tag specified in the CIDR, as indicated in table (b) below.

______________________________________ CID tag 5 20 10 Data A B C (a) Before IPC Operation Processor 0 1 2 3 4 5 CIDR 20 10 5 20 20 20 IPCDR .circle-solid. .circle-solid. .circle-solid. .circle-solid. .circle-solid. .circle-solid. (b) After IPC Operation Processor 0 1 2 3 4 5 CIDR 20 10 5 20 20 20 IPCDR B C A B B B ______________________________________

In addition to the 13-bit CID field, the 66-bit IPC Tagged operation word contains a 50-bit data field, a 2-bit tag field, and an even parity bit. The 2-bit tag field is user defined, but the tag value 00 is reserved to assert that the data is invalid. The data field is user formatted, and the least significant 32-bits can be masked for IPC reduction operations. Possible uses for the additional data field bits include a return CID, so that a value can be returned to the processor originating the tagged data, or a memory address or array offset could be specified in the additional data field bits, so that the receiving processor can associate a memory location with the data being sent.

The processors are connected in a linear array via the IPC, and data is transferred between processors through the IPC. Data in the IPC can be shifted to the left or to the right. The bus can operate at two speeds: one or four shifts per instruction cycle. At the slower speed, the IPC operates at the same speed as the processors, and data is shifted from one processor to a neighboring processor in one cycle. The faster speed can shift data to four processors away in a single cycle.

The IPC has two separately controlled 32 bit channels, named Channel 1 and Channel 2, whose data can either be shifted to the left or to the right. When 64-bit data is transmitted through the IPC, Channels 1 and 2 must be programmed identically. Alternatively, two 32-bit values can be sent on the IPCs. These 32-bit values can move in the same direction or in opposite directions, as specified by the IPCOR. IPC Operations are described below.

Operations on the IPC are determined from two instruction sources. There is an 8-bit IPC Instruction Field which is specified in the PIW, and a 64-bit IPC Operation which is loaded into the IPCOR. The IPC Instruction field is common for all processors (since it appears in the Processor Instruction Word), whereas the specified IPC Operation is local to the processor.

The IPC Instruction Field (8 bits) is located in the P1W. It has the following subfields:

(1) Run/Stop

(2) Load IPCDR (Preserve, Load H&L, Load L, Load H)

(2) Source Select for IPCDR

(1) Load IPCOR (Preserve, Load)

(1) Load CIDR (Preserve, Load)

(1) Source Select for IPCOR, CIDR

The 1-bit Run/Stop field determines whether the IPC is active or inactive on the current instruction. The 2-bit Load IPCDR determines if and how the 64-bit IPCDR is loaded. The four modes are: Preserve contents of IPCDR, Load IPCDR(L), Load IPCDR(H), and Load IPCDR(L,H). In the last case, the low and high word of the IPCDR are loaded with the same 32-bit value. The 2-bit Source Select determines which source will be loaded into the IPCDR. The 32-bit sources are: RO3, ACC1, ACC2, and MR1. The 1-bit Load IPCOR loads or preserves the value of IPCOR. The 1-bit Load IPC CIDR loads or preserves the value of CIDR. The 1-bit Source Select for IPCOR and CIDR determines which source will be loaded. IPCOR and CIDR have common sources they are IMD and MR2.

IPC Operations (64 bits) are stored on the processor, as the operations are data and processor dependent. An IPC Operation is a 64-bit value that is loaded into the IPCOR. Each operation is actually two 32-bit operations that control the IPCs independently. The high 32 bits control IPC 1, and the low 32 bits control the IPC 2. If a 64-bit value is to be communicated through the IPC, the high and low words of the IPC operation must be identical. There are two types of operations: IPC Operations and IPC Tagged Operations.

IPC Operations are similar to the IPC operations supported on the PE. In this mode, both IPCs are separately programmable. A 64-bit value is transmitted through the IPC by programming the two channels identically. There are three types of operations: shifting, bypassing, and broadcasting.

IPC Tagged Operations are designed for arbitrary communication between a set of processors. In this mode, both IPCs must be used together to transmit a 64-bit word. The word is comprised of a message number, CID and data. The sender of data will assign a CID to the 64-bit word being sent, and all processors that receive the data must have the same CID loaded in the CIDR. In this way, one-to-one and one-to-many communication protocols can be supported. Alternatively, a processor uses its processor ID as the CID, and a range of processors are specified as recipients of the data. The data format is left up to the programmer, and can include such information as a return CID.

For IPC Tagged Operations, after the data is loaded into IPCDR, the IPC contents are shifted on each cycle for a duration of time determined by the sequencer. The sequencer has a user-programmable counter which determines the number of cycles needed to send data to its destination. Each processor compares the data that has shifted into its IPC and compares the CID of the data with the value in its CIDR. If the two CID values match and the tag bits of the word are non-zero, the 64-bit word is loaded into the IPCDR.

An IPC operation (27 bits) includes IPC operations such as shifting, bypassing and broadcasting data on the IPC. These operations independently control the IPCs, so two different operations can be executing at once. (The IPC 1 instruction is stored in the upper 32 bits of IPCOR, and the IPC 2 instruction is stored in the lower 32 bits.)

IPC operations have the following 27-bit instruction field format:

(1) Mode field (set to Channel Mode)

(1) IPCDR High/Low Select

(1) IPC Speed (1 shift/cycle, 4 shift/cycle)

(1) Enable Boundary Value

(3) Reduction Operation

(1) Left/Right Directional Bit

(2) Operation (Shift, Bypass, Broadcast, NOP)

(1) Broadcast Send

Broadcast Send NOP

(2) Broadcast Receive

Broadcast Receive Left Boundary

Broadcast Receive Right Boundary

Broadcast Receive NOP

(13) Capture Cycles

(1) Repeat Operation

The 1-bit Mode field specifies the instruction is an IPC Operation. The 1-bit IPCDR H/L Select, determines whether the high or low word of IPCDR is read by other processor components. A 1-bit IPC Speed field determines whether the IPC is operating at the same speed as the processor (one shift/cycle), or at four times the processor speed (four shifts/cycle). There is a 1-bit Enable Boundary Value field which specifies whether the processor should shift its data value to the next processor. Enabling the boundary value prevents interference between several independent IPC operations that are using the IPC at the same time. The 3-bit reduction operation field is common to both modes.

IPC Operations have a one bit field to determine the direction of the IPC, which is either left or right. A 2-bit operation field determines whether a shift, bypass, broadcast, or NOP is executed. If a broadcast operation is executed, a 1-bit broadcast send field determines whether the processor is the originator of the broadcast. A 2-bit field determines how the processor is participating in the broadcast receive. A processor can either receive a data value and pass it on to its neighboring processors on the IPC, or, if one of the boundary specifications is selected, it serves as the sink of the broadcasted value. A left boundary broadcast receive specifies that the processor is the leftmost processor on the IPC to receive the data; a right boundary broadcast receive specifies it is the rightmost processor. Since there are 32 bits in an IPC Operation, 5 bits are currently unused for each channel.

FIG. 16 provides a few high level views of the IPC performing shift, bypass, and broadcast operations. The registers represent the IPCDR on each processor. The top picture demonstrates a right shift on the bus. The second picture demonstrates a bypass operation, where three processors have been bypassed. In this example, a bypass pattern has been specified that makes the first and fifth (counting from the left) processors logical neighbors. A single right shift from the first processor shifts the data into the fifth processor. (It must be understood that the operation is not necessarily occurring in one instruction cycle; if many processors are bypassed, it may take several instructions to shift data to the next logically connected processor.) In the third picture, the third processor from the left is broadcast its value. In the bottom picture, several processors are broadcasting. The second and fourth processors from the left are executing a Broadcast Send instruction, while the third processor is executing a Broadcast Receive Right Boundary, and the fourth processor is executing a Broadcast Receive Left Boundary; this is the way to specify sinks for the broadcast, and prevent local broadcasts from interfering with each other.

An IPC Tagged Operation (62 bits) allows arbitrary communication between a set of processors. These operations use the IPC as a single 64-bit channel. For Tagged operations, a counter in the sequencer is loaded with the number of cycles needed to complete the communication. When the counter decrements to zero, IPC communication is completed and the sequencer is signaled that the communication has been completed. A Tagged operation has two data formats; these formats determine how the CID is interpreted.

IPC Tagged operations have the 62-bit instruction field format:

(1) Mode field (set to Tagged Mode)

(1) IPCDR High/Low Select

(1) IPC Speed (1 shift/cycle, 4 shift/cycle)

(1) Enable Boundary Value

(3) Reduction Operation

(1) IPC Data Range format

(11) Left Shift Cycles.times.4

(11) Right Shift Cycles.times.4

(32) Reduction Mask

The 1-bit Mode field specifies the instruction is an IPC Tagged Operation. The 1-bit IPCDR H/L Select, determines whether the high or low word of IPCDR is read by other processor components. A 1-bit IPC Speed field determines whether the IPC is operating at the same speed as the processor (one shift/cycle), or at four times the processor speed (four shifts/cycle). There is a 1-bit Enable Boundary Value field which specifies whether the processor should shift its data value to the next processor. Enabling the boundary value prevents interference between several independent IPC operations that are using the IPC at the same time. The 3-bit reduction operation field is common to both modes.

A 1-bit IPC Data Range format specifies one of two legal data formats for interpreting CID values. There are two 11-bit fields for specifying how far the data is shifted to the left and right of the processor. The value specified is scaled by four, so specifying a one in the field means the data will be shifted four times in that direction. The 32-bit reduction mask applies to the least significant 32 bits of data in the IPCDR, and specify which bits in the word are subject to the reduction operation. There are 2 undefined bits.

In Tagged Data format 1, the CID value is interpreted as a communication ID number. Any processor that has the matching CID number in its CIDR will receive the data.

The format of the 64-bit data word on the IPC is: (1)Even Parity Bit, (13) CID field, (2) Tag bit field and (50)Data field.

In this format, a 1-bit Even Parity bit is used to detect errors. A 13-bit CID field contains the value to be matched by the destination processors. There is a user-defined 2-bit tag bit field. If the field is non-zero, then meaningful data is in the 64-bit word. (Although the tag bits are user-defined, the tag bit pattern `00` is reserved.) The 50-bit data field is for data. It is the responsibility of the programmer to decide on a data format.

One possible data format that could be used for the 50-bits is: (13) Return CID address, (32) Data. Another possible data format is: (18) Offset into an array, (32) Data (to store/read in/from the array).

An example of how communication occurs using IPC Tagged Data format 1 is shown below in Tables (a) and (b). Table (a) shows that before IPC Operation, all processors have a CID value loaded into their CIDR.

______________________________________ CID tag 5 20 10 Data A B C (a) Before IPC Operation Processor 0 1 2 3 4 5 CIDR 20 10 5 20 20 20 IPCDR .circle-solid. .circle-solid. .circle-solid. .circle-solid. .circle-solid. .circle-solid. ______________________________________

The processors then put their local data into the IPCDR along with a CID tag. The bus is then shifted at a high speed, and the matching hardware tries to match the CID value in the tagged data with the value in its CIDR. If they match, the data is loaded into the IPCDR. Therefore, after IPC Operation, the result is that shown in Table (b).

______________________________________ (b) After IPC Operation Processor 0 1 2 3 4 5 CIDR 20 10 5 20 20 20 IPCDR B C A B B B ______________________________________

If several words of data on the IPC have the same CID value, then the resulting value placed in the IPCDR is dependent on the IPC reduction operator.

In Tagged Data format 2, the CID value is interpreted as the processor number for the processor. Any processor whose processor number is defined by the CID and Range fields will receive the data.

The format of the 64-bit data word on the IPC is: (1) Even Parity Bit, (13) CID field, (2) Tag bit field, (8) Range field, (42) Data field.

A 1-bit Even Parity bit is used to detect errors. A 13-bit CID field contains the value of the processor ID. In this mode, the CID field is loaded with the processor ID. Them is a user-defined 2-bit tag bit field. If the field is non-zero, then meaningful data is in the 64-bit word. (Although the tag bits are user-defined, the tag bit pattern `00` is reserved.) The 8-bit Range field specifies a contiguous range of processors (the Range value indicates that processors between [CID] and [CID+Range] will receive the data). A 42-bit Data field for data; the programmer must decide on a data format for the field.

The example shown in the following two Tables demonstrates how IPC Tagged Data format 2 works. In this example, processor 0 is going to send the data `A` to processors 2-5. Initially, each processor puts its logical processor number into the CIDR and processor 0 specifies a CID of 2 with a range of 3, as shown in the following first Table.

______________________________________ Initially: ______________________________________ PROC: 0 1 2 3 4 5 6 CIDR: 0 1 2 3 4 5 6 Range: 3 .circle-solid. .circle-solid. .circle-solid. .circle-solid. .circle-solid. .circle-solid. Data: A .circle-solid. .circle-solid. .circle-solid. .circle-solid. .circle-solid. .circle-solid. Data ID: 2 .circle-solid. .circle-solid. .circle-solid. .circle-solid. .circle-solid. .circle-solid. ______________________________________

After the IPC operation, processors 2-5 have the correct data value, as shown in the following second Table.

______________________________________ After IPC Op: ______________________________________ PROC: 0 1 2 3 4 5 6 Received: .circle-solid. .circle-solid. A A A A .circle-solid. ______________________________________

The Reduction Operation Field (3 bits) is common to both IPC and Tagged operations. It specifies a reduction operation to be performed on the data on the IPC. If the field specifies that the reduction should not occur, and it is needed, a bit in the PSW is set. There are eight reduction operations:

______________________________________ 1) XOR 5) Max 2) AND 6) Min 3) OR 7) Sum 4) Replace 8) Sort ______________________________________

In Channel Mode, the data field is the 32 bit value in the IPCDR for the specified channel (Channel 1 uses the high 32 bits of IPCDR and Channel 2 uses the low 32 bits of IPCDR). In Tagged Mode, the data field is variable, and is defined by the
32-bit Reduction Mask, which is applied to the least significant 32 bits of data. The specified reduction operation is performed on the word received via the IPC bus and the IPCDR. The result of the operation is supplied to the processor as the signal IPC. In FIG. 6, the signal IPC may be written into the local memory via the address generator 610-1, stored into a register in the RF 608, applied as the Y operand of the match unit 604 or multiplier 602 or as the A operand of the ALU 600.

Data reduction operations occur as follows. A data value received by the IPC logic 612 via the IPC bus is one operand and the value held in the IPCDR is the other operand. Once the operation is performed, the result is stored in the IPCDR, replacing the original contents. Two of the operations specified above, Replace and Sort, are better understood with some explanation. By the Replace operation, the value received via the IPC bus replaces the original contents of the IPCDR. By the Sort operation, the larger operand is placed in the 32 MSB positions of the IPCDR while the smaller operand is placed in the 32 LSB positions.

The IOMC is responsible for all data transfer between the SE and all external sources. The SE is organized into cylinders; each cylinder contains a processor, local memory, and IOMC. A cylinder is organized so that the only form of communication between the IOMC and the processor is through the local memory. Thus, processor I/O is memory mapped, and it is the responsibility of the controller and the IOMC to ensure that the data transfer between external sources and local memory is executed properly.

The IOMC has connections to three main I/O Channels: a Data Input Channel (DIC), Data Output Channel (DOC), and a Host I/O Channel (HIOC); they handle data transfer between video sources, video destinations, and the host workstation, respectively. The DIC and the DOC are connected to the IOMC through processor interfaces called the Input Slice and Output Slice.

The Host I/O Bus (HIO) is a 32-bit bidirectional channel connecting the Host Workstation to the IOMCs. The channel connects the IOMCs in a linear array, with the host sitting on the left end of the HIO. The channel has a data rate of 200
MB/sec.

The DIC is a 48-bit unidirectional channel simultaneously connecting the IOMC with up to 4 Video Sources. The DIC is comprised of 4 independently controlled 12-bit serial channels, each of which operate off different clocks (as each channel could be reading a different Video Source). The DIC connects the IOMCs in a linear array, with the Video Sources sitting on the left end of the DIC. The channel transmits data from left to right on the bus. The DIC is connected to the IOMC via the Input Slice. The channel operates at a maximum speed of 86 MHz and has a data rate of 1.2 GB/sec.

The DOC is a 48-bit unidirectional channel simultaneously connecting the IOMC with up to 4 Video Destinations. Like the DIC, the DOC is comprised of 4 independently controlled 12-bit serial channels, which each operate off of different clocks (as each channel could be writing a different Video Destination). There is a mode where a DOC will operate off of the DIC clock if the video input and output channels are transmitting data in the same format and speed. The DOC connects the IOMCs in a linear array, with the Video Destinations sitting on the left end of the DIC. The bus transmits data from right to left on the bus. The DIC is connected to the IOMC via the Output Slice. The channel operates at a maximum speed of 86 MHz and has a data rate of 1.2 GB/sec.

The Input Slice shown in FIG. 17, is the IOMC interface for the DIC and comprises an Input Controller 1700, four 64.times.32-bit FIFOs 1702-1 to 1702-4 one for each DIC, and the hardware that interfaces with the DIC. Each of FIFOs 1702-1 to
1702-4 includes a formatter (FMT) for changing the 12 bit input thereto into a 32 bit output therefrom. The data from the DIC can either be directed through the FIFO and into local memory, or it can be passed on the DIC to the Input Slice of the next IOMC on the linear array. Alternatively, data from the Output Slice of the previous IOMC can be routed into the Input Slice of the IOMC. Controller 1700 is responsible for two functions: controlling what data is loaded into FIFOs