United States Patent5212777
Gove , ; et al.May 18, 1993

Title

Multi-processor reconfigurable in single instruction multiple data (SIMD) and multiple instruction multiple data (MIMD) modes and method of operation

Abstract

There is disclosed a multiprocessor system arranged, in one embodiment, as an image and graphics processor. The processor is structured with several individual processors all having communication links to several memories without restriction. A crossbar switch serves to establish the processor memory links. The entire image processor, including the individual processors, the crossbar switch and the memories are contained on a single silicon chip. Each processor can operate to execute the same instruction at the same time (SIMD mode) or different instructions at the same time (MIMD mode).


Inventors:Gove; Robert J. (Plano, TX), Balmer; Keith  (Bedford, GB2), Ing-Simmons; Nicholas K.  (Bedford, GB2), Guttag; Karl M.  (Missouri City, TX)
Assignee:Texas Instruments Incorporated (Dallas, TX)
Appl. No.:437858
Filed:November 17, 1989

Current U.S. Class:712/229 711/118 711/121 712/200 
Field of Search:364/2MSFile,9MSFile 395/375,325,425

U.S. Patent Documents
4435758March 1984Lorie et al.
4562535December 1985Vincent et al.
4644496February 1987Andrews
4747043May 1988Rodman
4792894December 1988Artz et al.
4873626October 1989Gifford
4891787January 1990Gifford
4992933February 1991Taylor
5050065September 1991Dartois et al.
Foreign Patent Documents
WO89-00734Jan., 1989WO
Other References
"The Connection Machine", by W. D. Hillis, published in The MIT Press (1985). .
"Handling Real Time Images Comes Naturally to Systolic Array Chip", by Hannaway, Shea, Bishop, in Electronic Design, pp. 289-300 (1984). .
"Systolic Array Chip Recognizes Visual Patterns Quicker than a Wink", by W. W. Smith, P. Sullivan in Electronic Design, pp. 257-266 (1984). .
"Real Time 3D Object Tracking in a Rapid Prototyping Environment", by Robert J. Gove, in Electronic Imaging '88 (1988). .
"Integration of Symbolic and Multiple Digital Signal Processors with the Explorer/Odyssey for Image Processing and Understanding Applications", by Robert J. Gove, in Proceedings of the IEEE International Symposium on Circuits and Systems, pp. 968-971 (May, 1987). .
"The Use of Parallel-Processing Computers in Digital Image Processing", by Lew Brown, published by Alliant Computer Systems Corp. .
"Vitec Parallel C Compiler", by T. Butler, published by Visual Information Technologies, Inc., Plano, Tex., pp. 741-747. .
"A Single Board Image Computer with 64 Parallel Processors", by Stephen Wilson, published in Electronic Imaging '87 International Electronic Imaging Exposition & Conference (1987) pp. 470-475. .
"The Androx Parallel Image Array Processor", by Wayne Threatt, in Electronic Imaging '87, International Electronic Imaging Exposition & Conference (1987), pp. 1061-1064. .
"Design of a Massively Parallel Processor", by Kenneth Batcher, IEEE Transactions on Computers, v. C-29, No. 9 (1980). .
"High Resolution Frame Grabbing and Processing Through Parallel Architecture", by Daniel Crevier, published by Coreco, Inc. Quebec, Canada. .
"Multiple Digital Signal Processor Enironment for Intelligent Signal Processing", by Gass, et al., in Proceedings of the IEEE, v. 75, No. 9 (Sep. 1987) pp. 1246-125. .
"Architecture and Design of the Mars Hardware Accelerator", Agrawall, et. in 24th ACM/IEEE Design Automation Conference (1987), pp. 101-107. .
"Digital Video & Image Processors", by O'Brien, Mather & Holland, published by Plessey Semiconductors (1989). .
"An Architectual Study, Design and Implementation of Digital Image Acquistion, Processing and Display Systems with Micro-Processor-Based Personal Computers and Charge-Coupled Device Imaging Technology", by Robert J. Gove, SMU (1986). .
"A Medium Grained Parallel Computer for Image Processing", by R. S. Cok, published by Digital Technology Center, Eastman Kodak Co., Rochester, N.Y..~
Primary Examiner: Lee; Thomas C.
Assistant Examiner: Geckil; Mehmet
Attorney, Agent or Firm:Marshall, Jr.; Robert D. Kesterson; James C. Donaldson; Richard L.

Claims


What is claimed is:
1. A multi-processor system operable in either a single instruction multiple data (SIMD) mode or in a multiple instruction multiple data (MIMD) mode comprising:
a SIMD/MIMD mode register storing therein an indication of either the single instruction multiple data (SIMD) mode or the multiple instruction multiple data (MIMD) mode;
a plurality of processors, each processors having a data port and an instruction port and operating from instructions provided to said instruction port for controlling a process including movement of data to and from said data port;
a plurality of memories, said plurality of memories including at least one data memory corresponding to each of said processors and an instruction memory corresponding to each of said processors;
a switch matrix connected to said SIMD/MIMD mode register, to each of said plurality of processors and each of said plurality of memories, said switch matrix including
a set of first links connected to said memories,
a set of second links connected to said data ports of said processors,
a third link having a plurality of sections equal in number to the number of said processors, each section connected to said instruction port of a corresponding one of said processors,
a plurality of buffers disposed between adjacent sections of said third link connecting said adjacent sections of said third link when said SIMD/MIMD mode register indicates the single instruction multiple data (SIMD) mode, and splitting said adjacent sections of said third link when said SIMD/MIMD mode register indicates the multiple instruction multiple data (MIMD) mode, and
a plurality of crosspoints disposed at intersections between said first links and said second links and at intersections between said first links and said sections of said third link, said crosspoints individually operating to connect said first and second links permitting said data port of a processors to access a memory, and to connect said first links and said third link permitting said instruction port of a processor to receive an instruction from an instruction memory,
said plurality of crosspoints including a first crosspoint disposed at the intersection of said section of said third link connected to said instruction port of a predetermined first processor and said first link connected to said corresponding instruction memory which is always enabled to permit connection,
said plurality of crosspoints including a set of second crosspoints disposed at the intersection of said section of said first link connected to said instruction port of processors other than said predetermined first processor and said respective first link connected to said corresponding instruction memories which are disabled to prohibit connection when said SIMD/MIMD register indicates the single instruction multiple data (SIMD) mode and enabled to permit connection when said SIMD/MIMD register indicates the multiple instruction multiple data (MIMD) mode;
whereby said instruction port of each processor is connected to said instruction memory corresponding to said predetermined first processor in the single instruction multiple data (SIMD) mode via said first crosspoint, said sections of said third link and said buffers, and said instruction port of each processor is connected to said corresponding instruction memory in the multiple instruction multiple data (MIMD) mode via said corresponding second crosspoint and said corresponding section of said third link.

2. The multi-processor system claimed in claim 1, further comprising:
a communications bus permanently interconnecting all of said processors, said communication bus for coordinating change from the SIMD and MIMD modes.

3. The multi-processor system claimed in claim 2, wherein at least a portion of one of said memories is a parameter store memory storing a plurality of parameters and said communications bus includes circuitry for exchanging signals which specifies certain parameters stored in said parameter store memory.

4. The multi-processor system claimed in claim 2, wherein said parameters are stored in and retrieved from said parameter store memory via said switch matrix.

5. The multi-processor system claimed in claim 1, wherein: each processor includes
a program counter having stored therein an address of a next instruction, and
an instruction fetch circuit for supplying to said instruction port said address stored in said program counter for fetching an instruction stored at said address; and said switch matrix wherein
said buffers pass signals unidirectionally whereby when said buffers connect said adjacent sections of said third link in the single instruction multiple data (SIMD) mode only said program counter of said predetermined first processor accesses said instruction memory corresponding to said predetermined first processor.

6. The multi-processor system claimed in claim 5, further comprising:
an external memory for storing instructions for each of said processors;
each instruction memory stores only a portion of the instructions for the corresponding processor; and
each processor further includes
a cache logic circuit connected to said program counter for determining if the instruction corresponding to the address stored in said program counter is stored in said corresponding instruction memory,
said cache logic circuit requesting the instruction corresponding to the address stored in said program counter from said external memory if the instruction is not stored in said corresponding instruction memory when said SIMD/MIMD register indicates the multiple instruction multiple data (MIMD) mode,
said cache logic circuit of each processor except said predetermined first processor disabled when said SIMD/MIMD register indicates the single instruction multiple data (SIMD) mode, and
said cache logic circuit of said predetermined first processor requesting the instruction corresponding to the address stored in said program counter from said external memory if the instruction is not stored in said corresponding instruction memory when said SIMD/MIMD register indicates the single instruction multiple data (SIMD) mode.

7. The multi-processor system claimed in claim 1, wherein: said switch matrix wherein
said plurality of crosspoints including a set of third crosspoints disposed at the intersection of said first links connected to said instruction memories and said second links connected to said data ports of said plurality of processors, said set of third crosspoints disabled to prohibit connection when said SIMD/MIMD register indicates the single instruction multiple data (SIMD) mode and enabled to permit connection when said SIMD/MIMD register indicates the multiple instruction multiple data (MIMD) mode;
whereby access to said instruction memories via said data ports of said plurality of processors is prohibited when in the single instruction multiple data (SIMD) mode and permitted when in the multiple instruction multiple data (MIMD) mode.

8. The multi-processor system claimed in claim 1, wherein:
said SIMD/MIMD mode register stores therein an indication of either the single instruction multiple data (SIMD) mode of the multiple instruction multiple data (MIMD) mode for each of said plurality of processors;
said switch matrix wherein
said buffers connect adjacent sections of said third link in a serial chain from a first processor through to a last processor, each of said plurality of buffers connecting said adjacent sections of said third link only when said SIMD/MIMD mode register indicates the single instruction multiple data (SIMD) mode for a corresponding processor, and splitting said adjacent sections of said third link when said SIMD/MIMD mode register indicates the multiple instruction multiple data (MIMD) mode for the corresponding processor,
said second crosspoints being disabled to prohibit connection when said SIMD/MIMD register indicates the single instruction multiple data (SIMD) mode for a corresponding processor and enabled to permit connection when said SIMD/MIMD register indicates the multiple instruction multiple data (MIMD) mode for the corresponding processor;
wherein some of said plurality of processors may operate in the multiple instruction multiple data (MIMD) mode and while serial chains of processors operate in the single instruction multiple data (SIMD) mode, said instruction port of each processor in a serial chain of processors in the single instruction multiple data (SIMD) mode being connected to said instruction memory corresponding to a first processor in the serial chain, and said instruction port of each processor in the multiple instruction multiple data (MIMD) mode being connected to said corresponding instruction memory.

9. The multi-processor system claimed in claim 8, wherein: each processor includes
a program counter having stored therein an address of a next instruction, and
an instruction fetch circuit for supplying to said instruction port said address stored in said program counter for fetching an instruction stored at said address; and said switch matrix wherein
said buffers pass signals unidirectionally whereby when said buffers connect said adjacent sections of said third link in the single instruction multiple data (SIMD) mode only said program counter of the first processor in a serial chain accesses said its corresponding instruction memory.

10. The multi-processor system claimed in claim 8, wherein: said switch matrix wherein
said plurality of crosspoints including a set of third crosspoints disposed at the intersection of said first links connected to said instruction memories and said second links connected to said data ports of said plurality of processors, said set of third crosspoints disabled to prohibit connection when said SIMD/MIMD register indicates the single instruction multiple data (SIMD) mode for a corresponding processor and enabled to permit connection when said SIMD/MIMD register indicates the multiple instruction multiple data (MIMD) mode the corresponding processor;
whereby access to said instruction memories via said data ports of any of said plurality of processors is prohibited when said processor is in the single instruction multiple data (SIMD) mode and permitted when said processor is in the multiple instruction multiple data (MIMD) mode.

11. A multi-processor system operable in either a single instruction multiple data (SIMD) mode or in a multiple instruction multiple instruction multiple data (MIMD) mode for each of said plurality of processors:
a SIMD/MIMD mode register storing therein an indication of either the single instruction multiple data (SIMD) mode of the multiple instruction multiple data (MIMD) mode for each of said plurality of processors;
a plurality of processors, each processors having a data port and an instruction port and operating from instructions provided to said instruction port for controlling a process including movement of data to and from said data port;
a plurality of memories, said plurality of memories including at least one data memory corresponding to each of said processors and an instruction memory corresponding to each of said processors;
a switch matrix connected to said SIMD/MIMD mode register, to each of said plurality of processors and each of said plurality of memories, said switch matrix including
a set of first links connected to said memories,
a set of second links connected to said data ports of said processors,
a third link having a plurality of sections equal in number to the number of said processors, each section connected to said instruction port of a corresponding one of said processors,
a plurality of buffers disposed between adjacent sections of said third link forming a serial chain from a first processor to a last processor, each buffer connecting said adjacent sections of said third link when said SIMD/MIMD mode register indicates the single instruction multiple data (SIMD) mode for a corresponding processor, and splitting said adjacent sections of said third link when said SIMD/MIMD mode register indicates the multiple instruction multiple data (MIMD) mode for the corresponding processor, and
a plurality of crosspoints disposed at intersections between said first links and said second links and at intersections between said first links and said sections of said third link, said crosspoints individually operating to connect said first and second links permitting said data port of a processors to access a memory, and to connect said first links and said third link permitting said instruction port of a processor to receive an instruction from an instruction memory,
said plurality of crosspoints including a first crosspoint disposed at the intersection of said section of said third link connected to said instruction port of a predetermined first processor and said first link connected to said corresponding instruction memory which is always enabled to permit connection,
said plurality of crosspoints including a set of second crosspoints disposed at the intersection of said section of said first link connected to said instruction port of processors other than said predetermined first processor and said respective first link connected to said corresponding instruction memories which are disabled to prohibit connection when said SIMD/MIMD register indicates the single instruction multiple data (SIMD) mode and enabled to permit connection when said SIMD/MIMD register indicates the multiple instruction multiple data (MIMD) mode,
said plurality of crosspoints including a set of third crosspoints disposed at the intersection of said first links connected to said instruction memories and said second links connected to said data ports of said plurality of processors, said set of third crosspoints disabled to prohibit connection when said SIMD/MIND register indicates the single instruction multiple data (SIMD) mode and enabled to permit connection when said SIMD/MIMD register indicates the multiple instruction multiple data (MIND) mode; whereby said instruction port of each processor is connected to said instruction memory corresponding to a first processor in a serial chain in the single instruction multiple data (SIMD) mode, said instruction port of each processor in the multiple instruction multiple data (MIMD) mode is connected to said corresponding instruction memory, and access to said instruction memories via said data ports of said plurality of processors is prohibited when the corresponding processor is in the single instruction multiple data (SIMD) mode and permitted when the corresponding processor is in the multiple instruction multiple data (MIMD) mode.

12. The multi-processor system claimed in claim 11, further comprising:
a communications bus permanently interconnecting all of said processors, said communication bus for coordinating change from the SIMD and MIMD modes.

13. The multi-processor system claimed in claim 12, wherein:
at least a portion of one of said memories is a parameter store memory storing a plurality of parameters; and
said communications bus includes circuitry for exchanging signals which specifies certain parameters stored in said parameter store memory.

14. The multi-processor system claimed in claim 12, wherein:
said parameters are stored in and retrieved from said parameter store memory via said switch matrix.

15. The multi-processor system claimed in claim 12, further comprising:
an external memory for storing instructions for each of said processors;
each instruction memory stores only a portion of the instructions for the corresponding processor; and
each processor further includes
a cache logic circuit connected to said program counter for determining if the instruction corresponding to the address stored in said program counter is stored in said corresponding instruction memory,
said cache logic circuit requesting the instruction corresponding to the address stored in said program counter from said external memory if the instruction is not stored in said corresponding instruction memory when said SIMD/MIMD register indicates said processor is in the multiple instruction multiple data (MIMD) mode,
said cache logic circuit of each processor not a first processor of a serial chain of processors in the single instruction multiple data (SIMD) mode being disabled, and
said cache logic circuit of a first processor of a serial chain of processors in the single instruction multiple data (SIMD) mode requesting the instruction corresponding to the address stored in said program counter from said external memory if the instruction is not stored in said corresponding instruction memory.

16. The multi-processor system claimed in claim 11, wherein: each processor includes
a program counter having stored therein an address of a next instruction, and
an instruction fetch circuit for supplying to said instruction port said address stored in said program counter for fetching an instruction stored at said address; and said switch matrix wherein
said buffers pass signals unidirectionally whereby when said buffers connect said adjacent sections of said third link in the single instruction multiple data (SIMD) mode only said program counter of a first processor in a serial chain of processors in the single instruction multiple data (SIMD) mode accesses its corresponding instruction memory.

Description

TECHNICAL FIELD OF THE INVENTION

This invention relates generally to multiprocessor systems and more particularly to such systems and methods where the several processors are capable of executing the same instruction at the same time or capable of executing different instructions.

CROSS REFERENCE TO RELATED APPLICATIONS

All of the following patent applications are cross-referenced to one another, and all have been assigned to Texas Instruments Incorporated. These applications have been concurrently filed and are hereby incorporated in this patent application by reference.

______________________________________ U.S. Pat. application Serial No. Title ______________________________________ 437,591 Multi-Processor With Crossbar Link of Processors and Memories and Method of Operation 437,856 Reconfigurable Communications for Multi- Processor and Method of Operation 437,852 Reduced Area of Crossbar and Method of Operation 437,853 Synchronized MIMD Multi-Processors, System and Method of Operation 437,946 Sliced Addressing Multi-Processor and Method of Operation 437,857 Ones Counting Circuit and Method of Operation 437,851 Memory Circuit Reconfigurable as Data Memory or Instruction Cache and Method of Operation 437,854 Imaging Computer and Method of Operation 437,875 Switch Matrix Having Integrated Crosspoint Logic and Method of Operation ______________________________________

BACKGROUND OF THE INVENTION

In the world of computers and processors there is an unrelenting drive for additional computing power and faster calculation times. In this context, then, systems in which several processors can be combined to work in parallel with one another are necessary.

Imaging systems which obtain visual images and perform various manipulations with respect to the data and then control the display of the imaged and stored data inherently require large amounts of computations and memory. Such imaging systems are prime candidates for multi-processing where different processors perform different tasks concurrently in parallel. These processors can be working together in the single instruction, multiple data mode (SIMD) where all of the processors are operating from the same instruction but obtaining data from various sources, or the processors can be working together in the multiple instruction, multiple data mode (MIMD) where each processor is working from a different set of instructions and working on data from different sources. For different operations, different configurations are necessary.

Some multi-processing systems are optimized for handling SIMD. In these systems all of the processors operate from a single instruction contained in a particular memory. In this situation, the system is set up to allow one memory to provide the instructions while the data is moved to and from other memories by processors assigned, at least for a period of time, to each other memory. In MIMD systems, each processor is typically assigned a particular memory for both instruction purposes and for data transfer.

One problem with attempts to run a multi-processing system that alternates between SIMD and MIMD modes is that vast amounts of switching, control and synchronism are required to allow for the vast number of permutations that can occur.

There is thus a need in the art for a system which handles a multi-processor having multi-memories such that the system can easily, on a cycle-by-cycle basis (if necessary) switch between the SIMD and MIMD modes on an interchangeable basis.

One method of solving the huge interconnection problem in complex systems such as the image processing system shown in one embodiment of the invention is to construct the entire processor as a single device. Conceptually this might appear easy to achieve, but in reality the problems are complicated.

First of all, an architecture must be created which allows for the efficient movement of information while at the same time conserving precious silicon chip space. The architecture must allow a very high degree of flexibility since once fabricated, it cannot easily be modified for different applications. Also, since the processing capability of the system will be high, there is a need for high band width in the movement of information on and off the chip. This is so since the physical number of leads which can attach to any one chip is limited.

It is also desirable to design an entire parallel processor system, such as an image processor, on a single silicon chip while maintaining the system flexible enough to satisfy wide ranging and constantly changing operational criteria.

It is further desirable to construct such a single chip parallel processor chip where the processor memory interface is easily adaptable to operation in various modes, such as SIMD and MIMD, as well as adaptable to efficient on-off chip data communications.

SUMMARY OF THE INVENTION

These problems have been solved by designing a multiprocessing system to handle image processing and graphics and by constructing a crossbar switch capable of interconnecting any processor with any memory in any configuration for the interchange of data. The system is capable of connecting n parallel processors to m memories where m is greater than n.

The problems inherent with constructing a single chip image processor having a high degree of versatility have been solved by the architecture of establishing a multilink, multi-bus crossbar switch between the individual processors and the individual memories. This architecture, coupled with the design of the high density switch, allows the system to perform in both the SIMD and MIMD modes and allows for full access of all processors to all memories. The crossbar switch is constructed with different length links serving different functions so as to conserve space while still providing a high degree of operational flexibility.

In one embodiment a transfer processor operates to control on-chip/off-chip communications while a master processor serves to control communications to a common memory. In operation, any processor can access any of a number of memories, while certain memories are dedicated to handling instructions for the individual processors.

A processing system operates with a plurality of processors arranged to operate independently from each other from instructions executed on a cycle-by-cycle basis, with the processors accessing a plurality of memories and having circuitry for interconnecting any of the processors with any of the memories and with further circuitry for arranging a group of the processors into the SIMD operating mode where all of the processors of the group operate from the same instruction and with circuitry operable on a processor cycle by cycle basis for changing at least some of the processors from the group of processors from operation in the SIMD operating mode to operation in the MIMD operational mode where each processor of the group operates from separate instructions provided by separate instruction memories.

BRIEF DESCRIPTION OF THE DRAWINGS

For a more complete understanding of the present invention and for further advantages thereof, reference is now made to the following detailed description taken in conjunction with accompanying drawings in which

FIG. 1 shows an overall view of the elements of the image processing system;

FIG. 2 shows an alternative view of the elements of the image processing system;

FIG. 3 shows a series of image processing systems interconnected together into an expanded system;

FIG. 4 shows details of the crossbar switch matrix interconnecting the parallel processors and the memories;

FIG. 5 shows a prior art parallel processor configuration using shared memories;

FIG. 6 shows a prior art parallel processor configuration using distributed memories;

FIG. 7 shows an improved configuration;

FIG. 8 shows a prior art SIMD processor configuration;

FIG. 9 shows a prior art MIMD processor configuration;

FIG. 10 shows some reconfigurable modes of operations of an improved multi-processor;

FIG. 11 is a graph showing some algorithms and control for the image processing system;

FIG. 12 shows an example of the pixel data flow in the SIMD mode;

FIG. 13 shows an example of the pixel data flow in the MIND mode using spliced addressing;

FIG. 14 shows an example of data access in the SIMD in accordance with this invention;

FIG. 15 shows an example of data accessing the MIMD mode in accordance with this invention;

FIG. 16 shows the interrupt polling communication between the processors;

FIG. 17 shows a schematic representation of the layout of the processors and memory interconnected by the crossbar switch;

FIGS. 18 and 19 show details of the crosspoints of the crossbar switch;

FIG. 20 is a graph of wave forms of the contention logic for memory access;

FIG. 21 shows the relationship between the synchronization register of each processor and the synchronization bus;

FIG. 22 shows further details of the synchronization register and synchronization logic within each processor;

FIG. 23 is a graph showing the processes and waveforms for processor synchronization;

FIG. 24 shows an example of pixel data distribution using sliced addressing;

FIG. 25 shows an example of a prior art address adder;

FIG. 26 shows the address adder of this invention using sliced addressing;

FIG. 27 shows an example of the arithmetic employed in sliced addressing;

FIG. 28 shows details of the rearrangement of the instruction data memory for the SIMD/MIMD operational modes;

FIG. 29 shows details of a master processor;

FIG. 30 shows the general structure of the parallel processors;

FIG. 31 shows further detail of the structure of the program flow control unit of each parallel processor;

FIG. 32 shows further detail of the structure of the address unit of each parallel processor;

FIG. 33 shows further detail of the structure of the data unit of each parallel processor;

FIG. 34 shows the status register of each parallel processor;

FIG. 35 is a graph of waveforms of the pipeline sequence for a cache miss;

FIG. 35 is a graph of waveforms of the pipeline sequence for contention resolution;

FIG. 36 is a graph of waveforms of the pipeline sequence for loop control;

FIG. 37 is a graph of waveforms of the pipeline sequence for a branch or call instruction;

FIG. 38 is a graph of waveforms of the pipeline sequence for a branch or call instruction;

FIG. 39 is a graph of waveforms of the pipeline sequence for an interrupt;

FIG. 53 is a functional block diagram of an imaging system;

FIG. 54 is a logic schematic of the ones counting circuit matrix;

FIG. 55 is a logic schematic of a minimized matrix of the ones counting circuit;

FIG. 56 is an example of an application of a ones counting circuit;

FIG. 57 shows a block diagram of the transfer processor;

FIG. 58 shows a block diagram of the parallel processor system used with a VRAM; and

FIGS. 59-64 show various operational mode relationships.

DETAILED DESCRIPTION OF THE INVENTION

Prior to beginning a discussion of the operation of the system, it may be helpful to understand how parallel processing systems have operated in the prior art.

FIG. 5 shows a system having parallel processors 50-53 accessing a single memory 55. The system shown in FIG. 5 is typically called a shared memory system where all of the parallel processors 50-53 share data in and out of the same memory 55.

FIG. 6 shows another prior art system where memory 65-68 is distributed with respect to processors 60-63 on a one-for-one basis. In this type of system, the various processors access their respective memory in parallel and thus operate without memory contention between the processors. The system operating structures shown in FIGS. 5 and 6, as will be discussed hereinafter, are suitable for a particular type of problem, and each is optimized for that type of problem. In the past, systems tended to be either shared or distributed.

As processing requirements become more complex and the speed of operation becomes critical, it is important for systems to be able to handle a wide range of operations, some of which are best performed in the shared memory mode, and some of which are best performed in a distributed memory mode. The structure shown in FIGS. 1 and 2 accomplishes this result by allowing a system to have parallel processing working both in the shared and in the distributed mode. While in these modes, various operational arrangements such as SIMD and MIMD can be achieved.

Multi-Processors and Memory Interconnection

As shown in FIG. 1, there is a set of parallel processors 100-103 and a master processor 12 connected to a series of memories 10 via a cycle-rate local connection network switch matrix 20 called a crossbar switch. The crossbar switch, as will be shown, is operative on a cycle by cycle basis to interconnect the various processors with the various memories so that different combinations of distributed and shared memory arrangements can be achieved from time to time as necessary for the particular operation. Also, as will be shown, certain groups of processors can be operating in a distributed mode with respect to certain memories, while other processors concurrently can be operating in the shared mode with respect to each other and with respect to a particular memory.

Another view of the system is shown in FIG. 2 in which the four parallel processors 100, 101, 102, 103 are shown connected to memory 10 via switch matrix 20 which is shown in FIG. 2 as a distributed bus. Also connected to memory 10 via crossbar switch 20 is transfer processor 11 and master processor 12. Master processor 12 is also connected to data cache 13 via bus 171 and instruction cache 14 via bus 172. The parallel processors 100 through 103 are interconnected via communication bus 40 so that the processors, as will be discussed hereinafter, can communicate with each other and with master processor 12 and with transfer processor 11. Transfer processor 11 communicates with external memory 15 via bus 21.

Also in FIG. 2, frame controllers 170 are shown communicating with transfer processor 11 via bus 110. Frame controllers 170 serve to control image inputs and outputs as will be discussed hereinafter. These inputs can be, for example, a video camera, and the output can be, for example, a data display. Any other type of image input or image output could also be utilized in the manner to be more fully discussed hereinafter.

Crossbar switch 20 is shown distributed, and in this form tends to mitigate communication bottlenecks so that communications can flow easily between the various parts of the system. The crossbar switch is integrated on a single chip with the processors and with the memory thereby further enhancing communications among the system elements.

Also, it should be noted that fabrication on a chip is in layers and the switch matrix may have elements on various different layers. When representing the switch pictorially, it is shown in crossbar fashion with horizontals and verticals. In actual practice these may be all running in the same direction only separated spatially from one another. Thus, the terms horizontal and vertical, when applied to the links of the switch matrix, may be interchanged with each other and refer to spatially separated lines in the same or different parallel planes.

Digressing momentarily, the system can operate in several operational modes, one of these modes being a single instruction multiple data (SIMD) mode where a single instruction stream is supplied to more than one parallel processor, and each processor can access the same memory or different memories to operate on the data. The second operational mode is the multiple instruction, multiple data mode (MIMD) where multiple instructions coming from perhaps different memories operate multiple processors operating on data which comes from the same or different memory data banks. These two operational modes are but two of many different operational modes that the system can operate in, and as will be seen, the system can easily switch between operational modes periodically when necessary to operate the different algorithms of the different instruction streams.

Returning briefly to FIG. 1, master processor 12 is shown connected to the memories via crossbar switch 20. Transfer processor 11, which is also shown connected to crossbar switch 20, is shown connected via bus 21 to external memory 15. Also note that as part of memory 10, there are several independent memories and a parameter memory which will be used in conjunction with processor interconnection bus 40 in a manner to be more fully detailed hereinafter. While FIG. 2 shows a single parameter memory, in actuality the parameter memory can be several RAMS (random access memories) per processor which makes communication more efficient and allows the processors to communicate with the RAMS concurrently.

FIG. 4 shows a more detailed view of FIGS. 1 and 2 where the four parallel processors 100-103 are shown interconnected by communication bus 40 and also shown connected to memory 10 via crossbar switch matrix 20. The various crosspoints of the crossbar switch will be referred to by their coordinate locations starting in a lower left corner with 0-0. In the numbering scheme, the vertical number will be used first. Thus, the lower left corner crosspoint is known as 0-0, and the one immediately to the right in the bottom row would be 1-0. FIG. 19 which will be discussed hereinafter, shows the details of a particular crosspoint, such as crosspoint 1-5. Continuing now in FIG. 4, the individual parallel processors, such as parallel processor
103, are shown having a global data connection (G), a local data connection (L) and an instruction connection (I). Each of these will be detailed hereinafter, and each serves a different purpose. For example, the global connection allows processor 103
to be connected to any of the several individual memories of memory 10, which can be for data from any of the various individual memories.

The local memory ports of the parallel processors can each address only the memories that are served by three of the vertical switch matrix links immediately opposite the processors. Thus, processor 103 can use verticals 0, 1 and 2 of crossbar
20 to access memories 10-16, 10-15 and 10-14 for data transfer in the MIMD mode. In addition, while in the MIMD mode, memory 10-13 supplies an instruction stream to processor 103. As will be seen, in SIMD mode all of the instructions for the processors come from memory 10-1. Thus, instruction memory 10-13 is available for data. In this situation, the switch is reconfigured to allow access via vertical 4 of crossbar 20. The manner in which crossbar 20 is reconfigured will be discussed hereinafter.

As shown in FIG. 4, each parallel processor 100-103 has a particular global bus and a particular local bus to allow the processor access to the various memories. Thus, parallel processor 100 has a global bus which is horizontal 2 of crossbar 20, while parallel processor 101 has a global bus which is horizontal 3 of crossbar 20. Parallel processor 102 has as its global bus horizontal 4, while parallel processor 103 has as its global bus horizontal 5.

The local buses from all of the processors share the same horizontal 6. However, horizontal 6, as can be seen, is separated into four portions via three-state buffers 404, 405 and 406. This effectively provides isolation on horizontal 6 so that each local input to each processor can access different memories. This arrangement has been constructed for efficiency of layout area on the silicon chip. These buffers allow the various portions to be connected together when desired in the manner to be detailed hereinafter for the common communication of data between the processors. This structure allows data from memories 10-0, 10-2, 10-3 and 10-4 to be distributed to any of the processors 100-103.

When the processor is operating in the MIMD operational mode, the instruction port of the processors, for example, the instruction port of processor 103, is connected through crosspoint 4-7 to instruction memory 10-13. In this mode crosspoints
4-2, 4-3, 4-4, 4-5 and 4-6, as well as 4-1, are disabled. In this mode crosspoint 4-0 is a dynamically operative crosspoint, thereby allowing the transfer processor to also access instruction memory 10-13, if necessary. This same procedure is available with respect to crosspoint 9-7 (processor 102) and crosspoint 14-7 (processor 101).

When the system is in the SIMD mode crosspoint 4-7 is inactive, and crosspoints 4-2 through 4-6 may be activated, thereby allowing memory 10-13 to become available for data to all of the processors 100-103 via vertical 4 of crossbar 20. Concurrently, while in the SIMD mode buffers 401, 402 and 403 are activated, thereby allowing instruction memory 10-1 to be accessed by all of the processors 100-103 via their respective instruction, inputs. If buffer 403 is activated, but not buffers
401 and 402, then processors 100 and 101 can share instructor memory 10-1 and operate in the SIMD mode while processors 102 and 103 are free to run in MIMD mode out of memories 10-13 and 10-9 respectively.

Crosspoints 18-0, 13-0, 8-0 and 3-0 are used to allow transfer processor 11 to be connected to the instruction inputs of any of the parallel processors. This communication can be for various purposes, including allowing the transfer processor to have access to the parallel processors in situations where there are cache misses.

FIG. 7 is a stylized diagram showing the operation of parallel processors 100-103 operating with respect to memories 55 and 55A in the shared mode (as previously discussed with respect to FIG. 5) and operating with respect to memories 65-68 in the distributed mode (as previously discussed with respect to FIG. 6). The manner of achieving this flexible arrangement of parallel processors will be discussed and shown to depend upon the operation of crossbar switch 20 which is arranged with a plurality of links to be individually operated at crosspoints thereof to effect the different arrangements desired.

Before progressing to discuss the operation of the crossbar switch, it might be helpful to review FIG. 3 and alternate arrangements where a bus 34 can be established connected to a series of processors 30-32, each processor having the configuration shown with respect to FIGS. 1 and 2. External memory 35 is shown in FIG. 2 as a single memory 15, the same memory discussed previously. This memory could be a series of individual memories, both local and located remotely. The structure shown in FIG. 3 can be used to integrate any number of different type of processors together with the image system processor discussed herein, assuming that all of the processors access a single global memory space having a unified addressing capability. This arrangement also assumes a unified contention arrangement for the memory access via bus 34 so that all of the processors can communicate and can maintain order while they each perform their own independent operations. Host processor 33 can share some of the policing problems between the various processors 30-32 to assure an orderly flow of data via bus 34.

Image Processing

In image processing there are several levels of operations that can be performed on an image. These can be thought about as being different levels with the lowest level being simply to message the data to perform basic operations without understanding the contents of the data. This can be, for example, removal of extraneous specks from an image. A higher level would be to operate on a particular portion of the data, for example, recognizing that some portion of the data represents a circle, but not fully understanding that the circle is one part of a human face. A still higher operational aspect of image processing would be to process the image understanding that the various circles and other shapes form a human image, or other image, and to then utilize this information in various ways.

Each of these levels of image processing is performed most efficiently with the processors operating in a particular type of operational mode. Thus, when operations are performed on data locally grouped together without an attempt to understand the entire image, it is usually more efficient to use the SIMD operational mode where all, or a group of, processors operate from a single instruction and from multiple data sources. When operating in a higher mode where image pixel data is required from various aspects of the entire image in order to understand the entire image, the most efficient operational mode would be the MIMD mode where the processors each operate from individual instructions.

It is important to understand that when the system is operating in the SIMD mode, the entire pixel image can be processed through the various processors operating from a single instruction stream. This would be, for example, when the entire image is to be cleaned, or the image is enhanced to show various corners or edges. Then all of the image data passes through the processors in the SIMD mode, but at any one time data from various different areas of the image cannot be processed in a different manner for different purposes. The general operational characteristic of a SIMD operation is that at any period of time a relatively small amount of the data with respect to the entire image is being operated on. This is followed, in sequential fashion, by more data being operated on in the same manner.

This is in contrast to the MIMD mode where data from various parts of the image is being processed concurrently, some using different algorithms. In this arrangement, different instructions are operating on different data at the same time to achieve a desired result. A simple example would include many different SIMD algorithms (like clean, enhance, extract) operating concurrently or pipelined on many different processors. Another example with MIMD would include the implementation of algorithms with the same data flow although using unique arithmetic or logical functions.

FIGS. 8 and 9 show the prior art form of the SIMD and MIMD processors with their respective memories. These are the preferred typologies for SIMD/MIMD for image processing. The operational modes of the system will be discussed more fully with respect to FIGS. 59-64. In general, data paths 80 of FIG. 8 corresponds to data paths 6010, 6020, 6030 and 6040 of FIG. 60, while processors 90 of FIG. 9 corresponds to processors 5901, 5911, 5921, 5931 of FIG. 59. The controller (6002 of FIG. 60) for the data paths is not shown in FIG. 8.

Reconfigurable SIMD/MIMD

FIG. 10 shows the reconfigurable SIMD/MIMD topology of this invention where several parallel processors can be interconnected via crossbar switch 20 to a series of memories 10 and can be connected via a transfer processor 11 to external memory
15, all on a cycle by cycle basis.

One of the problems of operating in the MIMD topology is that data access can require high bandwidth as compared to operation in the SIMD mode where the effective data flow is on a serial basis or is emulated in the topology. Thus, in the SIMD mode, the data typically flows sequentially through the various processors from one processor to the next. This can be a blessing as well as a problem. The problem arises in that all of the data of the image has to be processed in order to arrive at a certain point in the processing. This is accomplished in the SIMD mode in a serial fashion. However, the MIMD mode solves this type of a problem because data from the individual memories can be obtained at any time in the cycle, as contrasted to the operation in the SIMD where the shared memory can only be accessed upon a serial basis as the data arrives.

However, the MIMD mode has operational bottlenecks when it is required to have interprocessor communication since then one processor must write the data to a memory and then the other processor must know the information is there and then access that memory. This can require several cycles of operational time and thus large images with vast pixel data could require high processing times. This is a major difficulty. In the structure of FIG. 10, as discussed, these problems have been overcome because the crossbar switch can serve to, on a cycle by cycle basis if necessary, interconnect various processors together to work from a single instruction for a period of time or to work independently so that data which is stored in a first memory can remain in that memory while a different processor is for, one cycle or for a period of time, connected to that same memory. In essence, in some of the prior art, the data must be moved from memory to memory for access by the various processors, which in the instant system the data can remain constant in the memory while the processors are switched as necessary between the memories. This allows for complete flexibility of processor and memory operation as well as optimal use of data transfer resources.

A specific example of the processing of data in the various SIMD and MIMD modes can be shown with respect to FIGS. 12 and 13. In FIG. 12 there is shown an image 125 having a series of pixels 0-n. Note that while in the image a row is shown having only four pixels, this is by way of example only, and a typical image would have perhaps a thousand rows, each row having a thousand pixels. At any one point in time the number of pixels in a row and the number of rows will vary. For our purposes, we will assume that the row has four pixels. One way of representing these pixels in memory 124 is to put them into individual addressable spaces shown as pixels 0, pixel 1 down to pixel n in memory 124. Of course, this can be one memory or a series of memories, as will be discussed hereinafter. The memories could be arranged such that each row is stored in a different memory.

Assume now that it is desirable to process all of the data, either for all of the pixels or for any subgroup of the pixels, so that all of the data is processed by the same instruction and is returned back to memory. In this manner the data from memory 124 pixel 0 would be loaded into processor 120 and then shifted from processors 120 to 121, to 122, to 123, and at each shift new data would be entered. Using this approach, each of the processors 120-123 has an opportunity to perform a function on the data as well as to observe the functions previously performed on the data. When the chain is finished, the data is returned to memory. This cycle can continue so that all of the pixels in the subset, or all of the pixels in the image, can be processed sequentially through the system. This type of operation is performed best in the SIMD mode.

This is in contrast to the arrangement shown in FIG. 13 where the MIMD data flow is illustrated. In such a system, it is perhaps desirable to have pixels 0 through 3 and 250-500 processed in a particular manner, while other pixels from other image regions (which differ from a certain region 3 of the image) are processed in a different manner. In this way then processor 120 would be arranged to process pixels 0-3 and pixels 250-500 while processor 121 is arranged to process pixels 50-75 and pixels 2000-3000. Each region can then be processed using different algorithms or by the same algorithm but with program flow changes that are dependant on the data contents. These pixels are all processed in parallel and stored at various memory locations. In this mode the MIMD operation would be faster than the SIMD operation except in situations where data would have to move from processor 121 to processor 120, in which case there would have to be a movement of data in the memory bank. This interprocessor data movement could be required, for example, in situations where data processed from a particular region is important in determining how to process data from another region, or for determining exactly what the total image represents. Just as it is difficult to determine the shape of an elephant from a grasp of its trunk, it is equally difficult to obtain meaningful information from an image without access to different portions of the pixel data.

Turning now to FIG. 14, there is graphically illustrated a system utilizing the present invention. Crossbar switch 20 allows processors 100-103 to access individual memories M1-M4 of memory 10, and on a cycle by cycle basis. The structure shown in FIG. 14 allows the operation described in FIG. 12 with respect to the SIMD operation such that the data in the memory elements, M1-M4 remains stationary and the connections from the processor switch. The continual flow of the process is enhanced by having more memory elements than actually utilized by the processors at a given instance. Thus, data can move in and out from these "extra" memory elements, and these extra elements can be cycled into the operational stream. In such an arrangement, data in and data out memory elements would, on a cycle by cycle basis, be different memory elements. Note that the data in and data out memories are switched through the crossbar and thus can be positioned in any of the memory elements. Thus, instead of moving the data between memories, the processor connection is sequentially changed.

Turning now to FIG. 15, the MIMD mode is shown such that processors 100-103 are connected through crossbar connections would last through several cycles and thus, the processors each would be connected to the respective memories for a period of time. While this is not necessary, it would be the most typical operation in the MIMD mode. For any processor, or group of processors operating in the MIMD mode of FIG. 15, crossbar switch 20 can, on a cycle by cycle basis, be operated so that data from a particular memory element is immediately made available to any of the other processors so that the data can either be cycled through the other processors or operated on a one-time basis.

Reconfigurable Interprocessor Communication

FIG. 16 shows the diagram of interprocessor communication when the system is operating in the MIMD mode when the various processors must communicate with each other. A processor, such as processor 100, sends a message through crossbar switch 20
to the shared parameter memory while at the same time registering a message (interrupt) in the destination processor that a parameter message is waiting. The destination processor, which can be any one of the other processors such as processor 102, then via crossbar switch 20 accesses the shared parameter memory to remove the message. The destination processor, for example, then could reconfigure itself in accordance with the received message. This reconfiguration can be internal to provide a particular system mode of operation or can be an instruction as to which memories to access and which memories not to access for a period of time.

The question of accessing memories (contention) is important because a processor can waste a lot of time trying to access a memory when another processor is using that memory for an extended period. The efficient operation of the system would be very difficult to achieve without the interprocessor coupling via the communication link.

Another type of message which is communicated between the processors relates to the synchronization of the processors. These messages and the precise manner in which synchronization is accomplished will be discussed hereinafter. FIG. 2 shows the full system arrangement where the processors are interconnected for interrupting or polling between them to control sync, memory and crossbar allocation on a cycle by cycle basis.

It is the communication links between the processors which function outside of the crossbar switch that supports a more efficient utilization of the memory. The number of cycles that are required to switch operational modes, for example between SIMD and MIMD, is dependent upon the amount of other operations which must be performed. These other operations are, for example, loading of code in various instruction memories and the loading of data into data memories for subsequent operation. The external communications help this function by establishing which memories a particular processor may access and instructing all of the processors as to their ability to access memories so that the processors are not waiting in line for access when the access is being denied.

The instructions between processors can be by interrupt and by polling. The interrupt can be in any one of the well-known interrupt configurations where data can be transmitted with a flag to point to particular message locations within the shared parameter memory or can operate directly on a pointer basis within the processor. The ability to establish on a cycle by cycle basis which processor has access to which memory is important in establishing the ability of the system to operate in the MIMD mode so that data can reside in a particular memory, and the processors which have access to that data are continually shifted. Using this arrangement then, several cycles of time, which would be required to move data from memory to memory if the memories were on a fixed relationship to processors, are dramatically eliminated. The communication link includes the master processor.

Transfer Processor

Transfer processor 11 shown in FIGS. 1 and 2 and in FIG. 57 transfers data between external memory and the various internal memory elements. Transfer processor 11 is designed to operate from packet requests such that any of the parallel processors or the master processor can ask transfer processor 11 to provide data for any particular pixel or a group of pixels or data, and the transfer processor will transfer the necessary data to or from external and internal memory without further processor intervention instructions. This then allows transfer processor 11 to work autonomously and to process data in and out of the system without monitoring by any of the processors. Transfer processor 11 is connected to all of the memories through switch matrix 20 and is arranged to contend with the various links for access to the memories. Transfer processor 11 for any particular link may be assigned the lowest priority and access a memory when another processor is not accessing that memory. The data that is being moved by the transfer processor is not only the data for processing pixels, but instruction streams for controlling the system. These instruction streams are loaded into the instruction memory via crossbar switch 20. Transfer processor 11 can be arranged with a combination of hardware and software to effect the purpose of data transfer.

Master Processor

The master processor, shown in more detail in FIGURE 29, is used for scheduling and control of the entire system, including the control of the transfer processor as well as the interaction between the various processors. The master processor has a connection through the crossbar switch to all of the memories and is interconnected with the other processors on the communication channel. The master processor can control the type of data and the manner in which the data is obtained by the transfer processor depending upon the pixel information and the particular purpose for which the information is being obtained. Thus, regions of the image can be scanned under different scan modes depending upon the purpose for the scan. This is controlled by the master processor working in conjunction with the parallel processors. The parallel processors may each also control the transfer processor, either alone or in conjunction with the master processor, again depending upon the purpose for the operation.

The contention for the memory to the crossbar switch can be arranged such that the parallel processors have higher priority, the master processor has lower priority, and the transfer processor has third or lowest priority for any particular memory on a particular link.

FIG. 11 shows a listing of various operations or algorithms which the imaging processing system would typically perform. A typical type of operation would be optical character recognition, target recognition or movement recognition. In each of these situations, the associated image processing would be controlled by the kind of operations to be performed.

In FIG. 11, the types of operations which are typically performed by the parallel processors are shown below line 1100 and the types of operations which are typically performed by the master processor are shown above line 1100. While this arrangement of operations is arbitrarily divided between the master processor and the parallel processors, the types of operations required to achieve the various operations shown tend to make them more suitable for either the master processor or the parallel processor.

As an example of image processing starting from an image and working higher in the hierarchy of operations, the image is first received by image enhancement 1111. In some situations it is necessary to compress or decompress the image via boxes
1112 and 1113. The image is then moved upwards through the various possibilities for edge extraction 1109, line linkage 1107, corner or vertices recognition 1105, histogram 1110, statistical properties 1108 and segmentation 1106. These boxes can all be skipped and the image provided directly to template matching 1102 for the purpose of determining the image identification 1101. There are various methods of achieving this identification, all of which are not necessary for every image, and all of which are well known in the art as individual algorithms or methods.

Enhancement block 1111 is a process which essentially cleans an image, removes extraneous signals and enhances details of the image, such as lines. Box 1109, edge extraction, is a process which determines the causes or existence of edges in an image. Box 1107 connects all the lines which have been extracted from the image and links them together to form longer lines. The process then removes extraneous dashes caused by inconsistencies in the data. Box 1105, corners and vertices, is an algorithm which determines where the corners of an image might be located. Once these geometric shapes are found, a process of grouping and labeling, block 1104, can then be used to identify major groupings of objects, such as circles and rectangles.

At this point, the operations have centered their focus on a smaller region of the image whereas in block 1111 the entire image is typically operated on. An alternate path after every enhancement is to perform statistical analysis, such as a histogram, 1110, of the intensities of the pixels. One purpose of a histogram is to discover the number of ones or the number of ones in a particular axis or projection which would then be useful statistical information to quantify the presence of some object or orientation of an object. This will be discussed hereinafter.

Block 1108, statistical properties, then extracts from these histograms the proper statistical properties. Continuing upward, block 1106 is a process of segmentation whereby the statistical properties could be used to segment different objects. As an example, several disconnected objects could then be quite easily segmented. Then through the progression to grouping and labeling 1104, where an image has different objects identified with specific labels. Connector component algorithms are typical in this area. At this point also certain geometric features can be analyzed 1103, particularly the perimeter of the object. Other shape descripters, Euler numbers, and a description of the surface can be obtained and used for future matching operations. Matching operations level 1102 is reached where similar information which is stored as templates or libraries are accessed and compared against the data that is extracted from the lower level. This can be either geometric, surface description or optical flow information. Once a match has occurred, these matches then are statistically weighted to determine the degree of certainty that an object has been identified as shown by block 1101. Once we have identified objects, we will in some applications such as stereopsis or motion have a three dimensional representation of the world knowing what the objects are and where they are placed in the world. At this point we can then re-render the scene using a graphics pipeline as shown by the right side of FIG. 11.

The first block, geometric model 1114, identifies a representation of this scene which basically is three coordinates showing position and a geometric description of the object such as its shape, density and reflective properties. At this point, depending upon the type of object, several different routes would be used to render the scene. If there were simple characters, two dimensional transforms would be employed. If they were more complex, three dimensional worlds would be created. A hand waving in front of a computer for use as a gesture input device would use this method and implement function 1116, which is a three dimensional transform. This would transform the input into a new coordinate system, either by translating scaling or rotating the three dimensional coordinates via 3D transform block 1116. Certain objects would be occluded by other objects. Again in the hand example, some fingers may be occluded by other fingers, and this operation using visibility block 1117 would then ignore the parts that were not visible. As we move down in FIG. 11 to shaded solid box 1118 we find a process which would generate gray scale or pixel information to give a smooth shaded solid image which would be more realistic and more lifelike than taking the other route down to clipping box 1120. Clipping box 1120 essentially clips things that are out of the field of view of the scene that is being generated.

In a special case of rendering fonts on a computer screen or on a laser printer or such, box 1119, font compilation, would be used to create sophisticated fonts of multiple sizes and shapes. Then the final process in the graphics program would be actually to draw the objects, via block 1121, which might be as simple as drawing dots and lines that connect the dots. We are now back at the original level of image enhancement 1111 and have recreated a synthetic representation of original image based upon a model which has been derived from that original image.

It is understood that once a character is recognized or a movement is recognized, an output can be obtained, either in binary code or otherwise, to control further processing of the same image via output control 1122 by the operation and the combination of the parallel processors and the master processor working with the image processing system.

Generally, the boxes shown below line 1100 are typically operationally efficient to be performed in the SIMD mode and require a vast amount of processing. These are performed with the parallel processing operation. The operations above line
1100 require relatively less processing capabilities and are less bandwidth intensive. Accordingly, they are performed by a single processor. Also note that with respect to the operations, as the hierarchy moves upwards on the chart the likelihood is that the MIMD operations would be the preferred operation. Often the SIMD and MIMD operations overlap, and both types of operational modes are required.

The main reason why two different types of processors are necessary is because of the level of the processing. High level processing, as performed by the master processor, preferably uses floating point arithmetic for high precision. High precision floating point processors require more real estate space and are slower to operate from non-floating point processors. Therefore, if all of the processors were the same, there could be fewer processors on a given chip which would increase the problem of bandwidth and slow down the operation of the system. On the other hand, the low level processors do not require floating point arithmetic and thus can be made faster and smaller, which in turn allows more processors to be constructed on a given chip. The bus structure shown utilizing a crossbar switch can therefore take several different types of processors as required and switch them into the system to perform portions of every operation if necessary.

The master processor is designed to operate primarily on lists such as information lists and display lists, whereas the parallel processors are intended to operate on arrays. At the low level image processing most of the information can be described as two dimensional arrays, whereas at the higher level, the information is described as lists of multidimensional coordinates. The manipulation of these two different types of data representations requires different processing structures which is another motivation for the master and parallel processors having different structures.

The master processor of the preferred embodiment would have features similar to a RISC (reduced instruction set computer) processor which is primarily intended for general purpose computing operations, whereas the parallel processors are more like digital signal processors (DSP) which tend to be specialized processors for arithmetic operations. Thus, the system could be optimized for the types of information processing required for image systems, while still maintaining the high degree of processing capability and the total flexibility achieved by using both types of processors on the same data.

Texas Instruments TMS 320 DSP processors are disclosed in coassigned U.S. Pat. Nos. 4,577,282 and 4,713,748 and 4,912,636. Further background is disclosed in the publications Second Generation TMS 320 Users Guide and Third Generation TMS 320
Users Guide from Texas Instruments Incorporated. These patents, said application and publications are hereby incorporated herein by reference.

Memory Structure

FIG. 17 shows a view of the image processing system, as discussed with respect to FIGS. 1 and 2, showing a particular layout of memory. It should be kept in mind, however, that the particular memory sizes have been selected for a particular project, and any type of arrangement of memory and memory capacities can be utilized with this invention. The parameter section of memory 10 can be incorporated within memory 10 or can be, if desired, a stand-alone memory. Under some conditions the parameter memory need not be present depending upon the communication requirements of the processors.

Crossbar Switch

FIG. 18 shows the prioritization circuitry of crossbar switch 20. Each vertical of the crossbar switch is connected in a round robin fashion to a prioritization circuit internal to the particular crosspoint. In every vertical the lowest horizontal, which is associated with the transfer processor, is not included in the prioritization wiring. This is so that when none of the other horizontals in the same vertical have been selected, the transfer processor has access to the memory. The exact manner in which the prioritization circuitry operates and the manner in which the lowest horizontal operates will be detailed more fully hereinafter with respect to FIGS. 19 and 20.

FIG. 18 also shows the special situation of the instruction vertical I for the parallel processors. The instruction vertical for parallel processor 103 is connected through crosspoint 4-7, which crosspoint is enabled by a signal on the SIMD lead via invertor 1801. This same signal is provided to every horizontal crosspoint 4-1 through 4-6 in the same vertical to render those crosspoints inactive. This signal and the manner in which the instruction vertical is connected to memory will be discussed hereinafter.

Turning now to FIG. 19, the details of an exemplary crosspoint 1-5 is shown in detail. In the figure, the five sided box with a control line entering the side is a control switch, typically a FET device.

The functionality of the crosspoint logic is described. The crosspoint logic contains four functional blocks. These will each be described. The first functional block is address recognition block 1901 which compares five bits of the address supplied by the processor on bus 1932 with the unique five bit value of the memory module 10-15 (connected to crosspoint 1-5 via vertical 1 as shown in FIG. 4) presented on bus 1930. The value presented on bus 1930 indicates the location of the memory within the address space. The comparison is achieved by five two-input exclusive-NOR gates 1920-1924 which perform individual bit comparisons. The outputs of these five gates are supplied to five of the inputs of the six input NAND gate 1910. The sixth input of gate 1910 is connected to the global access signal 1933 which indicates that a memory request is actually being performed and the address output by the processor should actually be compared. Only when signal 1933 is a logic one and the outputs of gates 1920-1924 are also all one will the output of gate 1910 be a logical zero. A logic zero indicates that a valid request for memory 10-15 is being made.

Digressing, a modification that can be made to this address recognition logic is to include a seventh input to gate 1910 (enable SIMD) that can be used as an enable signal for the crosspoint logic. A logical zero on the enable signal will cause the address recognition logic to be disabled, thus disabling the entire crosspoint. This is used on the crosspoints on vertical buses 4, 9, and 14 which connect to horizontal buses 1 to 6, to enable the crosspoints in SIMD mode and disable them in MIMD mode.

The second functional block is token latch 1904. This block outputs a signal B1 which is used to indicate the start point of the round-robin prioritization. Signal B1 connects to the input signal B of the next crosspoint logic vertically below crosspoint 1-5, (crosspoint 1-4). (Signal B1 of crosspoint 1-1 is wrapped around to connect to signal B of crosspoint 1-6 to create a circular prioritization scheme as shown in FIG. 18). Only one signal B1 within the crosspoint logics associated with vertical bus 1 will output a logical zero. All the others will output logical ones. This is achieved by only loading one crosspoint token latch 1904 with a value of zero at system initialization, and the other crosspoint token latches with a one. This is achieved by connecting the preset value signal to a logical zero on one crosspoint and a logical one on the others and activating clock5. This loads the preset value through transistor 1956 into the latch comprised of inverter 1946 and inverter 1945. This value in turn is clocked with clock2 through transistor 1955 into the latch comprising inverter 1947 and inverter 1948. The output of inverter 1947 is signal B1. This signal is supplied to one input of the two-input NAND gate 1913 whose other input is the output of gate 1910. The output of gate 1913 is supplied to one input of the two-input NAND gate 1914, whose other input comes from the output of gate 1911. The output of gate 1914 is clocked by clock4 through transistor 1953 into the earlier described latch of gates 1945 and 1946. It is arranged that clock2 and clock4 are never active simultaneously, and that clock4 is not active when clock5 is active.

The logic of the token latch records which crosspoint logic associated with memory 10-15 last gained access to the memory. This is indicated by a logical zero B1 signal being output by that crosspoint latch. The token latch logic works in conjunction with the prioritization block, to be described next, to cause the crosspoint which last accessed the memory to have the lowest priority access, if future multiple simultaneous accesses are attempted to the memory. How the token latch contents are altered will be described after the prioritization block has been described.

The prioritization block 1902 contains two two-input NAND gates 1911 and 1912. The two inputs of gate 1912 are supplied from the output of gates 1910 and 1911. The output of gate 1912 is signal A1 which connects to signal A of the vertically below crosspoint (1-4). One input of gate 1911 is the previously mentioned signal B which is connected to signal B1 from the token latch in the logic circuit associated with the next higher vertical (crosspoint 1-6). The other signal is also the previously described signal A which is connected to signal A1 from the prioritization block in the next higher vertical (crosspoint logic 1-6).

The prioritization logic forms a circular ripple path that begins with the crosspoint logic vertically below the last crosspoint to access the memory. This is indicated by a logical zero on a B1 signal. This causes the output of gate 1911 of the next vertical crosspoint below to be a logical one. This is gated by gate 1912 with the output of gate 1910 in order to produce signal A1. If the output of gate 1910 is a logical one, indicating that an address match by the address recognition logic wasn't found, then signal A1 will be a zero. This is passed to the next lower vertical crosspoint, causing its gate 1911 to output a one, and so on around the circular ripple path. If however the output of gate 1910 is a zero, then the signal A1
will be output to the next crosspoint as a logical one. This, in conjunction with a one on all subsequent B inputs (since only the ripple start point can output a zero B signal), causes all other gates 1911 around the ripple path to output logical zeros. Thus, a crosspoint can gain access to a memory only when it has a one on the output of its gate 1911 and it is producing a logical zero on the output of its gate 1910. This occurs only when an address match is found by the address recognition block and the crosspoint is the first to request a memory access from the start of the circular ripple path.

The management of the token latch contents will now be explained. Gates 1913 and 1914 are designed to make sure that the last crosspoint to gain memory access holds a zero in the token latch. Consider the following cases:

1. The token in token latch 1904 is a zero and no bus requires memory access. The zero ripples completely around the circular carry path and returns to signal A of the originating crosspoint as a zero, causing the output of gate 1911 to be a one. The zero already held in the token latch (signal B1) causes the output of gate 1913 to be a one. These two signals cause the output of gate 1914 to be a zero, which is loaded into the latch 1945/1946 by clock4 via transistor 1953, thus maintaining a zero in the token latch, thereby continuing the ripple propagation.

2. The token in token latch 1904 is a zero and one of the other crosspoints requires access to the memory. In this case, signal A will be received back as a one, which in conjunction with the one on input B will cause the output of gate 1911 to be a zero, causing the output of gate 1914 to be a one. This is then loaded into token latch 1904 by clock4 as a one. The token latch has thus become a one since another crosspoint has just gained memory access.

3. The token in token latch 1904 is a one and a crosspoint prioritized higher is requesting memory access. In this case A and B are both received as ones and, as in the above case, the token will similarly be loaded with a one.

4. The token in token latch 1904 is a one, the crosspoint is requesting memory access, and no higher priority crosspoint is requesting memory access. In this case either A or B will be received as a zero, causing the output of gate 1911 to be a one. The output of gate 1910 will be a zero, since the address recognition logic is detecting an address match. This will cause the output of gate 1913 to be a one. Since both inputs of gate 1914 are one, it will output a zero, which is loaded into token latch 1904 by clock4. The token latch has thus become a zero because it has just been granted memory access.

The fourth block of logic is the grant latch. The output of gate 1910 is passed through an inverter 1940 into one input of a two-input NAND gate 1915, whose other input is connected to the output of gate 1911. The one condition of a logical one on the output of gate 1911 and a zero on the output of gate 1910 causes the output of gate 1915 to be a zero. (Otherwise it is a one). This condition occurs when the crosspoint is successfully granted access to the memory, and can occur on only one of the crosspoints associated with the memory. The output of gate 1915 is loaded into latch 1941/1942 through transistor 1951 by clock1. (In practice clock1 and clock4 will operate together so that the token latch and the grant latch are updated together). The output of gate 1942 is loaded through transistor 1952 by clock2 into latch 1943/1944. The output of gate 1944 is passed to gate 1949 which produces the connect signal to the crosspoint switches 1905, which connect processor bus 1932 to memory bus 1931. These crosspoint switches can be individual n-type transistors in their simplest implementation.

The output of gate 1942 is also supplied to the gate of transistor 1958 which connects between signal 1934 and the source of transistor 1957, whose drain connects to ground, and whose gate is connected to clock2. Transistors 1957 and 1958 cause signal 1934 to be connected to ground when the crosspoint has successfully been granted memory access. This indicates to the processor that it can proceed with the memory access. If however signal 1934 does not go low when a memory access is attempted, then another crosspoint has gained memory access and the processor must halt and re-request access to the memory. The round-robin prioritization scheme described ensures that only a limited number of retries need be performed before access is granted.

An example of the timing of the crossbar signals is given in FIG. 20. In this figure PP2 and PP3 are both trying to access the same RAM every cycle, but the round-robin priority logic causes them to alternate. PP2 is calculating and outputting addresses S, T and U, and PP3 is calculating and outputting addresses V and W. It can be seen from the "15 MS" signals how the GRANTED-signal is used to multiplex between the last address (in the case of a retry) and the new address being calculated. The PPs assume that if the GRANTED-signal is not active by the end of the slave phase then contention occurred, and the master update phases of the fetch, address and execute pipeline stages are killed.

Integration of the Switch Matrix

As discussed herein, memory contention is handled by a token passing arrangement having logic circuitry individual to each crosspoint. In one embodiment, the logic circuitry is positioned in direct association with each crosspoint. Thus, since the crosspoints are spatially distributed across the substrate in conjunction with their respective ports, the contention control logic is likewise distributed spatially. In addition to saving space the actual logic of the circuit can grow as the switch grows. In this manner the logic can be positioned in one of the layers of the silicon so that no additional silicon chip area is consumed. This has the advantage of conserving space while also minimizing connections to and from the token passing circuit.

Synchronized MIMD

Each processor 100-103, as shown in FIG. 21, has associated with it a register 2100-2103 respectively for indicating if synchronized operation is required. Also included, as will be seen, is a register for holding the address (identity) of the other processors synchronized with that processor. The instruction stream contains instructions which indicate the beginning and end of a series of instructions that must be executed in synchronization with the processors. Once the code for starting a synchronized instruction stream arrives at a processor, that processor, and all the processors in the synchronized set, can only execute instructions in lock step with each other until such time as the end of synchronized code instruction is encountered.

Using this approach, no messages need be transferred between processors, and the processors will remain in step for one cycle, or a number of cycles, depending upon the instruction stream being executed. No external control, other than the instruction stream, is required to establish the synchronization relationships between processors.

Turning to FIG. 22, within each parallel processor 100-103, there is a sync register 2207 containing four bits labelled 3, 2, 1, 0 that relate to processors 103, 102, 101 and 100 respectively. One bit relates to each processor 100-103. The other processor(s) to which a particular processor will synchronize is indicated by writing a one to the bits corresponding to those processors. The other processor(s) which are expecting to be synchronized will similarly have set the appropriate bits in their sync register(s).

Code that is desired to be executed in synchronization is indicated by bounding it with LCK (Lock) and ULCK (Unlock) instructions. The instructions following the LCK, and those up to and including the ULCK, will be fetched in lock-step with the other parallel processor(s). (There must, therefore, be the same number of instructions between the LCK and ULCK instructions in each synchronized parallel processor).

It is more usually synchronized data transfer that is required rather than synchronized fetching of instructions. It is a consequence of the parallel processors, pipelines however that the transfer(s) coded in parallel with the LCK instruction and those up to and including the instruction immediately preceding the ULCK instruction, will be synchronous. They may not necessarily (due to memory access conflicts) occur in exactly the same machine cycle, but the transfers coded in the following instruction will not proceed until all the synchronized transfers of the previous instruction have occurred. The order of the load and store would otherwise be upset by memory access conflicts.

The knowledge that synchronized code is being executed is recorded by the S (synchronized) bit in each status register. (This bit is not actually set or reset until the master phase of the address pipeline stage of the LCK or ULCK instructions, respectively, but the effect of the LCK or ULCK instruction affects the fetch of the next instruction during the slave phase). This bit is cleared by reset and by interrupts once the status register has been pushed.

Continuing in FIG. 22, the four bits for each of the sync registers 2207 are set by software depending upon the desired synchronization between the various processors. Thus, assuming that processor 100 is to be synchronized with processor 103, then the bits shown would be loaded into the respective registers 2207. These bits would be 1, 0, 0, 1 showing that processor 3 is to be synchronized with processor 0. Also as shown, as processors 101 and 102 are to be synchronized, their respective sync control registers would each contain the bits 0, 1, 1, 0.

Turning now to processor 100, it should be noted that the presence of a 0 in any bit of sync register 2207 causes a logic one to appear on the output of the respective NAND gate. Thus, with the example shown, the NAND gates 2203 and 2204 would have logic ones on their respective output. These ones are supplied to the input of NAND gate 2206. NAND gate 2206 will not allow processor 100 to execute any more instructions of code until all of its respective inputs are one. Note that the presence of the zeros in the bit positions 1 and 2 of register 2207 causes the respective gates 2203 and 2204 to ignore the presence of any signals on leads 1 and 2 of bus 40. Thus, the execution of code was controlled by gate 2206, in this case in response to the information on leads 0 and 3 of bus 40. The lock instruction will cause the S bit to become set which is a logic 1 to one of the inputs to gate 2201. For the moment we will ignore the presence of the okay to sync signal which is a signal which controls the timing of the actual execute for the processor. The output of gate 2201 for each of the processors sync registers is connected to a different lead. Thus, gate 2201 from processor 100 is connected to lead 0, while gate 2201 from processor
101 is connected to lead 1, etc. Note that the output of gate 2201 from processor 100 is connected to the 0 input of gates 2205 of all of the other processor registers. Since in processor 101 and 102, gates 2205 are connected to logic zero, this has no effect. However, in processor 103 where gate 2205 is connected to a logic 1 of the register, it is thus controlled by the output on lead 0 of bus 40 which in fact is controlled by the output of gate 2201. Thus, processor 103 is controlled by the actions which occur within processor 100, which is exactly what we desire if processor 103 is to be synchronized with processor 100. A review of the circuitry would show that the same function operates in reverse from processor 103 to processor 100
since in processor 103 gate 2201 is associated with lead 3 of bus 40, which in turn is associated with gate 2202 of processor 100, which in turn is also controlled by a one in sync register 2207.

Now returning to the signal on gate 2201 which is the okay to sync signal. When that signal goes to logic 1, then it is okay to execute code, and all of the other processors having a one in the sync register bit 0 position of the respective register will operate in synchronization with that signal. Thus, if the okay to sync signal goes low signifying a problem with the cache memory or any other problem with the execution of code, all of the processors synchronized therewith will wait until the problem is clear. Thus, we have full synchronization between processors as controlled by the codes periodically stored in the respective registers. All of the processors can be synchronized or any combination of processors can be synchronized with each other, and there can be any number of different synchronizations occurring between processors.

Since it is the instruction fetch that is synchronized, it is possible to interrupt synchronized code. This will immediately cause the parallel processor's okay to sync signal to become inactive. Cache misses and contention will have a similar effect, keeping the machines in step. In the case of contention, however, the two instructions following the one experiencing contention will have already been fetched into the pipeline before the pipeline pauses.

It is possible to put idle instructions into synchronized code, thus pausing the operation of all the synchronized parallel processors until a particular parallel processor has been interrupted and returned from its interrupt routine.

Since it is necessary to be able to interrupt synchronized code, any instruction that specified the program counter PC in any one processor as a destination will immediately disable the effect of the S bit of the status register (with the same timing as the ULCK instruction), but the S bit will remain set. Once the two delay slot instructions have completed, the effect of the S bit is re-enabled. This mechanism prevents problems with being unable to interrupt synchronized delay slot instructions. The sync logic therefore treats branches, calls and returns (implemented as a PC load followed by two delay slot instructions) as a single instruction. The sync signal will be driven inactive during the two delay slot instructions and they will be fetched without looking at the sync signals. If a LCK instruction is put in a delay slot, it will take effect after the delay slot instructions have been executed. Synchronized loops behave like normal code because their branches operate in the fetch pipeline stage and not the execute stage.

An example of how synchronization works is given in FIG. 23. In this case, parallel processor 2 and parallel 1 exchange the contents of their data D0 registers (FIG. 33), assuming that A0 and A1 contain the same addresses in each parallel processor. It also assumes that A0 and Al point to different RAMs to avoid contention. (It would still work if they pointed to the same RAM, but would take extra cycles).

In this example parallel processor 1 arrives at its LCK instruction one cycle after parallel processor 2 arrives at its LCK instruction. Parallel processor 2 has thus waited one cycle. They then perform the stores simultaneously but parallel processor 2 then has a cache miss when fetching the load instruction. Both parallel processors wait until the cache miss has been serviced by the transfer processor. They then execute the loads simultaneously and similarly the ULCKs. Parallel processor 1 then experiences a cache miss when fetching instruction 4, but since the parallel processors are now unlocked, parallel processor 2 carries on unimpeded.

Synchronization in SIMD is implicit, so the LCK and ULCK instructions have no purpose and so will have no effect if coded. The S bit in the status register will have no effect if anyone should set it to one.

The instructions shown in the appendix (LCK) is used to begin a piece of MIMD synchronized parallel processor code. It will cause the parallel processor to wait until all the parallel processors indicated by ones in the sync register are in sync with each other. The following instructions will then be fetched in step with the other MIMD parallel processors. Execution of the address and execute pipeline stages will occur as each successive instruction is synchronously fetched. The S bit of the status register is set during the address pipeline stage of this instruction.

The instruction shown in the appendix (ULCK) unlocks the MIMD parallel processors from each other. They then resume independent instruction execution on the next instruction fetch.

Sliced Addressing

Sliced addressing is a technique for taking adjacent information from one memory space and distributing it in a manner to a number of separate different memory spaces so that the information when it has been distributed can be accessed simultaneously by a number of processors without contention.

As an example, reference is made to FIG. 24 where there is shown an external image memory buffer 15 having a row of adjacent pixels numbered 0-127, and this row has the letter "a" referencing it. This information is transferred, using the sliced addressing technique, via bus 2401, into memory subsystem 10 whereby the first sixteen pixels (0-15) are placed into the first memory 10-0 referred to by address 0-15. Then the next sixteen pixels are placed into memory 10-1. In this example this process is continued through eight memories such that pixels 112-127 are placed into final memory 10-7. The sliced addressing logic 2401 is implemented in the transfer processor and also in the crossbar address units of the parallel processors which will be described hereinafter.

The prior art means of address calculation would produce in the given example 128 consecutive addresses. This would mean that the data would be placed within one memory. In the given example the data would appear at consecutive addresses within memory 10-0. This would not allow a number of processors simultaneous access to that information without contention since they would all be trying to access the same memory. Thus, in the prior art, pixels 0-15 would be in row A of memory 0 with bits
16-31 in row B and bits 32-47 in row C, etc., until all of the 127 adjacent pixels would be in various rows of memory 0. Since the various different processors are working in parallel to process information, they could all contend for access to memory 0
to various pixel bytes, and accordingly time would be wasted, and the value of the parallel processing would be mitigated.

FIG. 25 shows a prior art adder which is used for controlling the location of the address for various bits. FIG. 25 shows three single bit adders 2501, 2502, 2503, which are part of a full adder having a number of single bits equal to the address range of the memory. These adders work such that one bit of the address is provided to each A input of the various adders 2501-2503. The least significant bit of the address would go to adder 2501, and the most significant bit would go to the highest single bit adder 2503.

The B input receives the binary representation of the amount to be indexed for the address for storage purposes. The combination of adders 2501-2503 will produce a resulting address which is used for accessing memory. Each individual adder will output a carry signal to the next highest numbered adder carry input signal. Each individual adder bit will take in the three inputs A, B and carry in, and if there are two or three ones present on any of those inputs, then the carry out from that cell will be a one. This is supplied to the next most significant carry in input of the adder. This process is repeated for each individual adder bit to produce a resultant address of the size required to access the memory space. The fact that each carry out connects directly to the next most significant carry in, means that the resultant address is always part of a contiguous address space. In the previous example, if an index of value one is supplied to the B inputs of the adder, then the resultant address output to memory will be one greater than the original address supplied on the A inputs.

With reference to FIG. 26, the modification to the previously described normal adder is made whereby the carry out of each cell is multiplexed with the carry in signal supplied to each cell, such that the signal that is passed to the next most significant carry in inputs of the adder can be selected to be either the carry out of the previous cell or the carry in for that previous cell. As an example, consider cell 2505. Its carry out signal is supplied to the multiplexer 2508, and the multiplexer's other input is the carry in signal to 2505. Signal B is used to control the multiplexer causing either the carry out or the carry in of cell 2505 to be passed on the carry in input of the next most significant cell.

Another modification to the standard cell is to include a control input labelled ADD which is supplied by the same control signal that controls the multiplexer signal B. If a logical one is supplied on signal B, then the carry in signal of 2505
is supplied to the carry in signal of the next most significant cell. The presence of a logical one on signal B also inhibits the add function of cell 2505 such that the original address supplied on input A is passed straight through to the output without modification. This has the effect of protecting the address bit associated with the presence of a one on input B. It can be seen that by supplying a number of ones to the control signals of the modified adder, the carry out of a cell from the least significant bit can be propagated a number of cells along the length of the adder before being supplied to the carry in of a cell which will perform the add function. This would be the next most significant cell which had a zero on the ADD control signal. The effect of this is to protect the address contained within the cells which have been bypassed so that a number of bits of the address range have been protected from modification. With reference to the previously described example, by supplying ones on the multiplexer and ADD control signals, an address increment from pixel 15 in memory 0 can be made to pixel 16 in memory 1 so that the memory can be addressed as one continuous address space. The multiplexer control signals are referred to as a sliced mask because they will mask out certain bits from the address range and cause the data which has been distributed in memory to be accessed as a slice indicated in FIG. 24.

It should be noted that this circuitry is used both for storing adjacent information or for retrieving adjacent information. Also, some information should be provided and stored in the same memory and should not be sliced, and this is denoted by providing all zeros to the ABC leads of the slice mask. When this occurs, the individual adders 2504-2506 act in the same manner as the prior art adders 2501-2503. It is also important to keep in mind that there are different types of distributed data that should be sliced across several memories and not just pixel information. This would occur anytime when it is conceivable that several processors would be accessing the same type of information at the same time for whatever processing would be occurring at that point.

It is also important to keep in mind that to distribute memory as disclosed in the sliced addressing mode does not in any way waste memory because the rows B and C which are not used for the particular pixel or other information to be stored would be used for other information. The only "penalty" that conceivably could occur is the additional chip space required to construct the multiplexers and the additional interconnections of the adders. This is a minor penalty to pay for the result of dramatically increased speed of access of memories for parallel processing while still allowing the flexibility of both distributing the adjacent information across many memories and allowing the information to be stored in a single memory under control of an external control. Using this approach, there is no fixed relationship for any particular piece of information so that at various times the information can be distributed across many memories or the same information at different times can be stored in the same memory depending upon the use of the information.

For example, if information which at one time is sliced because it is being used in a parallel processing mode is later determined to be used for a single processor for a single period of time, it would be advantageous to provide all zeros on the slice mask for that time period thereby storing the information in a single memory so that a single processor can then access the single memory, in this way again gaining valuable time over the slice method. This then gives a high degree of flexibility to the design of the system and to the operational mode for storing data.

Turning now to FIG. 27, an example of the way in which a typical quantity of pixels may be distributed over a number of memories is shown. In this example each individual memory is two kilobytes in size, and the start and end addresses of each of these memories are indicated. For example, memory 0 begins at address all zeroes and finishes as address 07FF. Memory 1 begins at 0800 and ends at 0FFF and so on through to memory 7 which begins at 3800 and ends at 3FFF. A quantity of pixels are shown distributed in a slice across these memories, 64 pixels per memory. Consider for a moment stepping through the 64 pixels within the slice of memory 3. We can see that the pixels are arranged from addresses 1900-193F. The next adjacent piece of information is not resident at the next address 1940 because the information was distributed over the memory system in a sliced manner. This means that the next piece of contiguous information is at address 2100 in memory 4. The prior art method of addition, as shown in FIG. 27, would add an index of one onto the address 193F to produce the address 1940. As previously mentioned, this is not the next piece of information required which is resident in the next memory at 2100. With reference to the bottom of the figure where the operation of addition using sliced arithmetic is shown, we can see that the value 193F is represented in binary form, and beneath that is the slice mask information similarly in binary form. As previously described, the presence of ones within the slice mask causes the carry out from an individual adder cell to be passed further along the carry path than the next most significant adjacent cell. In this example five adder cells are bypassed by the carry signal because there are five contiguous ones within the slice mask. Thus, when the index of one which is supplied to the B inputs of the modified adder is added to the value of 193F supplied to the A inputs of the modified adder, the carry out from the sixth least significant bit bypasses the seventh through eleventh significant bits and is passed into the carry in input of the twelfth least significant bit. This has the effect of incrementing those bits of the address including the twelfth and beyond significant bits which, because each memory is two kilobytes in size, has the effect of incrementing to the required address 2100 in the next memory.

Reconfigurable Memory

Before beginning a detailed description of how the MIMD/SIMD operational modes change the reconfigure of the memory, it would be good to review FIG. 4 with respect to the processors' memory and crossbar interconnections thereof. It will be recalled that in the MIMD mode the various processors each obtain their instructions from a separate memory. Thus, in the embodiment shown, processor 100 is connected over its instruction vertical through crosspoint 19-7 to instruction memory 10-1. Crosspoint 19-7 is normally closed except when the transfer processor is accessing the instruction memory in which case a signal is provided to crosspoint 19-7 to control the crosspoint and turn the crosspoint off.

In similar manner, processor 101 is connected via its instruction vertical and crosspoint 14-7 to instruction memory 10-5. Processor 102 is connected via its instruction vertical through crosspoint 9-7 to instruction memory 10-9 while processor
103 is connected via its instruction vertical through crosspoint 4-7 to instruction memory 10-13. This is the arrangement for the memory processor configuration when the system is in the MIMD operational mode.

When all or part of the system is switched to the SIMD operational mode, it is desired to connect memory 10-1 to two or more of the processors or to a group of processors depending upon whether both SIMD and MIMD are operating together or SIMD is operating on just a group of processors. In the embodiment shown we will assume that the SIMD operation is with respect to all four processors 100-103. In this case instruction memory 10-1 is connected to processor 100 via crosspoint 19-7 and three state buffer 403 is activated along with crosspoint 14-7 to connect memory 10-1 directly to the instruction vertical of processor 101. In similar manner three state buffers 402 and 401 are both operated to connect memory 10-1 to the respective instruction verticals of processors 102 and 103, via crosspoints 9-7 and 4-7, respectively.

At this point the system is constructed so that all of the processors 100-103 are operating from a single instruction stream provided from memory 10-1. Memories 10-5, 10-9 and 10-13, which were used for instructions in the MIMD mode, are now free to be used for other purposes. To increase memory capacity, at least on a temporary basis, these memories become available for access by all of the processors. The precise manner in which this is all accomplished will now be discussed.

Turning now to FIG. 28. Register 2820 contains the current operating mode of the system. This register contains bits which indicate whether the system is MIMD, SIMD, or some combination (hybrid) of SIMD and MIMD. From this register two signals are supplied, one indicating MIMD, the other SIMD. While the embodiment shows one pair of signals, in actual practice an individual pair of signals for each processor could be supplied. These signals are routed to the crosspoints and three state buffers to select the appropriate instruction streams for the appropriate configurations. In the MIMD configuration, processors 101, 102 and 103 are each executing their own instruction streams. These instruction streams are pointed to by program counters 2811, 2812 and 2813, respectively. These program counters are supplied to the cache logics 2801, 2802 and 2803, respectively. These have the effect of indicating if the instructions pointed to by the program counter are currently resident in the memory modules 10-5, 10-9 and 10-13, respectively. If the instructions indicated by the program counter are present, then the MIMD instruction address is output from the cache logic to the respective memory, and the appropriate instruction stream fetched back from that memory on the instruction vertical to the respective processor. If the instructions are not present within memory at this time, then the instruction execute will cease, and with reference to FIG. 4, crosspoints 13-0, 8-0 or 3-0
may be made to the transfer processors' bus. These are used by the respective processors for communicating the external address of the instructions required to be executed, and also the place within the instruction memory 10-5, 10-9 or 10-13, respectively, where the next sequence of instructions are to be stored. Once the transfer processor has fetched these instructions, an acknowledged signal is passed to the parallel processors from the transfer processor indicating that the code has now been fetched. The parallel processor can then perform instruction execution, again from the memory until such occasion as the instruction stream is found to be absent and the process is again repeated.

In the SIMD configuration because processors 101, 102 and 103 are executing from the same instruction stream, the cache logics 2801, 2802 and 2803 within the processors are disabled because they perform no function. The program counters 2811,
2812 and 2813 contents are irrelevant because they perform no purpose in fetching instructions because in the SIMD configuration all instructions are fetched by processor 100. In the SIMD configuration, therefore, it is desirable to use memories 10-5,
10-9 and 10-13 for storing data. In order to do this, crosspoints 14-1 through 14-6, 9-1 through 9-6 and 4-1 through 4-6 are enabled, thus allowing those memories to be accessed by the processors for data. This means that the memory utilization in the system is maintained at its optimum level for both SIMD and MIMD configurations.

Imaging Personal Computer

The imaging personal computer (PC) shown in FIGS. 46-52, can be constructed of three major elements, a camera sensing device 4600, shown in FIG. 46, an imaging processing device 4602 and a display device 4801 (FIG. 48). The imaging PC is not restricted to the use of a camera 4600 or a display 4803 and many forms of image input/output can be used.

Camera 4600 could be focused in front of display device 4803 of the PC and a hand 4603 can be used to input information by "signing" as typically done for deaf communication. The "signing" could be observed by the camera, and the screen could be used to display either the sign "two" or can be used to further process the information as discussed previously with respect to FIG. 11. The output bus from the PC could also contain the digital representation of the information being input via camera
4600, in this case the binary bits representing two. Thus, the user could utilize spreadsheets and other information obtaining information both from a keyboard or other traditional manner in ASCII (American standard code for information interchange) code as well as from a visual or video source such as

camera 4600 or video recorder device or any other type of video input using an imaging code input. The video input can be recorded on tape, on disc or on any other media and stored in the same manner as information is currently stored for presentation to a PC.

Some of the features that an imaging PC can have are 1) acquiring images from cameras, scanners and other sensors; 2) understanding the information or objects, in a document; 3) extracting pertinent information from a document or picture; 4) navigating through a data base combining images as well as textual documents; 5) providing advanced imaging interfaces, such as gesture recognition.

The PC can be used to create instant data bases since the information put into the system can be read and the informational content abstracted immediately without further processing by other systems. This creates a data base that can be accessed simply by a match of particular words, none of which had been identified prior to the storage. This can be extended beyond words to geometric shapes, pictures and can be useful in many applications. For example, a system could be designed to scan a catalog, or a newspaper, to find a particular object, such as all of the trees or all of the red cars or all trucks over a certain size on a highway. Conceptually then, a data base would be formed by words, objects, and shapes which the image processor would abstract and make useful to the user.

One use of such a PC with imaging capability is that both still and moving pictures and video can be integrated into a system or into any document, simply by having the picture scanned by the PC. The information then would be abstracted as discussed with respect to FIG. 11, and the output made available to the imaging PC for further processing under control of the user