United States Patent 4847755
Morrison, et al. July 11, 1989

Title

Parallel processing method and apparatus for increasing processing throughput by parallel processing low level instructions having natural concurrencies

Abstract

A computer processing system containing a plurality of processor elements operates on a statically compiled program which, based upon detected natural concurrencies in the basic blocks of the programs, includes intelligence regarding logical processor allocation and an instruction firing time in the instruction stream. Each processor element, in one embodiment, is context free and is capable of executing instructions on a per instruction basis so that dependent instructions can execute on the same or different processor elements. A processor element is capable of executing an instruction from one context followed by an instruction from another context through use of shared storage resources.


Inventors: Morrison; Gordon E. (Denver, CO), Brooks; Christopher B. (Boulder, CO), Gluck; Frederick G. (Boulder, CO)
Assignee: MCC Development, Ltd. (Boulder, CO)
Appl. No.: 794221
Filed: October 31, 1985

Current U.S. Class: 712/203; 712/25; 712/28
Field of Search: 364/2MSFile,9MSFile

U.S. Patent Documents
3343135   September 1967   Freiman et al.
3611306   October 1971     Reigel
3771141   November 1973    Culler
4104720   August 1978      Gruner
4109311   August 1978      Blum et al.
4153932   May 1979         Dennis et al.
4181936   January 1980     Kober
4228495   October 1980     Bernhard
4229790   October 1980     Gilliland et al.
4241398   December 1980    Caril
4270167   May 1981         Koehler et al.
4430707   February 1984    Kim
4435758   March 1984       Lorie et al.
4466061   August 1984      DeSantis
4468736   August 1984      DeSantis
4514807   April 1985       Nogi
4574348   March 1986       Scallon
Other References
Dennis, "Data Flow Supercomputers", Computer, Nov., 1980, pp. 48-56.
Hagiwara, H. et al., "A Dynamically Microprogrammable, Local Host Computer with Low-Level Parallelism", IEEE Transactions on Computers, C-29, No. 7, Jul., 1980, pp. 577-594.
Fisher et al., "Microcode Compaction: Looking Backward and Looking Forward", National Computer Conference, 1981, pp. 95-102.
Fisher et al., "Using an Oracle to Measure Potential Parallelism in Single Instruction Stream Programs", IEEE No. 0194-1895/0000/0171, 14th Annual Microprogramming Workshop, Sigmicro, Oct., 1981, pp. 171-182.
J. R. Vanaken et al., "The Expression Processor", IEEE Transactions on Computers, C-30, No. 8, Aug., 1981, pp. 525-536.
Bernhard, "Computing at the Speed Limit", IEEE Spectrum, Jul., 1982, pp. 26-31.
Davis, "Computer Architecture", IEEE Spectrum, Nov., 1983, pp. 94-99.
Hagiwara, H. et al., "A User-Microprogrammable Local Host Computer with Low-Level Parallelism", Article, Association for Computing Machinery, #0149-7111/83/0000/0151, 1983, pp. 151-157.
McDowell, Charles Edward, "SIMAC: A Multiple ALU Computer", Dissertation Thesis, University of California, San Diego, 1983, (111 pages).
McDowell, Charles E., "A Simple Architecture for Low Level Parallelism", Proceedings of 1983 International Conference on Parallel Processing, pp. 472-477.
Requa, et al., "The Piecewise Data Flow Architecture: Architectural Concepts", IEEE Transactions on Computers, vol. C-32, No. 5, May, 1983, pp. 425-438.
Fisher, A. T., "The VLIW Machine: A Multiprocessor for Compiling Scientific Code", Computer, 1984, pp. 45-52.
Fisher et al., "Measuring the Parallelism Available for Very Long Instruction Word Architectures", IEEE Transactions on Computers, vol. C-33, No. 11, Nov., 1984, pp. 968-976.
Primary Examiner: Chan; Eddie P.
Attorney, Agent or Firm: Hale and Dorr

Claims


We claim:
1. A parallel processor system for processing natural concurrencies in streams of low level instructions contained in a plurality of programs in said system, each of said streams having a plurality of single entry-single exit (SESE) basic blocks (BBs), said system comprising:
means (160) for statically adding intelligence to each instruction in each of said plurality of basic blocks for each said program, said added intelligence at least having a logical processor number (LPN) and an instruction firing time (IFT),
a plurality of context files (660), each of said context files being assigned to one of said plurality of programs for processing one of said programs, each of said context files having at least a plurality of registers and condition code storage for containing processing status information,
a plurality of logical resource drivers (LRDs), each logical resource driver being assigned to one of said plurality of context files, each of said logical resource drivers being receptive of said basic blocks corresponding to the program instruction stream of said assigned program from said adding means, each of said logical resource drivers comprising:
(a) a plurality of queues (1730), and
(b) means operative on said plurality of said basic blocks containing said intelligence from said adding means for delivering said instructions in each said basic block into said plurality of queues based on said logical processor number, said instructions in each said queue being entered according to said instruction firing time wherein the earliest instruction firing time is entered first,
a plurality of individual processor elements (PEs), each of said processor elements being free of any context information,
means (650) connecting said plurality of processor elements to said plurality of logical resource drivers for transferring said instruction with the earliest instruction firing time, first in said queues, from each of said logical resource drivers, in a predetermined order, to individually assigned processor elements, each said processor element having means for processing said transferred instruction,
first means (670) for connecting each of said processor elements with any one of said plurality of context files, each of said processor elements having means for accessing any of a plurality of registers and condition code storage in a program's context file during the processing of the program's instruction,
a plurality of memory locations (610), and
second means including said logical resource drivers for connecting each of said processor elements with any one of said plurality of memory locations, each said processor element having means for accessing said memory locations during said processing of each said instruction.
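
The queue behavior recited in elements (a) and (b) of claim 1 can be illustrated with a minimal sketch. It is not the patented hardware: the names Instruction, LogicalResourceDriver, deliver and transfer are hypothetical, and the model assumes one queue per logical processor number and at most one instruction transferred from each queue per firing time.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Instruction:
    op: str
    lpn: int  # logical processor number added statically
    ift: int  # instruction firing time added statically


class LogicalResourceDriver:
    """Hypothetical model: one FIFO queue per logical processor number."""

    def __init__(self, num_queues):
        self.queues = [deque() for _ in range(num_queues)]

    def deliver(self, basic_block):
        # Enter instructions so that the earliest firing time is first in
        # each queue; the queue itself is selected by the LPN.
        for ins in sorted(basic_block, key=lambda i: i.ift):
            self.queues[ins.lpn].append(ins)

    def transfer(self, now):
        # Hand the head of each queue to a processor element once its
        # firing time has been reached (queues visited in a fixed order).
        fired = []
        for queue in self.queues:
            if queue and queue[0].ift <= now:
                fired.append(queue.popleft())
        return fired


block = [
    Instruction("mul", lpn=1, ift=2),
    Instruction("load", lpn=0, ift=1),
    Instruction("add", lpn=0, ift=2),
    Instruction("load", lpn=1, ift=1),
]
lrd = LogicalResourceDriver(num_queues=2)
lrd.deliver(block)
first = lrd.transfer(now=1)   # both loads
second = lrd.transfer(now=2)  # add, then mul
```

In this sketch the order of the input block does not matter; only the statically added LPN and IFT fields determine where and when each instruction is released.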

2. A parallel processor system for processing natural concurrencies in streams of low level instructions contained in a plurality of programs in said system, each of said streams having a plurality of single entry-single exit (SESE) basic blocks (BBs), said system comprising:
means (160) for statically adding intelligence to each instruction in each of said plurality of basic blocks for each of said programs, said added intelligence representing at least a logical processor number (LPN) and an instruction firing time (IFT),
a plurality of context files (660), each of said context files being assigned to one of said plurality of programs for processing one of said programs, each of said context files having at least a plurality of registers and a condition code storage for containing processing status information,
a plurality of logical resource drivers (LRDs), each logical resource driver being assigned to one of said plurality of context files, each of said logical resource drivers being receptive of said basic blocks corresponding to the program instruction stream of said assigned program from said adding means, each of said logical resource drivers comprising:
(a) a plurality of queues (1730), and
(b) means operative on said plurality of said basic blocks containing said intelligence from said adding means for delivering said instructions in each said basic block into said plurality of queues based on said logical processor number, said instructions in each said queue being entered according to said instruction firing time wherein the earliest instruction firing time is entered first,
a plurality of context free individual processor elements (PEs),
means (650) connecting said plurality of processor elements to said plurality of logical resource drivers for transferring said instructions from each of said logical resource drivers, in a predetermined order, to individually assigned processor elements, each said processor element having means for processing said transferred instruction,
first means (670) for connecting each of said processor elements with any one of said plurality of context files, each of said processor elements having means for accessing any of a plurality of registers and condition code storages in a program's context file during said processing of the program's instruction,
a plurality of memory locations (610), and
second means including said logical resource drivers for connecting each of said processor elements with any one of said plurality of memory locations, each said processor element having means for accessing said memory locations during said processing of each said instruction.

3. A parallel processor system for processing natural concurrencies in streams of low level instructions contained in a plurality of programs in said system, each of said streams having a plurality of single entry-single exit (SESE) basic blocks (BBs), said system comprising:
means (160) for statically adding intelligence to each instruction in each of said plurality of basic blocks for each of said programs, said added intelligence representing at least a logical processor number (LPN) and an instruction firing time (IFT),
a plurality of context files (660), each of said context files being assigned to one of said plurality of programs for processing one of said programs, each of said context files having a plurality of registers and a condition code storage for containing processing status information,
a plurality of logical resource drivers (LRDs), each logical resource driver being assigned to one of said plurality of context files, each of said logical resource drivers being receptive of said basic blocks corresponding to the program instruction stream of said assigned program from said adding means for storing said instructions, and fetching, in order of said instruction firing time, said instructions in each basic block, and delivering said instructions according to the logical processor number for each instruction,
a plurality of individual context free processor elements (PEs),
means (650) connecting said plurality of processor elements to said plurality of logical resource drivers for transferring said instructions from each of said logical resource drivers, in a predetermined order, to individually assigned processor elements, each said processor element having means for processing said transferred instruction,
first means (670) for connecting each of said processor elements with any one of said plurality of context files, each of said processor elements having means for accessing any of a plurality of registers and condition code storage in a program's context file during said processing of the program's instruction,
a plurality of memory locations (610), and
second means including said logical resource drivers for connecting each of said processor elements with any one of said plurality of memory locations, each said processor element having means for accessing said memory locations during said processing of each said instruction.

4. The parallel processor system according to claims 1, 2, or 3 in which:
said adding means further has means for statically adding shared context storage mapping (S-SCSM) information to said instructions, said statically added shared context storage information containing level information for each said instruction in order to identify the different program levels contained within each said program,
said context files having a different set of registers for each said program level,
said logical resource drivers dynamically adding shared context storage mapping information to said instructions in response to said statically added information for identifying subroutine levels of the program,
said dynamically added shared context storage mapping information corresponding to said sets of registers, and
said processor elements further having means for processing its instructions using sets of registers identified by said dynamically added shared context storage mapping information.
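
One way to picture the level mapping of claim 4 is sketched below. The sketch assumes, purely for illustration, that the statically added information is a relative level (0 for the current procedure, -1 for its caller) and that the dynamically added information is the current subroutine depth; the claim itself does not fix either encoding. ContextFile, LevelTracker, write_reg and read_reg are hypothetical names.

```python
class ContextFile:
    """A different set of registers for each program level."""

    def __init__(self, num_levels, regs_per_level):
        self.register_sets = [[0] * regs_per_level for _ in range(num_levels)]


class LevelTracker:
    """Hypothetical stand-in for the dynamic mapping done by an LRD."""

    def __init__(self, num_levels):
        self.num_levels = num_levels
        self.depth = 0

    def enter_subroutine(self):
        self.depth += 1

    def leave_subroutine(self):
        self.depth -= 1

    def resolve(self, relative_level):
        # Combine the static relative level with the dynamic depth.
        level = self.depth + relative_level
        if not 0 <= level < self.num_levels:
            raise IndexError("program level out of range")
        return level


def write_reg(ctx, tracker, relative_level, reg, value):
    ctx.register_sets[tracker.resolve(relative_level)][reg] = value


def read_reg(ctx, tracker, relative_level, reg):
    return ctx.register_sets[tracker.resolve(relative_level)][reg]


ctx = ContextFile(num_levels=4, regs_per_level=8)
tracker = LevelTracker(num_levels=4)
write_reg(ctx, tracker, 0, 3, 111)         # caller writes its own R3
tracker.enter_subroutine()
write_reg(ctx, tracker, 0, 3, 222)         # callee R3 is a different register
caller_r3 = read_reg(ctx, tracker, -1, 3)  # callee reads the caller's R3
callee_r3 = read_reg(ctx, tracker, 0, 3)
tracker.leave_subroutine()
after_return_r3 = read_reg(ctx, tracker, 0, 3)
```

Because each level has its own register set, the callee's write to R3 leaves the caller's R3 untouched, and no registers need to be saved or restored around the call in this model.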

5. The parallel processor system according to claim 1, 2, or 3 in which:
each of said logical resource drivers further comprises means for dynamically adding shared context storage mapping (D-SCSM) information to each said instruction, said dynamically added shared context storage information containing the identity of the context file assigned to the program contained within each said logical resource driver,
each of said context files being assigned to one of said logical resource drivers, each said context file being identified by said dynamically added shared context storage mapping information, and
said processor elements further having means for processing each of its instructions in the context file identified by said dynamically added shared context storage mapping information.
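
The following minimal sketch illustrates the idea of claim 5 under stated assumptions: the dynamically added information is modeled as a plain integer context_id, and the two-operand add/mul instruction format is invented for the example. The processor element object holds no registers of its own, so it can execute an instruction from one context followed by an instruction from another.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class TaggedInstruction:
    op: str               # "add" or "mul" (illustrative opcodes)
    dst: int
    src1: int
    src2: int
    context_id: int = -1  # filled in dynamically by the LRD


class ContextFile:
    def __init__(self, num_regs):
        self.regs = [0] * num_regs


class LogicalResourceDriver:
    """Hypothetical LRD that knows which context file it is assigned to."""

    def __init__(self, context_id):
        self.context_id = context_id

    def tag(self, ins):
        # Dynamically add the identity of the assigned context file.
        return replace(ins, context_id=self.context_id)


class ProcessorElement:
    """Context free: all state lives in the context file named by the tag."""

    def execute(self, ins, context_files):
        regs = context_files[ins.context_id].regs
        a, b = regs[ins.src1], regs[ins.src2]
        regs[ins.dst] = a + b if ins.op == "add" else a * b


contexts = [ContextFile(4), ContextFile(4)]
contexts[0].regs[0:2] = [2, 3]
contexts[1].regs[0:2] = [10, 20]

add = TaggedInstruction("add", dst=2, src1=0, src2=1)
pe = ProcessorElement()
pe.execute(LogicalResourceDriver(0).tag(add), contexts)  # program 0
pe.execute(LogicalResourceDriver(1).tag(add), contexts)  # program 1
```

The same untagged instruction produces different results depending only on the tag, which is the property that lets a single processor element be shared among programs.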

6. A parallel processor system for processing natural concurrencies in streams of low level instructions contained in a plurality of programs in said system, each of said streams having a plurality of single entry-single exit (SESE) basic blocks (BBs), said system comprising:
means (160) for adding intelligence to each instruction in each of said plurality of basic blocks, said added intelligence representing at least a logical processor number (LPN) and an instruction firing time (IFT),
a plurality of context files (660), each of said context files being assigned to one of said plurality of programs, each of said context files having a plurality of register resources,
a plurality of logical resource drivers (LRDs), each logical resource driver being assigned to one of said plurality of context files, each of said logical resource drivers being receptive of said basic blocks corresponding to the instruction stream of said assigned program from said adding means for storing said instructions in each basic block according to the logical processor number for each instruction,
a plurality of individual context free processor elements (PEs),
means (650) connecting said plurality of processor elements to said plurality of logical resource drivers for transferring said instructions from each of said logical resource drivers, in a predetermined order according to the instruction firing time, to individually assigned processor elements, each said processor element having means for processing said transferred instruction,
first means (670) for connecting each of said processor elements with any one of said plurality of context files, each said processor element having means for accessing said resources in a program's context file during said processing of the program's instruction,
a plurality of memory locations (610), and
second means including said logical resource drivers for connecting each of said processor elements with any one of said plurality of memory locations, each said processor element having means for accessing said memory locations during said processing of each said instruction.

7. A parallel processor system for processing natural concurrencies in streams of low level instructions contained in a plurality of programs in said system, each of said streams having a plurality of single entry-single exit (SESE) basic blocks (BBs), said system comprising:
means (160) for adding intelligence to each instruction in each of said plurality of basic blocks,
a plurality of context files (660), each of said context files being assigned to at least one of said plurality of programs, each of said context files having a plurality of register resources,
a plurality of logical resource drivers (LRDs) with each logical resource driver being assigned to one of said plurality of context files, each of said logical resource drivers being receptive of said basic blocks corresponding to the program instruction stream of said at least one assigned program from said adding means for storing said instructions in each basic block,
a plurality of individual processor elements (PEs),
means (650) connecting said plurality of processor elements to said plurality of logical resource drivers for transferring said instructions from each of said logical resource drivers to individually assigned processor elements in accordance with said added intelligence added to the instructions,
first means (670) for connecting each of said processor elements to any one of said plurality of context files, each said processor element having means for accessing any resource in a program's context file during said processing of the program's instruction,
a plurality of memory locations (610), and
second means including said logical resource drivers for connecting each of said processor elements with any one of said plurality of memory locations, each said processor element having means for accessing said memory locations during said processing of each said instruction.

8. The parallel processor system according to claims 6 or 7 in which:
said adding means further has means for adding to each said instruction, information containing level information for each said instruction for identifying different program levels contained within each said program,
said context files have a different set of register resources for each said program level,
each said set of resources is identified by said added information, and
said processor elements further have means for processing each of its instructions in a set of register resources identified by said added information.

9. The parallel processor system according to claims 6 or 7 in which:
each of said logical resource drivers further comprises means for adding information to each said instruction, said added information containing the identity of the context file assigned to the programs contained within each said logical resource driver,
each of said context files is assigned to one of said logical resource drivers, each said context file being identified by said added information, and
said processor elements further have means for processing each of its instructions using the context file identified by said added information.

10. A parallel processor system for processing natural concurrencies in streams of low level instructions contained in a plurality of programs in said system, at least one of said programs having a plurality of different program levels, each of said streams having a plurality of single entry-single exit (SESE) basic blocks (BBs), said system comprising:
means for adding intelligence to each instruction in each of said plurality of basic blocks, said added intelligence containing program level information for each said instruction to identify the different program levels contained within each said program,
a plurality of context files (660), each of said context files being assigned to at least one of said plurality of programs, each of said context files having a plurality of register resources with a different set of register resources for each said program level, each said set of resources being identified by said added intelligence,
a plurality of logical resource drivers (LRDs), each logical resource driver being assigned to one of said plurality of context files, each of said logical resource drivers being receptive of said basic blocks corresponding to the program instruction stream of said at least one assigned program from said adding means for storing said instructions of each basic block of the assigned program instruction stream, each of said logical resource drivers further having means for adding information to each said instruction, said added information containing the identity of the context file assigned to the programs contained within each said logical resource driver,
a plurality of individual processor elements (PEs),
means (650) connecting said plurality of processor elements to said plurality of logical resource drivers for transferring said instructions from each of said logical resource drivers to individually assigned processor elements,
first means (670) for connecting each of said processor elements to any one of said plurality of context files, each said processor element having means for accessing a set of resources, as identified by said added intelligence, in a program's context file as identified by said added information, during said processing of the program's instruction,
a plurality of memory locations (610), and
second means including said logical resource drivers for connecting each of said processor elements with any one of said plurality of memory locations, each said processor element having means for accessing said memory locations during said processing of each said instruction.

11. A parallel processor system for processing natural concurrencies in streams of low level instructions contained in a program, each of said streams having a plurality of single entry-single exit (SESE) basic blocks (BBs), said system comprising:
means (160) for statically adding intelligence to each instruction in each of said plurality of basic blocks, said added intelligence representing at least a logical processor number (LPN) and an instruction firing time (IFT) for the instructions,
a logical resource driver (LRD) receptive of said basic blocks corresponding to the program instruction stream from said adding means for storing said instructions and said logical resource driver further having means for fetching, in order of said instruction firing time, said instructions in each basic block, and delivering said instructions according to the logical processor number for each instruction,
a plurality of individual context free processor elements (PEs),
means (650) connecting said plurality of processor elements with said logical resource driver for transferring said instructions from said logical resource driver to individually assigned processor elements, each said processor element having means for processing said transferred instruction,
a plurality of shared storage resources (660),
first means (670) for connecting each of said processor elements with any one of said plurality of resources, each of said processor elements having means for accessing any one of said resources during said instruction processing,
a plurality of memory locations (610), and
second means including said logical resource drivers for connecting each of said processor elements with any one of said plurality of memory locations, each said processor element having means for accessing said memory locations during said instruction processing.

12. A parallel processor system for processing natural concurrencies in streams of low level instructions contained in a program, each of said streams having a plurality of single entry-single exit (SESE) basic blocks (BBs), said system comprising:
means (160) for statically adding intelligence to each instruction in each of said plurality of basic blocks, said added intelligence representing at least a logical processor number (LPN) and an instruction firing time (IFT),
a logical resource driver (LRD) receptive of said basic blocks corresponding to the program instruction stream from said adding means for storing said instructions and said logical resource driver further having means for fetching, in order of said instruction firing time, said instructions in each basic block, and delivering said instructions according to the logical processor number for each instruction,
a plurality of individual processor elements (PEs),
first means (650) connecting said plurality of processor elements with said logical resource driver for transferring said instructions from said logical resource driver to individually assigned processor elements, each said processor element having means for processing said transferred instruction,
a plurality of shared storage resources (660), and
second means (670) for connecting each of said processor elements with any one of said plurality of shared storage resources, each of said processor elements having means for accessing any one of said shared storage resources during said instruction processing.

13. A parallel processor system for processing natural concurrencies in streams of low level instructions contained in a program, each of said streams having a plurality of single entry-single exit (SESE) basic blocks (BBs), said system comprising:
means (160) for statically adding intelligence to each instruction in each of said plurality of basic blocks,
a logical resource driver (LRD) receptive of said basic blocks corresponding to the program instruction stream from said adding means for storing said basic blocks,
a plurality of individual context free processor elements (PEs),
means (650) connecting said plurality of processor elements with said logical resource driver for transferring said instructions from said logical resource driver to individually assigned processor elements in accordance with the added intelligence in the instructions, each said processor element having means for processing said transferred instruction,
a plurality of shared storage resources (660), and
means (670) for connecting each of said processor elements with any one of said plurality of shared storage resources in accordance with the added intelligence in the instructions, each of said processor elements having means for accessing any one of said shared storage resources during said instruction processing.

14. The parallel processor system according to claims 11, 12, or 13 in which:
said adding means further has means for statically adding program level information to each said instruction to identify the different program levels contained within each said program,
said shared storage resources have a different set of resources for each said program level,
each said set of resources is identified by said statically added level information, and
each said processor element further has means for processing each of its instructions in a set of resources identified by said statically added level information.

15. A method for parallel processing in a plurality of processor elements natural concurrencies in streams of low level instructions contained in the programs of a plurality of users, each of the streams having a plurality of single entry-single exit (SESE) basic blocks (BBs), said method comprising the steps of:
statically adding intelligence representing the natural concurrencies existing within the instructions in each basic block of the programs, said step of adding for each program comprising the steps of:
(a) ascertaining the resource requirements of each instruction within each basic block to determine the natural concurrencies in each basic block,
(b) identifying logical resource dependencies between instructions,
(c) assigning condition code storage (CCs) to groups of resource dependent instructions, so that dependent instructions can execute on the same or different processor elements,
(d) determining the earliest possible instruction firing time (IFT) for each of said instructions in each of said plurality of basic blocks,
(e) adding said instruction firing times to each instruction in each of said plurality of basic blocks,
(f) assigning a logical processor number (LPN) to each instruction in each of said basic blocks,
(g) adding said logical processor numbers to each instruction in each of said basic blocks, and
(h) repeating steps (a) through (g) until all basic blocks are processed for each of said programs, and
processing the instructions having the statically added intelligence for the programs, the step of processing further comprising the steps of:
(i) delivering the instructions into logical resource drivers, each said program of a user being assigned to a different logical resource driver,
(j) selecting instructions from the logical resource drivers in a predetermined order based on the instruction firing time,
(k) storing the selected instructions in queues of the logical resource driver based on the logical processor number,
(l) generating dynamic shared context storage mapping (D-SCSM) information for each instruction,
(m) selectively connecting the queues of each logical resource driver to processor elements (PEs) said queues being connected in a predetermined order so that one instruction having the earliest instruction firing time from each queue is first delivered to a given processor element,
(n) processing said one instruction from each queue in each said connected processor element,
(o) obtaining the input data for processing said delivered instruction from shared storage locations identified by said instruction in a context file identified by said dynamic shared context storage mapping information,
(p) storing the results of said processing of said delivered instruction in shared storage identified by said dynamic information contained in said instruction, and
(q) repeating steps (i) through (p) until all instructions in each of said plurality of basic blocks for all said programs are processed.
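
Steps (a) through (g) of claim 15 can be sketched for a single basic block as follows. This is a simplified illustration, not the compiler of the patent: it assumes unit latency, that each register is written at most once within the block (so only read-after-write dependencies matter), and that logical processor numbers are handed out in increasing order within each firing time. Condition code assignment (step (c)) is omitted. Instr and add_intelligence are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Instr:
    name: str
    dst: str          # register written
    srcs: tuple = ()  # registers read
    ift: int = 0      # instruction firing time, filled in below
    lpn: int = -1     # logical processor number, filled in below


def add_intelligence(basic_block):
    ready_at = {}  # register -> first firing time at which it can be read
    used = {}      # firing time -> number of LPNs already assigned
    for ins in basic_block:
        # (a), (b): the registers read and written define the dependencies.
        # (d), (e): earliest firing time is when every source is available;
        # registers not produced in this block are available at time 1.
        ins.ift = max((ready_at.get(s, 1) for s in ins.srcs), default=1)
        ready_at[ins.dst] = ins.ift + 1
        # (f), (g): instructions sharing a firing time get distinct LPNs.
        ins.lpn = used.get(ins.ift, 0)
        used[ins.ift] = ins.lpn + 1
    return basic_block


block = add_intelligence([
    Instr("I0", "r1"),                # r1 = load a
    Instr("I1", "r2"),                # r2 = load b
    Instr("I2", "r3", ("r1", "r2")),  # r3 = r1 + r2
    Instr("I3", "r4"),                # r4 = load c
    Instr("I4", "r5", ("r3", "r4")),  # r5 = r3 * r4
])
schedule = [(i.name, i.ift, i.lpn) for i in block]
```

Here the three loads are independent and share firing time 1 on three logical processors, while the add and the multiply are pushed to firing times 2 and 3 by their dependencies.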

16. The method of claim 15 wherein said step of statically adding intelligence further comprises the step of re-ordering said instructions in each of said basic blocks based upon said instruction firing times wherein the earliest firing times are listed first.

17. The method of claim 15 wherein said step of statically adding intelligence further comprises the step of adding static shared context storage mapping information (S-SCSM) to each instruction to identify the relative program levels associated with said instructions and wherein said step of obtaining input further comprises the step of obtaining said input data from a shared stored location procedural level identified at least in part by the aforesaid dynamically added information.

18. The method of claim 15 wherein said step of statically adding intelligence further comprises the step of adding static shared context storage mapping (S-SCSM) information to each instruction to identify the program level of said instruction.

19. A method for parallel processing natural concurrencies in streams of low level instructions contained in the programs of a plurality of users located in a system having a plurality of processor elements and shared storage locations, each of the streams having a plurality of single entry-single exit (SESE) basic blocks (BBs), said method comprising the steps of:
statically adding intelligence to the instructions in each basic block of the programs, said added intelligence identifying the natural concurrencies within each basic block, said added intelligence having at least an instruction firing time (IFT) and a logical processor number (LPN), and
processing the instructions having the statically added intelligence for executing the programs, the step of processing further comprising the steps of:
(a) delivering the instructions into the system, each said user being assigned to a different context file in said system,
(b) dynamically generating shared context storage mapping (D-SCSM) information for each instruction identifying said context file containing shared storage locations,
(c) separately storing the delivered instructions in the system based on the logical processor number (LPN),
(d) selectively connecting the separately stored instructions to the processor elements assigned to the logical processor number for the instructions, said separately stored instructions being delivered in a predetermined order so that one instruction having the earliest instruction firing time from each of the separately stored instructions is delivered to a given processor element,
(e) processing said one instruction from each connected separately stored instructions in each said connected processor element,
(f) obtaining the input data for processing said connected instruction from a shared storage location identified at least in part by said shared context storage mapping information,
(g) storing the results of said processing of said connected instruction in a shared storage location identified in part by said shared context storage mapping information, and
(h) repeating steps (a) through (g) until all instructions of each of said plurality of basic blocks for all of said programs are processed.

20. A method for parallel processing, in a system, natural concurrencies in streams of low level instructions contained in a program, each of the streams having a plurality of single entry-single exit (SESE) basic blocks (BBs), said system having a plurality of processor elements and a plurality of shared storage locations, said method comprising the steps of:
statically adding intelligence representing the natural concurrencies existing within the instructions in each basic block, said step of adding comprising the steps of:
(a) ascertaining the resource requirements of each instruction within each basic block to determine the natural concurrencies in each basic block,
(b) identifying logical resource dependencies between instructions,
(c) assigning condition code storage (CCs) to groups of resource dependent instructions, such that dependent instructions can execute on the same or different processor elements,
(d) determining the earliest possible instruction firing time (IFT) for each of said instructions in each of said plurality of basic blocks,
(e) adding said instruction firing times to each instruction in each of said plurality of basic blocks,
(f) assigning a logical processor number (LPN) to each instruction in each of said basic blocks,
(g) adding said logical processor numbers to each instruction in each of said basic blocks, and
(h) repeating steps (a) through (g) until all basic blocks are processed for said program, and
processing the instructions having the statically added intelligence in the system, the step of processing further comprising the steps of:
(i) separately storing instructions delivered in the system based on the logical processor number,
(j) selectively connecting the separately stored instructions to the processor element (PE) assigned to the logical processor number, said separately stored instructions being delivered in a predetermined order so that one instruction having the earliest instruction firing time is connected to a given processor element,
(k) processing said one instruction from each of the separately stored instructions in each said connected processor element, and
(l) repeating steps (i) through (k) until all instructions in each of said plurality of basic blocks for said program are processed.
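
As an illustration only, the static tagging steps (d) through (g) of claim 20 might be sketched as follows. The sketch assumes unit-latency instructions, register-only operands, and firing times counted from zero; the names `Instr` and `tag_basic_block` are hypothetical and do not appear in the specification, and the condition code assignment of step (c) is omitted.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Instr:
    op: str
    dest: Optional[str]
    srcs: Tuple[str, ...]
    ift: int = 0   # instruction firing time, filled in by the tagger
    lpn: int = 0   # logical processor number, filled in by the tagger


def tag_basic_block(block: List[Instr]) -> List[Instr]:
    """Add an earliest IFT and an LPN to every instruction of one basic block.

    Assumption: every instruction takes one time unit, reads its sources
    at the start of that unit and writes its result at the end of it.
    """
    free_at: Dict[str, int] = {}    # register -> first time its latest value can be read
    last_read: Dict[str, int] = {}  # register -> latest IFT at which it is read
    used: Dict[int, int] = {}       # IFT -> number of LPNs already handed out

    for ins in block:
        t = 0
        for r in ins.srcs:
            t = max(t, free_at.get(r, 0))            # wait for the producer of each source
        if ins.dest is not None:
            t = max(t, free_at.get(ins.dest, 0))     # fire after an earlier writer of dest
            t = max(t, last_read.get(ins.dest, 0))   # do not overwrite before earlier readers fire
        ins.ift = t
        ins.lpn = used.get(t, 0)                     # next unused logical processor at time t
        used[t] = ins.lpn + 1
        for r in ins.srcs:
            last_read[r] = max(last_read.get(r, 0), t)
        if ins.dest is not None:
            free_at[ins.dest] = t + 1
    return block


demo = tag_basic_block([
    Instr("add", "r1", ("a", "b")),
    Instr("add", "r2", ("c", "d")),
    Instr("mul", "r3", ("r1", "r2")),
    Instr("sub", "r4", ("r3", "a")),
    Instr("add", "r5", ("c", "c")),
])
for ins in demo:
    print(ins.op, ins.dest, "IFT", ins.ift, "LPN", ins.lpn)
```

In this sketch the three independent additions share firing time 0 on three different logical processors, while the multiply and subtract are pushed to later firing times by their data dependencies.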

21. A method for parallel processing, in a system, natural concurrencies in streams of low level instructions contained in a program, each of the streams having a plurality of single entry-single exit (SESE) basic blocks (BBs), said system having a plurality of processor elements and a plurality of shared storage locations, said method comprising the steps of:
statically adding intelligence representing the natural concurrencies existing within the instructions in each basic block, said step of adding comprising the steps of:
(a) determining the earliest possible instruction firing time (IFT) for each of said instructions in each of said plurality of basic blocks,
(b) adding said instruction firing times to each instruction in each of said plurality of basic blocks,
(c) assigning a logical processor number (LPN) to each instruction in each of said basic blocks, and
(d) adding said logical processor numbers to each instruction in each of said basic blocks, and
processing the instructions having the statically added intelligence, the step of processing further comprising the steps of:
(e) separately storing instructions delivered in the system based on the logical processor number,
(f) selectively connecting said separately stored instructions to processor elements (PEs) assigned to a given logical processor number, said instructions connected in a predetermined order so that one instruction having the earliest instruction firing time from each of said separately stored instructions is connected to a given processor element,
(g) processing said one connected instruction in each said connected processor element,
(h) obtaining the input data for processing said connected instruction from a shared storage location identified by said instruction,
(i) storing the results of said processing of said connected instruction in a shared storage location identified by said instruction, and
(j) repeating steps (e) through (i) until all instructions are processed in each of said plurality of basic blocks for said program.
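
The processing steps (e) through (j) of claim 21 can be pictured with a small simulation. This is a sketch under stated assumptions, not the claimed apparatus: one queue per logical processor number stands in for the separate storage, a dictionary stands in for the shared storage locations, and the tuple layout and the names `OPS` and `run` are hypothetical.

```python
import heapq
import operator
from collections import defaultdict

# Hypothetical two-operand operations used only for this illustration.
OPS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}


def run(tagged, registers):
    """Execute instructions tagged as (ift, lpn, op, dest, src_a, src_b).

    `registers` plays the role of the shared storage locations; the
    processor elements themselves keep no state between instructions.
    """
    # (e) separately store the instructions by logical processor number.
    queues = defaultdict(list)
    for seq, (ift, lpn, op, dest, a, b) in enumerate(tagged):
        heapq.heappush(queues[lpn], (ift, seq, op, dest, a, b))

    time = 0
    while any(queues.values()):
        results = []
        for lpn in sorted(queues):
            q = queues[lpn]
            # (f) the earliest firing time in each queue is delivered first.
            if q and q[0][0] <= time:
                _, _, op, dest, a, b = heapq.heappop(q)
                # (g), (h) process using inputs read from shared storage.
                results.append((dest, OPS[op](registers[a], registers[b])))
        # (i) store results only after every read of this time step.
        for dest, value in results:
            registers[dest] = value
        time += 1  # (j) repeat until all queues are empty
    return registers


program = [
    (0, 0, "add", "r1", "a", "b"),
    (0, 1, "add", "r2", "c", "d"),
    (1, 0, "mul", "r3", "r1", "r2"),
    (2, 0, "sub", "r4", "r3", "a"),
    (0, 2, "add", "r5", "c", "c"),
]
print(run(program, {"a": 2, "b": 3, "c": 4, "d": 5}))
```

Committing results at the end of each time step is a design choice of this sketch; it matches the assumption that an instruction reads its operands before any instruction firing at the same time writes them.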

22. A method for parallel processing natural concurrencies in a program in a system having a plurality of processor elements (PEs), said processor element having access to input data located in a plurality of shared resource locations, said program having a plurality of single entry-single exit (SESE) basic blocks (BBs) with each of said basic blocks (BBs) having a stream of instructions, said method comprising the steps of:
ascertaining the resource requirements of each instruction within each of said basic blocks,
identifying logical resource dependencies between instructions,
assigning condition code storage (CCs) to groups of resource dependent instructions, such that dependent instructions can execute on the same or different processor elements,
determining the earliest possible instruction firing time (IFT) for each of the instructions in said plurality of basic blocks,
adding said instruction firing times (IFTs) to each instruction in each of said plurality of basic blocks in response to said determination,
assigning a logical processor number (LPN) to each instruction in each of said basic blocks,
adding said assigned logical processor number (LPN) to each instruction in each of said plurality of basic blocks in response to said assignment,
separately storing the instructions, with said added instruction firing time and said added logical processor numbers, based on the logical processor number, each group of said separately stored instructions containing instructions having only the same logical processor number,
selectively connecting said separately stored instructions to said processor elements based on the logical processor number, and
each said processor element receiving the instruction in said connected group having the earliest instruction firing time first, said processor element having means for performing the steps of:
(a) obtaining input data for processing said received instruction from a shared storage location in said plurality of shared resource locations identified by said instruction,
(b) storing the results based upon the aforesaid step of processing in a shared storage location in said plurality of shared resource locations identified by said received instruction, and
(c) repeating the aforesaid steps (a) and (b) for the next received instruction until all instructions are processed.

23. The method of claim 22 further comprising the step of forming execution sets of basic blocks in response to said steps of adding said instruction firing times and logical processor numbers wherein branches from any given basic block within a given execution set to a basic block in another execution set are statistically minimized.

24. The method of claim 22 further comprising the step of:
adding shared context storage mapping information to each instruction, and
wherein said step of processing comprises the step of processing each instruction requiring at least one set of shared resources, said at least one set being identified by said shared context storage mapping information, so each program routine can access, in addition to the routine's set of procedural level resources, at least one other set of resources.

25. A method for parallel processing instructions of a program having natural concurrencies with a plurality of processor elements (PEs), said processor elements having access to input data located in a plurality of shared resource locations, said program having a plurality of single entry-single exit (SESE) basic blocks (BBs) with each of said basic blocks (BBs) having a stream of instructions, said method comprising the steps of:
ascertaining the resource requirements of each instruction within each of said basic blocks,
identifying logical resource dependencies between instructions,
assigning condition code storage (CCs) to groups of resource dependent instructions, such that dependent instructions can execute on the same or different processor elements,
determining the earliest possible instruction firing time (IFT) for each of the instructions in each of said plurality of basic blocks,
adding said instruction firing times to each instruction in each of said plurality of basic blocks in response to said determination,
assigning a logical processor number (LPN) to each instruction in each of said basic blocks,
adding said assigned logical processor number to each instruction in each of said plurality of basic blocks in response to said assignment,
forming execution sets (ESs) of basic blocks in response to said steps of adding said instruction firing times and logical processor numbers,
(i) separately storing the instructions contained within a given formed execution set based on the logical processor number, each group of said separately stored instructions containing instructions having only the same logical processor number,
(ii) selectively connecting said separately stored instructions to said processor elements based on the logical processor number,
(iii) each said processor element receiving the instruction having the earliest instruction firing time (IFT) first, said processor element having means for performing the steps of:
(a) obtaining input data for processing said received instruction from a shared storage location in said plurality of shared resource locations as identified by said instruction,
(b) storing the results based upon the aforesaid step of processing in a shared storage location in said plurality of shared resource locations as identified by said instruction,
(c) repeating the aforesaid steps (a) and (b) for the next received instruction until all instructions are processed, and
(iv) repeating the aforesaid steps (i) through (iii) for all execution sets that are processed.

26. A system for parallel processing natural concurrencies in a program, said program having a plurality of single entry-single exit (SESE) basic blocks (BBs) wherein each of said basic blocks contains a stream of instructions, said system comprising:
means (160) receptive of said plurality of basic blocks for determining said natural concurrencies within said instruction stream for each of said basic blocks, said determining means further having means for adding timing and processor information to each instruction in response to said determined natural concurrencies so that all processing resources required by any given instruction are allocated in advance of program execution,
means (620) receiving said basic blocks (BBs) of instructions having said added timing and processor information for storing said received instructions,
a plurality of processor elements (PEs),
means (650) for selectively connecting said plurality of processor elements to said storing means,
a plurality of shared resources (660),
means (670) for selectively interconnecting said plurality of processor elements (PEs) with said plurality of shared resources (660),
said storing means having means for delivering instructions in order of the earliest firing time first, based upon said firing timing information, to the processor elements over said connecting means and into said processor elements, and
said processor elements having means for processing each received instruction from said storing means (620), said processor elements being connected to the shared resources identified by each said instruction and wherein all resource information and context information pertaining to said instruction are each stored in one of said plurality of shared resources and said storing means (620).

27. The parallel processor system of claim 26 in which:
said determining means further has means for adding program level information to said instructions, said information containing relative level information for each said instruction in order to identify the different program levels associated with registers used within said program,
said shared resources having a different set of registers associated with each program level, and
said processor elements further having means for processing each of its received instructions using at least one set of registers identified by said received instructions.

28. The system of claim 26 wherein said determining means further comprises means for forming execution sets (ESs) from the basic blocks containing said added timing and processor information wherein branches from any given basic block within a given execution set out of said given execution set to a basic block in another execution set are statistically minimized.

29. The system of claim 28 wherein said determining means further has means for attaching header information to each formed execution set, said header at least comprising:
(a) the address of the start of said instructions, and
(b) the length of the execution set.
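
For illustration, a header carrying the two fields recited in claim 29 could be laid out as below. The field widths, byte order, and the convention that the start address is an offset from the beginning of the execution set are all assumptions of this sketch; `HEADER`, `pack_execution_set`, and `unpack_header` are hypothetical names.

```python
import struct

# Assumed layout: two unsigned 32-bit big-endian fields,
# (a) offset of the first instruction and (b) total length in bytes.
HEADER = struct.Struct(">II")


def pack_execution_set(instr_bytes: bytes) -> bytes:
    """Prefix already-encoded instructions with an execution set header."""
    start = HEADER.size                      # instructions follow the header directly
    length = HEADER.size + len(instr_bytes)  # header plus instruction bytes
    return HEADER.pack(start, length) + instr_bytes


def unpack_header(execution_set: bytes):
    """Return (start, length, instruction bytes) for one execution set."""
    start, length = HEADER.unpack_from(execution_set, 0)
    return start, length, execution_set[start:length]


es = pack_execution_set(b"\x01\x02\x03")
print(unpack_header(es))
```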

30. The system of claim 26 wherein said storing means further comprises:
a plurality of caches (1522) receptive of said execution sets for storing said instructions,
means (1544, 1560, 1570, 650) connected to said caches for delivering said instructions stored in each of said caches to said plurality of processor elements, and
means (1512, 1518, 1548) connected to said caches and said delivering means for controlling said storing and said delivery of instructions, said controlling means further having means for executing the branches from individual basic blocks.

31. The system of claim 26 wherein said determining means further comprises means for adding level information to instructions pertaining to the different program levels contained within said program, and wherein each said processor element has means for processing each of its received instructions using a set of shared resources identified by each said instruction's level information.

32. The system of claim 26 wherein each of said plurality of processor elements is context free in processing a given instruction.

33. The system of claim 32 wherein said plurality of shared resources comprises:
a plurality of register files, and
a plurality of condition code files, said plurality of register files and said plurality of condition code files together with said storing means (620) having means for storing all necessary context data for processing any given instruction.

34. A system for parallel processing natural concurrencies in a program, said program having a plurality of single entry-single exit (SESE) basic blocks (BBs) wherein each of said basic blocks (BBs) contains a stream of instructions, said system comprising:
means (160) receptive of said plurality of basic blocks for determining said natural concurrencies within said instruction stream for each of said basic blocks, said determining means further having means for adding timing, processor, and resource access information to each instruction in response to said determined natural concurrencies so that all processing resources required by any given instruction are allocated in advance of instruction execution, said determining means further having means for forming basic blocks into execution sets (ESs) containing said added instruction firing times and logical processor numbers, wherein branches from any given basic block within a given execution set out of said given execution set to a basic block in another execution set are statistically minimized,
means (620) receptive of said execution sets having said added information for storing said instructions,
a plurality of context-free processor elements connected to said storing means, and
a plurality of shared resources connected to said plurality of context-free processor elements, said processor elements having means for processing its instructions based upon said processor number, timing, and resource access information from said storing means, said processor elements having means for obtaining all necessary context data from said plurality of shared resources based upon location data set forth in said instructions for processing said instruction and having means for delivering all necessary context data to said plurality of shared resources at locations based upon and identified in said instructions.

35. A parallel processor system for processing streams of low level instructions contained in a plurality of programs in said system, each of said streams having a plurality of single entry-single exit basic blocks, said system comprising:
means (160) for adding intelligence to the instruction streams in response to detected natural concurrencies therein,
a plurality of context files, each of said plurality of programs being assigned to one of said context files, each of said context files having a plurality of register resources,
a plurality of logical resource drivers, each logical resource driver being assigned to one of said plurality of context files, each of said logical resource drivers being receptive of said basic blocks of at least one corresponding program instruction stream from said adding means for storing said instructions of each basic block,
a plurality of individual processor elements,
means for selectively connecting any of said plurality of processor elements to any of said plurality of logical resource drivers for transferring said instructions from each of said logical resource drivers to individually assigned processor elements in accordance with said added intelligence in the instructions,
first means for connecting each of said processor elements to any one of said plurality of context files, each said processor element having means for accessing any resource in a program's context file for reading and writing data during said processing of the program's instructions,
a plurality of memory locations (610), and
second means for connecting each of said logical resource drivers with any one of said plurality of memory locations, each said logical resource driver having means for accessing said memory locations during said processing of each said instruction.

36. A parallel processor system for processing streams of low level instructions contained in a plurality of programs, each of said streams having a plurality of single-entry-single exit basic blocks, said system comprising:
means for adding intelligence to said instruction streams, said added intelligence containing subroutine level information for said instructions to specify the relative subroutine register accesses contained within each said program,
a plurality of context files, each of said plurality of programs being assigned to one of said context files, each of said context files having a plurality of register resources with one set of register resources for each said subroutine program level, each said set of resources being identified with a different subroutine level,
a plurality of logical resource drivers, each logical resource driver being associated with one of said plurality of context files, each of said logical resource drivers being receptive of said basic blocks corresponding to the program instruction stream of said at least one assigned program from said adding means for storing said instructions of each basic block, each of said logical resource drivers further having means for adding information to each said instruction, said added information containing the identity of the context file assigned to the programs contained within each said logical resource driver and, in response to the added intelligence, the level of any subroutine access of the context files by the processor,
a plurality of individual processor elements,
means connecting said plurality of processor elements and said plurality of logical resource drivers for transferring said instructions from each of said logical resource drivers to individually assigned processor elements, and
first means (670) for connecting each of said processor elements with any one of said plurality of context files, each said processor element having means for accessing a set of resources, in a program's context file as identified by said added information, during said processing of the program's instruction in the logical resource drivers.

37. A multiprocessor system for processing a plurality of programs of different users, said system comprising:
a plurality of logical resource drivers, each said logical resource driver being operative on at least one of said programs for dynamically adding information during program execution to instructions of said one program, said information identifying at least the user context file for said one program,
a plurality of sets of shared resources, each of said sets being assigned to a given context file, and
a plurality of processor elements for processing said programs, said processor elements being connected to said plurality of sets of shared resources and connected to said logical resource drivers for receiving instructions in a determined order, each of said plurality of logical resource drivers having means for delivering an instruction to any processor element, each of said processor elements being selectively interconnected with that set of the shared resources identified by said user context information added to the instruction then being processed in order to access all data necessary for processing said instruction.
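
The arrangement of claim 37 can be illustrated with a minimal sketch in which a logical resource driver attaches a context file identifier to each instruction and a stateless processor element uses that identifier to select its registers. The sketch assumes a single processor element that alternates between users and simple two-operand instructions; `lrd_tag`, `pe_execute`, and `run_interleaved` are hypothetical names.

```python
import operator
from itertools import zip_longest

# Hypothetical two-operand operations used only for this illustration.
OPS = {"add": operator.add, "mul": operator.mul}


def lrd_tag(context_id, program):
    """Logical resource driver: add the user's context file id at run time."""
    return [(context_id,) + ins for ins in program]


def pe_execute(tagged_ins, context_files):
    """Context-free processor element: all state lives in the context file."""
    ctx, op, dest, a, b = tagged_ins
    regs = context_files[ctx]          # selected by the dynamically added id
    regs[dest] = OPS[op](regs[a], regs[b])


def run_interleaved(streams, context_files):
    """One processor element takes instructions from each user in turn."""
    for group in zip_longest(*streams):
        for ins in group:
            if ins is not None:
                pe_execute(ins, context_files)
    return context_files


program = [("add", "r2", "r0", "r1"), ("mul", "r3", "r2", "r2")]
files = {0: {"r0": 1, "r1": 2}, 1: {"r0": 10, "r1": 20}}
streams = [lrd_tag(0, program), lrd_tag(1, program)]
print(run_interleaved(streams, files))
```

Because `pe_execute` keeps nothing between calls, the same processor element can execute an instruction from one user immediately after an instruction from another, which is the property the claim relies on.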

Description

BACKGROUND OF THE INVENTION

1. Field of the Invention

This invention generally relates to parallel processor computer systems and, more particularly, to parallel processor computer systems having software for detecting natural concurrencies in instruction streams and having a plurality of processor elements for processing the detected natural concurrencies.

2. Description of the Prior Art

Almost all prior art computer systems are of the "Von Neumann" construction. In fact, the first four generations of computers are Von Neumann machines which use a single large processor to sequentially process data. In recent years, considerable effort has been directed towards the creation of a fifth generation computer which is not of the Von Neumann type. One characteristic of the so-called fifth generation computer relates to its ability to perform parallel computation through use of a number of processor elements. With the advent of very large scale integration (VLSI) technology, the use of a number of individual processor elements becomes cost effective.

Whether or not an actual fifth generation machine has yet been constructed is subject to debate, but various features have been defined and classified. Fifth-generation machines should be capable of using multiple-instruction, multiple-data (MIMD) streams rather than simply being a single instruction, multiple-data (SIMD) system typical of fourth generation machines. The present invention is of the fifth-generation non-Von Neumann type. It is capable of using MIMD streams in single context (SC-MIMD) or in multiple context (MC-MIMD) as those terms are defined below. The present invention also finds application in the entire computer classification of single and multiple context SIMD (SC-SIMD and MC-SIMD) machines as well as single and multiple context, single-instruction, single data (SC-SISD and MC-SISD) machines.

While the design of fifth-generation computer systems is fully in a state of flux, certain categories of systems have been defined. Some workers in the field base the type of computer upon the manner in which "control" or "synchronization" of the system is performed. The control classification includes control-driven, data-driven, and reduction (or demand) driven. The control-driven system utilizes a centralized control such as a program counter or a master processor to control processing by the slave processors. An example of a control-driven machine is the Non-Von-1 machine at Columbia University. In data-driven systems, control of the system results from the actual arrival of data required for processing. An example of a data-driven machine is the University of Manchester dataflow machine developed in England by Ian Watson. Reduction driven systems control processing when the processed activity demands results to occur. An example of a reduction processor is the MAGO reduction machine being developed at the University of North Carolina, Chapel Hill. The characteristics of the Non-Von-1 machine, the Manchester machine, and the MAGO reduction machine are carefully discussed in Davis, "Computer Architecture," IEEE Spectrum, November, 1983. In comparison, data-driven and demand-driven systems are decentralized approaches whereas control-driven systems represent a centralized approach. The present invention is more properly categorized in a fourth classification which could be termed "time-driven." Like data-driven and demand-driven systems, the control system of the present invention is decentralized. However, like the control-driven system, the present invention conducts processing when an activity is ready for execution.

Most computer systems involving parallel processing concepts have proliferated from a large number of different types of computer architectures. In such cases, the unique nature of the computer architecture mandates or requires either its own processing language or substantial modification of an existing language to be adapted for use. To take advantage of the highly parallel structure of such computer architectures, the programmer is required to have an intimate knowledge of the computer architecture in order to write the necessary software. As a result, preparing programs for these machines requires substantial amounts of the user's effort, money and time.

Concurrent with this activity, work has also been progressing on the creation of new software and languages, independent of a specific computer architecture, that will expose, in a more direct manner, the inherent parallelism of the computation process. However, most effort in designing supercomputers has been concentrated in developing new hardware with much less effort directed to developing new software.

Davis has speculated that the best approach to the design of a fifth-generation machine is to concentrate efforts on the mapping of the concurrent program tasks in the software onto the physical hardware resources of the computer architecture. Davis terms this approach one of "task-allocation" and touts it as being the ultimate key to successful fifth-generation architectures. He categorizes the allocation strategies into two generic types. "Static allocations" are performed once, prior to execution, whereas "dynamic allocations" are performed by the hardware whenever the program is executed or run. The present invention utilizes a static allocation strategy and provides task allocations for a given program after compilation and prior to execution. The recognition of the "task allocation" approach in the design of fifth generation machines was used by Davis in the design of his "Data-driven Machine-II" constructed at the University of Utah. In the Data-driven Machine-II, the program was compiled into a program graph that resembles the actual machine graph or architecture.

Task allocation is also referred to as "scheduling" in Gajski et al, "Essential Issues in Multi-processor Systems," Computer, June, 1985. Gajski et al set forth levels of scheduling to include high level, intermediate level, and low level scheduling. The present invention is one of low-level scheduling, but it does not use conventional scheduling policies of "first-in-first-out", "round-robin", "shortest type in job-first", or "shortest-remaining-time." Gajski et al also recognize the advantage of static scheduling in that overhead costs are paid at compile time. However, Gajski et al's recognized disadvantage, with respect to static scheduling, of possible inefficiencies in guessing the run time profile of each task is not found in the present invention. Therefore, the conventional approaches to low-level static scheduling found in the Occam language and the Bulldog compiler are not found in the software portion of the present invention. Indeed, the low-level static scheduling of the present invention provides the same type, if not better, utilization of the processors commonly seen in dynamic scheduling by the machine at run time. Furthermore, the low-level static scheduling of the present invention is performed automatically without intervention of programmers as required (for example) in the Occam language.

Davis further recognizes that communication is a critical feature in concurrent processing in that the actual physical topology of the system significantly influences the overall performance of the system.

For example, the fundamental problem found in most data-flow machines is the large amount of communication overhead in moving data between the processors. When data is moved over a bus, significant overhead, and possible degradation of the system, can result if data must contend for access to the bus. For example, the Arvind data-flow machine, referenced in Davis, utilizes an I-structure stream in order to allow the data to remain in one place which then becomes accessible by all processors. The present invention, in one aspect, teaches a method of hardware and software based upon totally coupling the hardware resources thereby significantly simplifying the communication problems inherent in systems that perform multiprocessing.

Another feature of non-Von Neumann type multiprocessor systems is the level of granularity of the parallelism being processed. Gajski et al term this "partitioning." The goal in designing a system, according to Gajski et al, is to obtain as much parallelism as possible with the lowest amount of overhead. The present invention performs concurrent processing at the lowest level available, the "per instruction" level. The present invention, in another aspect, teaches a method whereby this level of parallelism is obtainable without execution time overhead.

Despite all of the work that has been done with multiprocessor parallel machines, Davis (Id. at 99) recognizes that such software and/or hardware approaches are primarily designed for individual tasks and are not universally suitable for all types of tasks or programs as has been the hallmark with Von Neumann architectures. The present invention sets forth a computer system and method that is generally suitable for many different types of tasks since it operates on the natural concurrencies existent in the instruction stream at a very fine level of granularity.

All general purpose computer systems and many special purpose computer systems have operating systems or monitor/control programs which support the processing of multiple activities or programs. In some cases this processing occurs simultaneously; in other cases the processing alternates among the activities such that only one activity controls the processing resources at any one time. This latter case is often referred to as time sharing, time slicing, or concurrent (versus simultaneous) execution, depending on the particular computer system. Also depending on the specific system, these individual activities or programs are usually referred to as tasks, processes, or contexts. In all cases, there is a method to support the switching of control among these various programs and between the programs and the operating system, which is usually referred to as task switching, process switching, or context switching. Throughout this document, these terms are considered synonymous, and the terms context and context switching are generally used.

The present invention, therefore, pertains to a non-Von Neumann MIMD computer system capable of simultaneously operating upon many different and conventional programs by one or more different users. The natural concurrencies in each program are statically allocated, at a very fine level of granularity, and intelligence is added to the instruction stream at essentially the object code level. The added intelligence can include, for example, a logical processor number and an instruction firing time in order to provide the time-driven decentralized control for the present invention. The detection and low level scheduling of the natural concurrencies and the adding of the intelligence occurs only once for a given program, after conventional compiling of the program, without user intervention and prior to execution. The results of this static allocation are executed on a system containing a plurality of processor elements. In one embodiment of the invention, the processors are identical. The processor elements, in this illustrated embodiment, contain no execution state information from the execution of previous instructions, that is, they are context free. In addition, a plurality of context files, one for each user, are provided wherein the plurality of processor elements can access any storage resource contained in any context file through total coupling of the processor element to the shared resource during the processing of an instruction. In a preferred aspect of the present invention, no condition code or results registers are found on the individual processor elements.

SUMMARY OF INVENTION

The present invention provides a method and a system that is non-Von Neumann and one which is adaptable for use in single or multiple context SISD, SIMD, and MIMD configurations. The method and system are further operative upon a myriad of conventional programs without user intervention.

In one aspect, the present invention statically determines at a very fine level of granularity, the natural concurrencies in the basic blocks (BBs) of programs at essentially the object code level and adds intelligence to the instruction stream in each basic block to provide a time driven decentralized control. The detection and low level scheduling of the natural concurrencies and the addition of the intelligence occurs only once for a given program after conventional compiling and prior to execution. At this time, prior to program execution, the use during later execution of all instruction resources is assigned.

In another aspect, the present invention further executes the basic blocks containing the added intelligence on a system containing a plurality of processor elements each of which, in this particular embodiment, does not retain execution state information from prior operations. Hence, all processor elements in accordance with this embodiment of the invention are context free. Instructions are selected for execution based on the instruction firing time. Each processor element in this embodiment is capable of executing instructions on a per-instruction basis such that dependent instructions can execute on the same or different processor elements. A given processor element in the present invention is capable of executing an instruction from one context followed by an instruction from another context. All operating and context information necessary for processing a given instruction is then contained elsewhere in the system.

It should be noted that many alternative implementations of context free processor elements are possible. In a non-pipelined implementation each processor element is monolithic and executes a single instruction to its completion prior to accepting another instruction.

In another aspect of the invention, the context free processor is a pipelined processor element, in which each instruction requires several machine instruction clock cycles to complete. In general, during each clock cycle, a new instruction enters the pipeline and a completed instruction exits the pipeline, giving an effective instruction execution time of a single instruction clock cycle. However, it is also possible to microcode some instructions to perform complicated functions requiring many machine instruction cycles. In such cases the entry of new instructions is suspended until the complex instruction completes, after which the normal instruction entry and exit sequence in each clock cycle continues. Pipelining is a standard processor implementation technique and is discussed in more detail later.

The system and method of the present invention are described in the following drawing and specification.

DESCRIPTION OF THE DRAWING

Other objects, features, and advantages of the invention will appear from the following description taken together with the drawings in which:

FIG. 1 is the generalized flow representation of the TOLL software of the present invention;

FIG. 2 is a graphic representation of a sequential series of basic blocks found within the conventional compiler output;

FIG. 3 is a graphical presentation of the extended intelligence added to each basic block according to one embodiment of the present invention;

FIG. 4 is a graphical representation showing the details of the extended intelligence added to each instruction within a given basic block according to one embodiment of the present invention;

FIG. 5 is the breakdown of the basic blocks into discrete execution sets;

FIG. 6 is a block diagram presentation of the architectural structure of apparatus according to a preferred embodiment of the present invention;

FIGS. 7a-7c represent an illustration of the network interconnections during three successive instruction firing times;

FIGS. 8-11 are the flow diagrams setting forth features of the software according to one embodiment of the present invention;

FIG. 12 is a diagram describing one preferred form of the execution sets in the TOLL software;

FIG. 13 sets forth the register file organization according to a preferred embodiment of the present invention;

FIG. 14 illustrates a transfer between registers in different levels during a subroutine call;

FIG. 15 sets forth the structure of a logical resource driver (LRD) according to a preferred embodiment of the present invention;

FIG. 16 sets forth the structure of an instruction cache control and of the caches according to a preferred embodiment of the present invention;

FIG. 17 sets forth the structure of a PIQ buffer unit and a PIQ bus interface unit according to a preferred embodiment of the present invention;

FIG. 18 sets forth interconnection of processor elements through the PE-LRD network to a PIQ processor alignment circuit according to a preferred embodiment of the present invention;

FIG. 19 sets forth the structure of a branch execution unit according to a preferred embodiment of the present invention;

FIG. 20 illustrates the organization of the condition code storage of a context file according to a preferred embodiment of the present invention;

FIG. 21 sets forth the structure of one embodiment of a pipelined processor element according to the present invention; and

FIGS. 22(a) through 22(d) set forth the data structures used in connection with the processor element of FIG. 21.

GENERAL DESCRIPTION

1. Introduction

The following two sections give a general description of the software and hardware of the present invention. The system of the present invention is designed based upon a unique relationship between the hardware and software components. While many prior art approaches have primarily provided for multiprocessor parallel processing based upon a new architecture design or upon unique software algorithms, the present invention is based upon a unique hardware/software relationship. The software of the present invention provides the intelligent information for the routing and synchronization of the instruction streams through the hardware. In the performance of these tasks, the software spatially and temporally manages all user accessible resources, for example, general registers, condition code storage registers, memory and stack pointers. The routing and synchronization are performed without user intervention, and do not require changes to the original source code. Additionally, the analysis of an instruction stream to provide the additional intelligent information for controlling the routing and synchronization of the instruction stream is performed only once during the program preparation process (often called "static allocation") of a given piece of software, and is not performed during execution (often called "dynamic allocation") as is found in some conventional prior art approaches. The analysis effected according to the invention is hardware dependent, is performed on the object code output from conventional compilers, and advantageously, is therefore programming language independent.

In other words, the software, according to the invention, maps the object code program onto the hardware of the system so that it executes more efficiently than is typical of prior art systems. Thus the software must handle all hardware idiosyncrasies and their effects on execution of the program instructions stream. For example, the software must accommodate, when necessary, processor elements which are either monolithic single cycle or pipelined.

2. General Software Description

Referring to FIG. 1, the software of the present invention, generally termed "TOLL," is located in a computer processing system 160. Processing system 160 operates on a standard compiler output 100 which is typically object code or an intermediate object code such as "p-code." The output of a conventional compiler is a sequential stream of object code instructions hereinafter referred to as the instruction stream. Conventional language processors typically perform the following functions in generating the sequential instruction stream:

1. lexical scan of the input text,

2. syntactical scan of the condensed input text including symbol table construction,

3. performance of machine independent optimization including parallelism detection and vectorization, and

4. an intermediate (PSEUDO) code generation taking into account instruction functionality, resources required, and hardware structural properties.

In the creation of the sequential instruction stream, the conventional compiler creates a series of basic blocks (BBs) which are single entry single exit (SESE) groups of contiguous instructions. See, for example, Alfred V. Aho and Jeffrey D. Ullman, Principles of Compiler Design, Addison Wesley, 1979, pp. 6, 409, 412-413 and David Gries, Compiler Construction for Digital Computers, Wiley, 1971. The conventional compiler, although it utilizes basic block information in the performance of its tasks, provides an output stream of sequential instructions without any basic block designations. The TOLL software, in this illustrated embodiment of the present invention, is designed to operate on the formed basic blocks (BBs) which are created within a conventional compiler. In each of the conventional SESE basic blocks there is exactly one branch (at the end of the block) and there are no control dependencies. The only relevant dependencies within the block are those between the resources required by the instructions.

The output of the compiler 100 in the basic block format is illustrated in FIG. 2. Referring to FIG. 1, the TOLL software 110 of the present invention being processed in the computer 160 performs three basic determining functions on the compiler output 100. These functions are to analyze the resource usage of the instructions 120, extend intelligence for each instruction in each basic block 130, and to build execution sets composed of one or more basic blocks 140. The resulting output of these three basic functions 120, 130, and 140 from processor 160 is the TOLL software output 150 of the present invention.

As noted above, the TOLL software of the present invention operates on the compiler output 100 for any given program only once and without user intervention.

The functions 120, 130, 140 of the TOLL software 110 are, for example, to analyze the instruction stream in each basic block for natural concurrencies, to perform a translation of the instruction stream onto the actual hardware system of the present invention, to alleviate any hardware induced idiosyncrasies that may result from the translation process, and to encode the resulting instruction stream into an actual machine language to be used with the hardware of the present invention. The TOLL software 110 performs these functions by analyzing the instruction stream and then assigning processor elements and resources as a result thereof. In one particular embodiment, the processors are context free. The TOLL software 110 provides the "synchronization" of the overall system by, for example, assigning appropriate firing times to each instruction in the output instruction stream.

Instructions can be dependent on one another in a variety of ways although there are only three basic types of dependencies. First, there are procedural dependencies due to the actual structure of the instruction stream; that is, instructions may follow one another in other than a sequential order due to branches, jumps, etc. Second, operational dependencies are due to the finite number of hardware elements present in the system. These hardware elements include the general registers, condition code storage, stack pointers, processor elements, and memory. Thus if two instructions are to execute in parallel, they must not require the same hardware element unless they are both reading that element (provided of course, that the element is capable of being read simultaneously). Finally, there are data dependencies between instructions in the instruction stream. This form of dependency will be discussed at length later and is particularly important if the processor elements include pipelined processors. Within a basic block, however, only data and operational dependencies are present.

The TOLL software 110 must maintain the proper execution of a program. Thus, the TOLL software must assure that the code output 150, which represents instructions which will execute in parallel, generates the same results as those of the original serial code. To do this, the code 150 must access the resources in the same relative sequence as the serial code for instructions that are dependent on one another; that is, the relative ordering must be satisfied. However, independent sets of instructions may be effectively executed out of sequence.

Table 1 sets forth an example of a SESE basic block representing the inner loop of a matrix multiply routine. While this example will be used throughout this specification, the teachings of the present invention are applicable to any instruction stream. Referring to Table 1, the instruction designation is set forth in the right hand column and a conventional object code functional representation, for this basic block, is represented in the left hand column.

TABLE 1
______________________________________
OBJECT CODE            INSTRUCTION
______________________________________
LD R0, (R10) +         I0
LD R1, (R11) +         I1
MM R0, R1, R2          I2
ADD R2, R3, R3         I3
DEC R4                 I4
BRNZR LOOP             I5
______________________________________

The instruction stream contained within the SESE basic block set forth in Table 1 performs the following functions. In instruction I0, register R0 is loaded with the contents of memory whose address is contained in R10. The instruction shown above increments the contents of R10 after the address has been fetched from R10. The same statement can be made for instruction I1, with the exception that register R1 is loaded and register R11 is incremented. Instruction I2 causes the contents of registers R0 and R1 to be multiplied and the result is stored in register R2. In instruction I3, the contents of register R2 and register R3 are added and the result is stored in register R3. In instruction I4, register R4 is decremented. Instructions I2, I3 and I4 also generate a set of condition codes that reflect the status of their respective execution. In instruction I5, the contents of register R4 are indirectly tested for zero (via the condition codes generated by instruction I4). A branch occurs if the decrement operation produced a non-zero value; otherwise execution proceeds with the first instruction of the next basic block.
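The serial behavior just described can be illustrated with a short Python emulation of the six instructions of Table 1. This is a sketch only: the flat `memory` list, the unit post-increment applied to R10 and R11, the initial register values, and the name `inner_loop` are assumptions made for illustration and do not appear in the specification.

```python
def inner_loop(memory, r10, r11, count):
    """Serially emulate the basic block of Table 1 and return R3.

    Assumptions (not from the specification): memory is a flat list,
    post-increment steps by one element, and count is at least 1.
    """
    regs = {"R0": 0, "R1": 0, "R2": 0, "R3": 0,
            "R4": count, "R10": r10, "R11": r11}
    while True:
        # I0: LD R0, (R10)+   load through R10, then increment R10
        regs["R0"] = memory[regs["R10"]]
        regs["R10"] += 1
        # I1: LD R1, (R11)+   load through R11, then increment R11
        regs["R1"] = memory[regs["R11"]]
        regs["R11"] += 1
        # I2: MM R0, R1, R2   multiply R0 by R1 into R2
        regs["R2"] = regs["R0"] * regs["R1"]
        # I3: ADD R2, R3, R3  accumulate the product into R3
        regs["R3"] = regs["R2"] + regs["R3"]
        # I4: DEC R4          decrement the loop counter
        regs["R4"] -= 1
        # I5: BRNZR LOOP      loop again while the decrement gave non-zero
        if regs["R4"] == 0:
            break
    return regs["R3"]


# Dot product of [1, 2, 3] and [4, 5, 6] stored back to back in memory.
result = inner_loop([1, 2, 3, 4, 5, 6], r10=0, r11=3, count=3)
```

With the operands shown, the loop accumulates 1*4 + 2*5 + 3*6 in R3.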

Referring to FIG. 1, the first function performed by the TOLL software 110 is to analyze the resource usage of the instructions. In the illustrated example, these are instructions I0 through I5 of Table 1. The TOLL software 110 thus analyzes each instruction to ascertain the resource requirements of the instruction.

This analysis is important in determining whether or not any resources are shared by any instructions and, therefore, whether or not the instructions are independent of one another. Instructions that are independent do not rely on one another for any information, nor do they share any hardware resources in other than a read only manner. Such mutually independent instructions can be executed in parallel and are termed "naturally concurrent."

On the other hand, instructions that are dependent on one another can be formed into a set wherein each instruction in the set is dependent on every other instruction in that set. The dependency may not be direct. The set can be described by the instructions within the set, or conversely, by the resources used by the instructions in the set. Instructions within different sets are completely independent of one another, that is, there are no resources shared by the sets. Hence, the sets are independent of one another.

In the example of Table 1, the TOLL software will determine that there are two independent sets of dependent instructions:

______________________________________
Set 1:   CC1:   I0, I1, I2, I3
Set 2:   CC2:   I4, I5
______________________________________

As can be seen, instructions I4 and I5 are independent of instructions I0-I3. In set 2, I5 is directly dependent on I4. In set 1, I2 is directly dependent on I0 and I1. Instruction I3 is directly dependent on I2 and indirectly dependent on I0 and I1.

The TOLL software of the present invention detects these independent sets of dependent instructions and assigns a condition code group of designation(s), such as CC1 and CC2, to each set. This avoids the operational dependency that would occur if only one group or set of condition codes were available to the instruction stream.
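One way to picture the detection of these independent sets is as a grouping of instructions that are linked, directly or indirectly, by a shared resource that at least one of them writes. The Python sketch below uses a simple union-find for this purpose. It is an illustration under stated assumptions, not the TOLL implementation: the read and write sets are transcribed by hand from Table 1, I5 is modeled as reading R4 (which it tests indirectly through the condition codes of I4), condition codes are left out of the resource sets because their group is only assigned after the grouping, and the names `independent_sets` and `TABLE_1` are hypothetical.

```python
def independent_sets(instructions):
    """Group instructions into independent sets of dependent instructions.

    instructions maps a name to (reads, writes), each a set of resources.
    Two instructions are linked when one writes a resource the other
    reads or writes; read-only sharing does not link them.
    """
    names = list(instructions)
    parent = {name: name for name in names}

    def find(name):
        # Follow parent links up to the representative of the set.
        while parent[name] != name:
            name = parent[name]
        return name

    for i, a in enumerate(names):
        reads_a, writes_a = instructions[a]
        for b in names[i + 1:]:
            reads_b, writes_b = instructions[b]
            if writes_a & (reads_b | writes_b) or writes_b & reads_a:
                parent[find(b)] = find(a)

    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    # Give each set its own condition code group designation.
    return {f"CC{k}": members
            for k, members in enumerate(groups.values(), start=1)}


# Read and write sets transcribed from Table 1 (an assumption of this sketch).
TABLE_1 = {
    "I0": ({"R10", "MEM"}, {"R0", "R10"}),
    "I1": ({"R11", "MEM"}, {"R1", "R11"}),
    "I2": ({"R0", "R1"}, {"R2"}),
    "I3": ({"R2", "R3"}, {"R3"}),
    "I4": ({"R4"}, {"R4"}),
    "I5": ({"R4"}, set()),
}

sets = independent_sets(TABLE_1)
```

For the basic block of Table 1 this yields the two sets given above, with I0 through I3 in the first group and I4 and I5 in the second.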

In other words, the results of the execution of instructions I0 and I1 are needed for the execution of instruction I2. Similarly, the results of the execution of instruction I2 are needed for the execution of instruction I3. In performing this analysis, the TOLL software 110 determines if an instruction will perform a read and/or a write to a resource. This functionality is termed the resource requirement analysis of the instruction stream.

It should be noted that, unlike the teachings of the prior art, the present invention teaches that it is not necessary for dependent instructions to execute on the same processor element. The determination of dependencies is needed only to determine condition code sets and to determine instruction firing times, as will be described later. The present invention can execute dependent instructions on different processor elements, in one illustrated embodiment, because of the context free nature of the processor elements and the total coupling of the processor elements to the shared resources, such as the register files, as will also be described below.

The results of the analysis stage 120, for the example set forth in Table 1, are set forth in Table 2.

TABLE 2
______________________________________
INSTRUCTION    FUNCTION
______________________________________
I0             Memory Read, Reg. Write, Reg. Read & Write
I1             Memory Read, Reg. Write, Reg. Read & Write
I2             Two Reg. Reads, Reg. Write, Set Cond. Code (Set #1)
I3             Two Reg. Reads, Reg. Write, Set Cond. Code (Set #1)
I4             Read Reg., Reg. Write, Set Cond. Code (Set #2)
I5             Read Cond. Code (Set #2)
______________________________________

In Table 2, for instructions I0 and I1, a register is read and written followed by a memory read (at a distinct address), followed by a register write. Likewise, condition code writes and register reads and writes occur for instructions I2 through I4. Finally, instruction I5 is a simple read of a condition code storage register and a resulting branch or loop.

The second step or pass 130 through the SESE basic block 100 is to add or extend intelligence to each instruction within the basic block. In the preferred embodiment of the invention, this is the assignment of an instruction's execution time relative to the execution times of the other instructions in the stream, the assignment of a processor number on which the instruction is to execute and the assignment of any so-called static shared context storage mapping information that may be needed by the instruction.

In order to assign the firing time to an instruction, the temporal usage of each resource required by the instruction must be considered. In the illustrated embodiment, the temporal usage of each resource is characterized by a "free time" and a "load time." The free time is the last time the resource was read or written by an instruction. The load time is the last time the resource was modified by an instruction. If an instruction is going to modify a resource, it must execute the modification after the last time the resource was used, in other words, after the free time. If an instruction is going to read the resource, it must perform the read after the last time the resource has been loaded, in other words, after the load time.

The relationship between the temporal usage of each resource and the actual usage of the resource is as follows. If an instruction is going to write/modify the resource, the last time the resource is read or written by other instructions (i.e., the "free time" for the resource) plus one time interval will be the earliest firing time for this instruction. The "plus one time interval" comes from the fact that an instruction is still using the resource during the free time. On the other hand, if the instruction reads a resource, the last time the resource is modified by other instructions (i.e., the load time for the resource) plus one time interval will be the earliest instruction firing time. The "plus one time interval" comes from the time required for the instruction that is performing the load to execute.

The discussion above assumes that the exact location of the resource that is accessed is known. This is always true of resources that are directly named such as general registers and condition code storage. However, memory operations may, in general, be to locations unknown at compile time. In particular, addresses that are generated by effective addressing constructs fall in this class. In the previous example, it has been assumed (for the purposes of communicating the basic concepts of TOLL) that the addresses used by instructions I0 and I1 are distinct. If this were not the case, the TOLL software would assure that only those instructions that did not use memory would be allowed to execute in parallel with an instruction that was accessing an unknown location in memory.

The instruction firing time is evaluated by the TOLL software 110 for each resource that the instruction uses. These "candidate" firing times are then compared to determine which is the largest or latest time. The latest time determines the actual firing time assigned to the instruction. At this point, the TOLL software 110 updates all of the resources' free and load times to reflect the firing time assigned to the instruction. The TOLL software 110 then proceeds to analyze the next instruction.
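The free time and load time bookkeeping described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than the TOLL code itself: time intervals are plain integers (16 stands for T16), every resource is taken to be unused before the block starts, memory is omitted because the two loads are assumed to use distinct addresses, and the names `assign_firing_times` and `BLOCK` are hypothetical.

```python
def assign_firing_times(instructions, start):
    """Assign a firing time to each instruction, in stream order.

    instructions maps a name to (reads, writes), each a set of resources.
    A read must follow the resource's load time by one interval; a write
    must follow the resource's free time by one interval.
    """
    free_time = {}   # last interval in which each resource was read or written
    load_time = {}   # last interval in which each resource was modified
    firing = {}
    for name, (reads, writes) in instructions.items():
        candidates = [start]
        candidates += [load_time.get(r, start - 1) + 1 for r in reads]
        candidates += [free_time.get(r, start - 1) + 1 for r in writes]
        ift = max(candidates)            # the latest candidate wins
        firing[name] = ift
        # Update the resource times to reflect the assigned firing time.
        for r in reads | writes:
            free_time[r] = max(free_time.get(r, start - 1), ift)
        for r in writes:
            load_time[r] = ift
    return firing


# Resource usage of the Table 1 block, with one condition code group per
# independent set (CC1 for I2 and I3, CC2 for I4 and I5).
BLOCK = {
    "I0": ({"R10"}, {"R0", "R10"}),
    "I1": ({"R11"}, {"R1", "R11"}),
    "I2": ({"R0", "R1"}, {"R2", "CC1"}),
    "I3": ({"R2", "R3"}, {"R3", "CC1"}),
    "I4": ({"R4"}, {"R4", "CC2"}),
    "I5": ({"CC2"}, set()),
}

times = assign_firing_times(BLOCK, start=16)
```

Starting the block at interval 16, the sketch places I0, I1 and I4 in T16, I2 and I5 in T17, and I3 in T18.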

There are many methods available for determining inter-instruction dependencies within a basic block. The previous discussion is just one possible implementation assuming a specific compiler-TOLL partitioning. Many other compiler-TOLL partitionings and methods for determining inter-instruction dependencies may be possible and realizable to one skilled in the art. Thus, the illustrated TOLL software uses a linked list analysis to represent the data dependencies within a basic block. Other possible data structures that could be used are trees, stacks, etc.

Assume a linked list representation is used for the analysis and representation of the inter-instruction dependencies. Each register is associated with a set of pointers to the instructions that use the value contained in that register. For the matrix multiply example in Table 1, the resource usage is set forth in Table 3:

TABLE 3
______________________________________
Resource    Loaded By    Read By
______________________________________
R0          I0           I2
R1          I1           I2
R2          I2           I3
R3          I3           I3, I2
R4          I4           I5
R10         I0           I0
R11         I1           I1
______________________________________

Thus, by following the "Read by" links and knowing the resource utilization for each instruction, the independencies of Sets 1 and 2, above, are constructed in the analyze instruction stage 120 (FIG. 1) by TOLL 110.

For purposes of analyzing further the example of Table 1, it is assumed that the basic block commences with an arbitrary time interval in an instruction stream, such as, for example, time interval T16. In other words, this particular basic block in time sequence is assumed to start with time interval T16. The results of the analysis in stage 120 are set forth in Table 4.

TABLE 4
______________________________________
REG    I0     I1     I2     I3     I4     I5
______________________________________
R0     T16           T17
R1            T16    T17
R2                   T17    T18
R3                          T18
R4                                 T16
CC1                  T17    T18
CC2                                       T17
R10    T16
R11           T16
______________________________________

The vertical direction in Table 4 represents the general registers and condition code storage registers. The horizontal direction in the table represents the instructions in the basic block example of Table 1. The entries in the table represent usage of a register by an instruction. Thus, instruction I0 requires that register R10 be read and written and register R0 written at time T16, the start of execution of the basic block.

Under the teachings of the present invention, there is no reason that registers R1, R11, and R4 cannot also have operations performed on them during time T16. The three instructions, I0, I1, and I4, are data independent of each other and can be executed concurrently during time T16. Instruction I2, however, requires first that registers R0 and R1 be loaded so that the results of the load operation can be multiplied. The results of the multiplication are stored in register R2. Although register R2 could in theory be operated on in time T16, instruction I2 is data dependent upon the results of loading registers R0 and R1, which occurs during time T16. Therefore, the completion of instruction I2 must occur during or after time frame T17. Hence, in Table 4 above, the entry T17 for the intersection of instruction I2 and register R2 is underlined because it is data dependent. Likewise, instruction I3 requires data in register R2 which first occurs during time T17. Hence, instruction I3 can operate on register R2 only during or after time T18. Instruction I5 depends upon the reading of the condition code storage CC2 which is updated by instruction I4. The reading of the condition code storage CC2 is data dependent upon the results stored in time T16 and, therefore, must occur during or after the next time, T17.

Hence, in stage 130, the object code instructions are assigned "instruction firing times" (IFTs) as set forth in Table 5 based upon the above analysis.

TABLE 5
______________________________________
OBJECT CODE      INSTRUCTION FIRING
INSTRUCTION      TIME (IFT)
______________________________________
I0               T16
I1               T16
I2               T17
I3               T18
I4               T16
I5               T17
______________________________________

Each of the instructions in the sequential instruction stream in a basic block can be performed in the assigned time intervals. As is clear in Table 5, the same six instructions of Table 1, normally processed sequentially in six cycles, can be processed, under the teachings of the present invention, in only three firing times: T16, T17, and T18. The instruction firing time (IFT) provides the "time-driven" feature of the present invention.

The next function performed by stage 130, in the illustrated embodiment, is to reorder the natural concurrencies in the instruction stream according to instruction firing times (IFTs) and then to assign the instructions to the individual logical parallel processors. It should be noted that the reordering is only required due to limitations in currently available technology. If true fully associative memories were available, the reordering of the stream would not be required and the processor numbers could be assigned in a first come, first served manner. The hardware of the instruction selection mechanism could be appropriately modified by one skilled in the art to address this mode of operation.

For example, assuming currently available technology, and a system with four parallel processor elements (PEs) and a branch execution unit (BEU) within each LRD, the processor elements and the branch execution unit can be assigned, under the teachings of the present invention, as set forth in Table 6 below. It should be noted that the processor elements execute all non-branch instructions, while the branch execution unit (BEU) of the present invention executes all branch instructions. These hardware circuitries will be described in greater detail subsequently.

TABLE 6
______________________________________
Logical Processor
Number               T16    T17           T18
______________________________________
0                    I0     I2            I3
1                    I1     --            --
2                    I4     --            --
3                    --     --            --
BEU                  --     I5 (delay)    --
______________________________________

Hence, under the teachings of the present invention, during time interval T16, parallel processor elements 0, 1, and 2 concurrently process instructions I0, I1, and I4 respectively. Likewise, during the next time interval T17, parallel processor element 0 and the BEU concurrently process instructions I2 and I5 respectively. And finally, during time interval T18, processor element 0 processes instruction I3. During instruction firing times T16, T17, and T18, parallel processor element 3 is not utilized in the example of Table 1. In actuality, since the last instruction is a branch instruction, the branch cannot occur until the last processing is finished in time T18 for instruction I3. A delay field is built into the processing of instruction I5 so that even though it is processed in time interval T17 (the earliest possible time), its execution is delayed so that looping or branching out occurs after instruction I3 has executed.
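The reordering by firing time and the assignment of logical processor numbers for this example can be sketched as follows. The sketch is an assumption-laden illustration, not the TOLL algorithm: within each firing time it hands out the lowest unused logical processor number in stream order, it sends every branch to the BEU, and it computes the branch delay as the gap between the branch's firing time and the last firing time in the block, which is one plausible reading of the delay field. The names `assign_processors` and `IFT` are hypothetical, and the case of more ready instructions than processor elements is simply rejected.

```python
def assign_processors(firing_times, branches, num_pes=4):
    """Reorder by firing time and assign logical processor numbers.

    firing_times maps an instruction name to an integer firing time, in
    original stream order. branches is the set of branch instructions,
    which go to the branch execution unit (BEU) instead of a PE.
    """
    # sorted() is stable, so stream order is kept within a firing time.
    order = sorted(firing_times, key=lambda name: firing_times[name])
    next_pe = {}        # next unused logical processor number per firing time
    assignment = {}
    for name in order:
        ift = firing_times[name]
        if name in branches:
            assignment[name] = "BEU"
            continue
        lpn = next_pe.get(ift, 0)
        if lpn >= num_pes:
            raise ValueError(f"no free processor element at time {ift}")
        assignment[name] = lpn
        next_pe[ift] = lpn + 1
    # Assumed delay rule: hold each branch until the block's last firing time.
    last = max(firing_times.values())
    delays = {name: last - firing_times[name] for name in branches}
    return order, assignment, delays


# Firing times of Table 5, with 16 standing for T16.
IFT = {"I0": 16, "I1": 16, "I2": 17, "I3": 18, "I4": 16, "I5": 17}
order, assignment, delays = assign_processors(IFT, branches={"I5"})
```

For the firing times of Table 5 this reproduces the layout of Table 6: I0, I1 and I4 on logical processors 0, 1 and 2, I2 and I3 on logical processor 0, and I5 on the BEU with a delay of one interval.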

In summary, the TOLL software 110 of the present illustrated embodiment, in stage 130, examines each individual instruction and its resource usage both as to type and as to location (if known) (e.g., Table 3). It then assigns instruction firing times (IFTs) on the basis of this resource usage (e.g., Table 4), reorders the instruction stream based upon these firing times (e.g., Table 5) and assigns logical processor numbers (LPNs) (e.g., Table 6) as a result thereof.

The extended intelligence information involving the logical processor number (LPN) and the instruction firing time (IFT) is, in the illustrated embodiment, added to each instruction of the basic block as shown in FIGS. 3 and 4. As will also be pointed out subsequently, the extended intelligence (EXT) for each instruction in a basic block (BB) will be correlated with the actual physical processor architecture of the present invention. The correlation is performed by the system hardware. It is important to note that the actual hardware may contain fewer, the same number of, or more physical processor elements than the number of logical processor elements.

The Shared Context Storage Mapping (SCSM) information shown in FIG. 4, which is attached to each instruction in this illustrated and preferred embodiment of the invention, has a static and a dynamic component. The static component of the SCSM information is attached by the TOLL software or compiler and is a result of the static analysis of the instruction stream. Dynamic information is attached at execution time by a logical resource driver (LRD) as will be discussed later.

At this stage 130, the illustrated TOLL software 110 has analyzed the instruction stream as a set of single entry single exit (SESE) basic blocks (BBs) for natural concurrencies that can be processed individually by separate processor elements (PEs) and has assigned to each instruction an instruction firing time (IFT) and a logical processor number (LPN). Under the teachings of the present invention, the instruction stream is thus pre-processed by the TOLL software to statically allocate all processing resources in advance of execution. This is done once for any given program and is applicable to any one of a number of different program languages such as FORTRAN, COBOL, PASCAL, BASIC, etc.

Referring to FIG. 5, a series of basic blocks (BBs) can form a single execution set (ES) and in stage 140, the TOLL software 110 builds such execution sets (ESs). Once the TOLL software identifies an execution set 500, header 510 and/or trailer 520 information is added at the beginning and/or end of the set. In the preferred embodiment, only header information 510 is attached at the beginning of the set, although the invention is not so limited.

Under the teachings of the present invention, basic blocks generally follow one another in the instruction stream. There may be no need for reordering of the basic blocks even though individual instructions within a basic block, as discussed above, are reordered and assigned extended intelligence information. However, the invention is not so limited. Each basic block is single entry and single exit (SESE) with the exit through a branch instruction. Typically, the branch to another instruction is within a localized neighborhood such as within 400 instructions of the branch. The purpose of forming the execution sets (stage 140) is to determine the minimum number of basic blocks that can exist within an execution set such that the number of "instruction cache faults" is minimized. In other words, in a given execution set, branches or transfers out of an execution set are statistically minimized. The TOLL software in stage 140, can use a number of conventional techniques for solving this linear programming-like problem, a problem which is based upon branch distances and the like. The purpose is to define an execution set as set forth in FIG. 5 so that the execution set can be placed in a hardware cache, as will be discussed subsequently, to minimize instruction cache faults (i.e., transfers out of the execution set).
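
The objective described above, minimizing transfers out of an execution set, can be illustrated with a short Python sketch. It does not show how the TOLL software solves the problem. It only evaluates two candidate groupings against a made-up branch profile. The block names, the branch counts, and the function transfers_out are all hypothetical.

```python
def transfers_out(partition, branches):
    """Sum the weights of branches whose source and target are in different sets."""
    set_of = {bb: i for i, es in enumerate(partition) for bb in es}
    return sum(w for src, dst, w in branches if set_of[src] != set_of[dst])


# Hypothetical branch profile: (source block, target block, estimated count)
BRANCHES = [
    ("BB0", "BB1", 1),
    ("BB1", "BB1", 100),  # tight loop that stays inside one basic block
    ("BB1", "BB2", 1),
    ("BB2", "BB0", 10),   # outer loop back to the first block
    ("BB2", "BB3", 1),
]

# Two candidate ways of grouping the basic blocks into execution sets
candidate_a = [["BB0", "BB1"], ["BB2", "BB3"]]
candidate_b = [["BB0", "BB1", "BB2"], ["BB3"]]

cost_a = transfers_out(candidate_a, BRANCHES)
cost_b = transfers_out(candidate_b, BRANCHES)
best = min((candidate_a, candidate_b), key=lambda p: transfers_out(p, BRANCHES))
print(cost_a, cost_b, best)
```

With this profile the second grouping keeps the outer loop inside one execution set and therefore incurs fewer transfers out of the set, which is the sense in which instruction cache faults are minimized.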

What has been set forth above is an example, illustrated using Tables 1 through 6, of the TOLL software 110 in a single context application. In essence, the TOLL software determines the natural concurrencies within the instruction streams for each basic block within a given program. The TOLL software adds, in the illustrated embodiment, an instruction firing time (IFT) and a logical processor number (LPN) to each instruction in accordance with the determined natural concurrencies. All processing resources are statically allocated in advance of processing. The TOLL software of the present invention can be used in connection with a number of simultaneously executing different programs, each program being used by the same or different users on a processing system of the present invention as will be described and explained below.

3. General Hardware Description

Referring to FIG. 6, the block diagram format of the system architecture of the present invention, termed the TDA system architecture 600, includes a memory sub-system 610 interconnected to a plurality of logical resource drivers (LRDs) 620 over a network 630. The logical resource drivers 620 are further interconnected to a plurality of processor elements 640 over a network 650. Finally, the plurality of processor elements 640 are interconnected over a network 670 to the shared resources containing a pool of register set and condition code set files 660. The LRD-memory network 630, the PE-LRD network 650, and the PE-context file network 670 are full access networks that could be composed of conventional crossbar networks, omega networks, banyan networks, or the like. The networks are full access (non-blocking in space) so that, for example, any processor element 640 can access any register file or condition code storage in any context (as defined hereinbelow) file 660. Likewise, any processor element 640 can access any logical resource driver 620 and any logical resource driver 620 can access any portion of the memory subsystem 610. In addition, the PE-LRD and PE-context file networks are non-blocking in time. In other words, these two networks guarantee access to any resource from any resource regardless of load conditions on the network. The architecture of the switching elements of the PE-LRD network 650 and the PE-context file network 670 is considerably simplified since the TOLL software guarantees that collisions in the network will never occur. The diagram of FIG. 6 represents an MIMD system wherein each context file 660 corresponds to at least one user program.

The memory subsystem 610 can be constructed using a conventional memory architecture and conventional memory elements. There are many such architectures and elements that could be employed by a person skilled in the art and which would satisfy the requirements of this system. For example, a banked memory architecture could be used. (High Speed Memory Systems, A. V. Pohm and O. P. Agrawal, Reston Publishing Co., 1983.)

The logical resource drivers 620 are unique to the system architecture 600 of the present invention. Each illustrated LRD provides the data cache and instruction selection support for a single user (who is assigned a context file) on a timeshared basis. The LRDs receive execution sets from the various users wherein one or more execution sets for a context are stored on an LRD. The instructions within the basic blocks of the stored execution sets are stored in queues based on the previously assigned logical processor number. For example, if the system has 64 users and 8 LRDs, 8 users would share an individual LRD on a timeshared basis. The operating system determines which user is assigned to which LRD and for how long. The LRD is detailed at length subsequently.
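
The queueing by logical processor number and the sharing of LRDs among users can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the round-robin assignment of users to LRDs stands in for whatever policy the operating system applies, the execution set reuses the firing times and logical processor numbers of the Table 1 example with the branch instruction I5 omitted, and the names build_lpn_queues and instructions_ready are hypothetical.

```python
from collections import defaultdict, deque

NUM_USERS = 64
NUM_LRDS = 8

# Assumed policy: users are spread round-robin over the LRDs.
users_on_lrd = defaultdict(list)
for user in range(NUM_USERS):
    users_on_lrd[user % NUM_LRDS].append(user)

# (instruction, firing time, logical processor number) for one execution set
EXECUTION_SET = [
    ("I0", 16, 0), ("I1", 16, 1), ("I4", 16, 2),
    ("I2", 17, 0), ("I3", 18, 0),
]


def build_lpn_queues(execution_set):
    """Store the instructions in one queue per logical processor number."""
    queues = defaultdict(deque)
    for name, ift, lpn in execution_set:
        queues[lpn].append((ift, name))
    return queues


def instructions_ready(queues, firing_time):
    """Pop from the head of each queue the instruction due at firing_time, if any."""
    ready = {}
    for lpn, queue in queues.items():
        if queue and queue[0][0] == firing_time:
            ready[lpn] = queue.popleft()[1]
    return ready


queues = build_lpn_queues(EXECUTION_SET)
print(len(users_on_lrd[0]))            # users sharing LRD 0
print(instructions_ready(queues, 16))  # instructions selected for T16
```

Because the instructions arrive already ordered by firing time, selecting the instructions for a given firing time in this sketch only requires inspecting the head of each queue.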

The processor elements 640 are also unique to the TDA system architecture and will be discussed later. These processor elements in one particular aspect of the invention display a context free stochastic property in which the future state of the system depends only on the present state of the system and not on the path by which the present state was achieved. As such, architecturally, the context free processor elements are uniquely different from conventional processor elements in two ways. First, the elements have no internal permanent storage or remnants of past events such as general purpose registers or program status words. Second, the elements do not perform any routing or synchronization functions. These tasks are performed by the TOLL software and are implemented in the LRDs. The significance of the architecture is that the context free processor elements of the present invention are a true shared resource to the LRDs. In another preferred particular embodiment of the invention wherein pipelined processor elements are employed, the processors are not strictly context free as was described previously.

Finally, the register set and condition code set files 660 can also be constructed of commonly available components such as AMD 29300 series register files, available from Advanced Micro Devices, 901 Thompson Pl., P.O. Box 3453, Sunnyvale, Calif. 94088. However, the particular configuration of the files 660 illustrated in FIG. 6 is unique under the teachings of the present invention and will be discussed later.

The general operation of the present invention, based upon the example set forth in Table 1, is illustrated with respect to the processor-context register file communication in FIGS. 7a, 7b, and 7c. As mentioned, the time-driven control of the present illustrated embodiment of the invention is found in the addition of the extended intelligence relating to the logical processor number (LPN) and the instruction firing time (IFT) as specifically set forth in FIG. 4. FIG. 7 generally represents the configuration of the processor elements PE0 through PE3 with registers R0 through R5, . . . , R10 and R11 of the register set and condition code set file 660.

In explaining the operation of the TDA system architecture 600 for the single user example in Table 1, reference is made to Tables 3 through 5. In the example, for instruction firing time T16, the context file-PE network 670 interconnects processor element PE0 with registers R0 and R10, processor element PE1 with registers R1 and R11, and processor element PE2 with register R4. Hence, during time T16, the three processor elements PE0, PE1, and PE2 process instructions I0, I1, and I4 concurrently and store the results in registers R0, R10, R1, R11, and R4. During time T16, the LRD 620 selects and delivers the instructions that can fire (execute) during time T17 to the appropriate processor elements. Referring to FIG. 7b, during instruction firing time T17, only processor element PE0, which is now assigned to process instruction I2, interconnects with registers R0, R1, and R2. The BEU (not shown in FIGS. 7a, 7b, and 7c) is also connected to the condition code storage. Finally, referring to FIG. 7c, during instruction firing time T18, processor element PE0 is connected to registers R2 and R3.

Several important observations need to be made. First, when a particular processor element (PE) places results of its operation in a register, any processor element, during a subsequent instruction firing time (IFT), can be interconnected to that register as it executes its operation. For example, processor element PE1 for instruction I1 loads register R1 with the contents of a memory location during IFT T16 as shown in FIG. 7a. During instruction firing time T17, processor element PE0 is interconnected with register R1 to perform an additional operation on the results stored therein. Under the teachings of the present invention, each processor element (PE) is "totally coupled" to the necessary registers in the register file 660 during any particular instruction firing time (IFT) and, therefore, there is no need to move the data out of the register file for delivery to another resource; e.g. in another processor's register as in some conventional approaches.

In other words, under the teachings of the present invention, each processor element can be totally coupled, during any individual instruction firing time, to any shared register in files 660. In addition, under the teachings of the present invention, none of the processor elements has to contend (or wait) for the availability of a particular register or for results to be placed in a particular register as is found in some prior art systems. Also, during any individual firing time, any processor element has full access to any configuration of registers in the register set file 660 as if such registers were its own internal registers.

Hence, under the teachings of the present invention, the intelligence added to the instruction stream is based upon detected natural concurrencies within the object code. The detected concurrencies are analyzed by the TOLL software, which in one illustrated embodiment assigns individual logical processor numbers (LPNs) to process the instructions in parallel, and unique firing times (IFTs) so that each processor element (PE), for its given instruction, will have all necessary resources available for processing according to its instruction requirements. In the above example, the logical processor numbers correspond to the actual processor assignment, that is, LPN0 corresponds to PE0, LPN1 to PE1, LPN2 to PE2, and LPN3 to PE3. The invention is not so limited since any order such as LPN0 to PE1, LPN1 to PE2, etc. could be used. Or, if the TDA system had more or fewer than four processors, a different assignment could be used as will be discussed.

The timing control for the TDA system is provided by the instruction firing times, that is, the system is time-driven. As can be observed in FIGS. 7a through 7c, during each individual instruction firing time, the TDA system architecture, composed of the processor elements 640 and the PE-register set file network 670, takes on a new and unique particular configuration fully adapted to enable the individual processor elements to concurrently process instructions while making full use of all the available resources. The processor elements can be context free and thereby data, condition, or information relating to past processing is not required, nor does it exist, internally to the processor element. The context free processor elements react only to the requirements of each individual instruction and are interconnected by the hardware to the necessary shared registers.
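
The time-driven operation shown in FIGS. 7a through 7c can be sketched in Python as follows. This is a sketch under stated assumptions and not a description of the hardware. The text states only that I1 loads a register from memory; the remaining operations (load with address increment, multiply, add, decrement), the initial register and memory values, and all function names are hypothetical and chosen only to match the register connections described above. Each operation is a plain function that reads and writes the shared register file and keeps no state of its own, which is the sense in which the processor elements are context free.

```python
# Shared register file and memory for one context (values are made up).
regs = {"R0": 0, "R1": 0, "R2": 0, "R3": 5, "R4": 3, "R10": 0, "R11": 1}
memory = [6, 7]


def load_inc(dst, addr):
    """Assumed operation: load dst from memory at address addr, then bump addr."""
    def op(r, m):
        r[dst] = m[r[addr]]
        r[addr] += 1
    return op


def mul(a, b, dst):
    def op(r, m):
        r[dst] = r[a] * r[b]
    return op


def add(a, b, dst):
    def op(r, m):
        r[dst] = r[a] + r[b]
    return op


def dec(reg):
    def op(r, m):
        r[reg] -= 1
    return op


# firing time -> {processor element: operation}, following FIGS. 7a, 7b, and 7c
FIRING_TABLE = {
    16: {0: load_inc("R0", "R10"), 1: load_inc("R1", "R11"), 2: dec("R4")},
    17: {0: mul("R0", "R1", "R2")},
    18: {0: add("R2", "R3", "R3")},
}


def run(firing_table, regs, memory):
    """Step through the firing times; operations within one time are independent."""
    for t in sorted(firing_table):
        for pe, op in sorted(firing_table[t].items()):
            op(regs, memory)  # the PE holds no state between firing times


run(FIRING_TABLE, regs, memory)
print(regs)
```

In the sketch, the value loaded into R1 by processor element 1 at T16 is consumed by processor element 0 at T17 directly from the shared register file, with no intervening data movement.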

4. Summary

In summary, the TOLL software 110 for each different program or compiler output 100 analyzes the natural concurrencies existing in each single entry, single exit (SESE) basic block (BB) and adds intelligence, including in one illustrated embodiment, a logical processor number (LPN) and an instruction firing time (IFT), to each instruction. In an MIMD system of the present invention as shown in FIG. 6, each context file would contain data from a different user executing a program. Each user is assigned a different context file and, as shown in FIG. 7, the processor elements (PEs) are capable of individually accessing the necessary resources such as registers and condition code storage required by the instruction. The instruction itself carries the shared resource information (that is, the registers and condition code storage). Hence, the TOLL software statically allocates only once for each program the necessary information for controlling the processing of the instruction in the TDA system architecture illustrated in FIG. 6 to ensure a time-driven decentralized control wherein the memory, the logical resource drivers, the processor elements, and the context shared resources are totally coupled through their respective networks in a pure, non-blocking fashion.

The logical resource drivers (LRDs) 620 receive the basic blocks formed in an execution set and are responsible for delivering each instruction to the selected processor element 640 at the instruction firing time (IFT). While the example shown in FIG. 7 is a simplistic representation for a single user, it is to be expressly understood that the delivery by the logical resource driver 620 of the instructions to the processor elements 640, in a multi-user system, makes full use of the processor elements as will be fully discussed subsequently. Because the timing and the identity of the shared resources and the processor elements are all contained within the extended intelligence added to the instructions by the TOLL software, each processor element 640 can be completely (or in some instances substantially) context free and, in fact, from instruction firing time to instruction firing time can process individual instructions of different users as delivered by the various logical resource drivers. As will be explained, in order to do this, the logical resource drivers 620, in a predetermined order, deliver the instructions to the processor elements 640 through the PE-LRD network 650.

It is the context free nature of the processor elements which allows the independent access by any processor element of the results of data generation/manipulation from any other processor element following the completion of each instruction execution. In the case of processors which are not context free, in order for one processor to access data created by another, specific actions (usually instructions which move data from general purpose registers to memory) are required in order to extract the data from one processor and make it available to another.

It is also the context free nature of the processor elements that permits the true sharing of the processor elements by multiple LRDs. This sharing can be as fine-grained as a single instruction cycle. No programming or special processor operations are needed to save the state of one context (assigned to one LRD), which has control of one or more processor elements, in order to permit control by another context (assigned to a second LRD). In processors which are not context free, which is the case for the prior art, specific programming and special machine operations are required in such state-saving as part of the process of context switching.
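
The instruction-by-instruction sharing of a processor element among contexts can be sketched in Python as follows. This is a sketch under stated assumptions: the two context names, their register contents, the three-operand instruction format, and the function processor_element are hypothetical. The point illustrated is only that the processor element is a function of the instruction and the context file it is connected to, so that switching contexts requires no save or restore step.

```python
import operator

OPS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}


def processor_element(instruction, context_file):
    """Execute one instruction against the context file it is connected to."""
    op, dst, a, b = instruction
    context_file[dst] = OPS[op](context_file[a], context_file[b])


# One register set per context (user); the values are made up.
contexts = {
    "user_a": {"R0": 1, "R1": 2, "R2": 0},
    "user_b": {"R0": 10, "R1": 20, "R2": 0},
}

# Instructions from the two contexts, interleaved one firing time at a time.
stream = [
    ("user_a", ("add", "R2", "R0", "R1")),
    ("user_b", ("mul", "R2", "R0", "R1")),
    ("user_a", ("add", "R2", "R2", "R2")),
    ("user_b", ("sub", "R2", "R2", "R0")),
]

for context_name, instruction in stream:
    # The same processor element serves both contexts with no state saved.
    processor_element(instruction, contexts[context_name])

print(contexts)
```

All state lives in the per-context register sets, so alternating between the two contexts on every instruction leaves each context's results unaffected by the other.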

There is one additional alternative in implementing the processor elements of the present invention, which is a modification to the context free concept: an implementation which provides the physically total interconnection discussed above, but which permits, under program control, a restriction upon the transmission of generated data to the register file following completion of certain instructions.

In a fully context free implementation, at the completion of each instr