United States Patent 5909559
So June 1, 1999

Title

Bus bridge device including data bus of first width for a first processor, memory controller, arbiter circuit and second processor having a different second data width

Abstract

An integrated circuit (2210) provides on a single chip for use with a first processor (106) off-chip, the following combination: first terminals (of 2232) for first processor-related signals and defining a first data width (32-bit), second terminals for external bus-related signals (PCI), third terminals for memory-related signals (of 2258), and a DRAM memory controller (2250) connected to the third terminals. Further on chip is provided an arbiter circuit (2230), a bus bridge circuit (2236) coupled to the DRAM memory controller and to the second terminals, the bus bridge (2236) also coupled to the arbiter (2230), a second processor (2224) having a second data width (16-bit), and a bus interface circuit (2220) coupling the second data width of the second processor (2224) to the first data width. The bus interface circuit (2220) further has bus master and bus slave circuitry coupled between the second processor (2224) and the arbiter circuit (2230). The bus bridge (2236), the bus interface (2220) and the first terminals and the DRAM memory controller (2250) have datapaths selectively interconnected in response to the arbiter circuit (2230). Other devices, systems and methods are also disclosed.


Inventors: So; John Ling Wing (Plano, TX)
Assignee: Texas Instruments Incorporated (Dallas, TX)
Appl. No.: 832892
Filed: April 4, 1997

Current U.S. Class: 710/307
Current International Class: G06F 13/40 (20060101)
Field of Search: 395/306-309,848,856,293

U.S. Patent Documents
4839797    June 1989         Katori et al.
5230039    July 1993         Grossman et al.
5266941    November 1993     Akeley et al.
5450551    September 1995    Amini et al.
5499344    March 1996        Elnashar et al.
5517650    May 1996          Bland et al.
5535340    July 1996         Bell et al.
5546546    August 1996       Bell et al.
5548730    August 1996       Young et al.
5590128    December 1996     Maloney et al.
5590342    December 1996     Marisetty
5594882    January 1997      Bell
5603014    February 1997     Woodring et al.
5619661    April 1997        Crews et al.
5623647    April 1997        Maitra
5625779    April 1997        Solomon et al.
5634114    May 1997          Shipley
5638525    June 1997         Hammond et al.
5640520    June 1997         Tetrick
5678064    October 1997      Kulik et al.
Foreign Patent Documents
WO97/00533    Jan., 1997    WO
WO97/06486    Feb., 1997    WO
Other References
J. Peddie, "Focus--7 of 26 Stories", Computer & Communications OEM Magazine, Nov. 1, 1996, pp. 49-53.
A. MacLellan, "Challenge to Microsoft, SGL to License 8 for OpenGL", Electronic News, Mar. 24, 1997.
J. Yoshida, "Is DVD Becoming the TV/PC Bridge?", Electronic Engineering Times, Mar. 31, 1997, p. 103.
"Sigma Designs Bets 3D Chip will Prove Magic", Electronic News, Dec. 9, 1996.
"Analog Devices Announces First One-Chip Solution for PC Sound and Communications", PR Newswire, Nov. 19, 1996.
P. Glaskowsky, "First Media Processors Reach the Market", Microprocessor Report, Jan. 27, 1997, pp. 10-15.
A. Hierhager, "Web Videoconferencing Making Waves", Electronic Engineering Times, Nov. 18, 1996.
P. Clarke, "Phillips Rolls out Trimedia", EE Times, Mar. 10, 1997, p. 74.
S. Oar, "Designers Find PC-Audio Path Strewn with Static", EE Times, Mar. 3, 1997.
P. Glaskowsky, "Crystal SLIMD Speeds PCI Audio", Microprocessor Report, Mar. 31, 1997, pp. 13-15.
T. Grimm et al, "Defining New Product Concepts and Architectures", Productivity Products for the Mobile Professional, Jan.-Feb. 1997, pp. 39, 46-48.
P. Buckley et al, "Choosing a Platform Architecture for Cost Effective MPEG-2 Video Playback", Intel, Apr. 1996, pp. 3-28.
T. Shanley et al, "Teleconferencing Performance Requirements", PCI System Architecture, pp. 14-17, 31, 32, 34, 1995.
Chapter 5: "The Functional Signal Groups", PCI System Architecture, pp. 53-57.
Chapter 6: "PCI Bus Arbitration", PCI System Architecture, pp. 78-88, 124-127.
Chapter 11: "Interrupt-Related Issues", PCI System Architecture, pp. 209-216.
Chapter 12: "Shared Resource Acquisition", PCI System Architecture, pp. 231-242.
Chapter 13: "The 64-bit PCI Extension", PCI System Architecture, pp. 259-269.
Chapter 15: "Intro to Configuration Address Space", PCI System Architecture, pp. 297-301.
Chapter 17: "Configuration Registers", PCI System Architecture, pp. 329-351.
Chapter 19: "PCI-to-PCI Bridge", PCI System Architecture, pp. 381-392.
Chapter 21: "PCI Cache Support", PCI System Architecture, pp. 471-478, 489.
Chapter 23: "Overview of VL82C59X PCI Chipset", PCI System Architecture, pp. 505-516.
L. Nederlof, "One-Chip TV", 1996 IEEE International Solid-State Circuits Conference, pp. 26-29.
T. Mostad et al, "Designing a Low Cost, High Performance Platform for MPEG-1 Video Playback", Intel Corporation, pp. 3, 4, and 8.
A. Tanenbaum, "Case Study 4: Mach", Modern Operating Systems, pp. 637-652, 660-680, 1992.
F. Hady, "Efficient PCI Performance is no Mean Feat", EE Times, Mar. 10, 1997, p. 104.
Microsoft Corp., "Chapter 3 DirectSound", Version 4.04, Aug. 1996, pp. 1-9.
Primary Examiner: Ray; Gopal C.
Attorney, Agent or Firm: Marshall, Jr.; Robert D. Laws; Gerald E. Donaldson; Richard L.

Claims


What is claimed is:
1. An integrated circuit comprising on a single chip for a first processor off-chip:
first terminals for first processor-related signals and defining a first data width;
second terminals for external bus-related signals;
third terminals for memory-related signals;
a DRAM memory controller connected to said third terminals;
an arbiter circuit;
a bus bridge circuit coupled to said DRAM memory controller and to said second terminals, said bus bridge also coupled to said arbiter;
a second processor having a second data width which is a different data width than said first data width;
a bus interface circuit coupling the second data width of said second processor to the first data width, said bus interface circuit further having bus master and bus slave circuitry coupled between said second processor and said arbiter circuit; and
said bus bridge, said bus interface and said first terminals and said DRAM memory controller having datapaths selectively interconnected in response to said arbiter circuit.

2. The integrated circuit of claim 1 having an interface port for coupling to a third processor of the same type as said second processor, said interface port having a datapath selectively interconnected with said first terminals and said DRAM memory controller in response to said arbiter.

3. The integrated circuit of claim 2 having a graphics port, said graphics port having a datapath selectively interconnected with said first terminals and said DRAM memory controller in response to said arbiter.

4. The integrated circuit of claim 1 having a graphics port, said graphics port having a datapath selectively interconnected with said first terminals and said DRAM memory controller in response to said arbiter.

5. The integrated circuit of claim 1 wherein said second processor comprises a digital signal processor (DSP).

Description

CROSS-REFERENCE TO RELATED APPLICATIONS

(C) Copyright, *M* Texas Instruments Incorporated 1997. A portion of the disclosure of this patent document contains material which is subject to copyright and mask work protection. The copyright and mask work owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure, as it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all copyright and mask work rights whatsoever.

The following simultaneously filed, coassigned patent applications are hereby incorporated herein by reference:

U.S. patent application Ser. No. 08/833,267 entitled DEVICES, METHODS, SYSTEMS AND SOFTWARE PRODUCTS FOR COORDINATION OF COMPUTER MAIN MICROPROCESSOR AND SECOND MICROPROCESSOR COUPLED THERETO.

U.S. patent application Ser. No. 08/833,153 entitled DATA TRANSFER CIRCUITRY, DSP WRAPPER CIRCUITRY AND IMPROVED PROCESSOR DEVICES, METHODS AND SYSTEMS.

U.S. patent application Ser. No. 08/833,152 entitled COMPUTER PROCESSOR DRIVER METHODS, METHODOLOGY, DEVICES AND SYSTEMS.

U.S. patent application Ser. No. 08/833,266 entitled PROCESSOR INTERFACE ENHANCEMENTS METHODS, METHODOLOGY, DEVICES AND SYSTEMS.

U.S. patent application Ser. No. 08/823,251 filed Mar. 24, 1997 entitled PC CIRCUITS, SYSTEMS AND METHODS.

The following coassigned U.S. patents, U.S. patent applications, and laid-open foreign analogs, are hereby incorporated herein by reference:

______________________________________
Ser. Nr./Pat. Nr.     Filing Date        TI Case No.
______________________________________
08/823,251            March 24, 1997     TI-21753
4,577,282             Feb. 22, 1982      TI-9062
4,912,636             Mar. 13, 1987      TI-11961
5,109,494             Dec. 31, 1987      TI-12541A
5,586,275             May 4, 1989        TI-14079A
5,471,292             Nov. 17, 1989      TI-14608C
5,594,914             Sept. 28, 1990     TI-15600A
5,754,837             Dec. 22, 1994      TI-18329
95309209.5            Dec. 18, 1995      TI-18329EU
  lay-open 0718747
5,737,748             Mar. 15, 1995      TI-20201
09/012,813            Jan. 24, 1997      TI-25311
______________________________________

Digital signal processors can be adapted for voice recognition, voice synthesis, image processing, image recognition, and telephone communications for teleconferencing and videoteleconferencing. Examples are the Texas Instruments TMS320C2x, TMS320C5x, TMS320C54x, TMS320C3x, TMS320C4x, TMS320C6x and TMS320C8x DSP chips, as described in coassigned U.S. Pat. Nos. 5,072,418 and 5,099,417, and as to the C8x: coassigned U.S. Pat. No. 5,212,777 "SIMD/MIMD Reconfigurable Multi-Processor and Method of Operation"; coassigned U.S. Pat. No. 5,420,809, Ser. No. 08/160,116 filed Nov. 30, 1993 "Method, Apparatus and System Method for Correlation"; and above-cited U.S. patent application Ser. No. 09/012,813 (C6x), all of which patents and application are hereby incorporated herein by reference.

The above documents describe various computer systems, digital signal processors, and integrated circuits for use in those systems to further disclose some elements utilized in various inventive embodiments for purposes of the present patent application.

Other patent applications and patents are incorporated herein by reference by specific statements to that effect elsewhere in this application.

FIELD OF THE INVENTION

This invention generally relates to improved integrated circuits, computer systems, software products, and processes of operating integrated circuits and computers.

BACKGROUND OF THE INVENTION

Early computers required large amounts of space, occupying whole rooms. Since then, minicomputers and desktop computers have entered the marketplace.

Popular desktop computers have included the "Apple" (Motorola 680x0 microprocessor-based) and "IBM-compatible" (Intel or other x86 microprocessor-based) varieties, also known as personal computers (PCs), which have become very popular for office and home use. Also, high-end desktop computers called workstations, based on a number of superscalar and other very-high-performance microprocessors such as the SuperSPARC microprocessor, have been introduced.

In a further development, a notebook-size or palm-top computer is optionally battery powered for portable user applications. Such notebook and smaller computers challenge the art in demands for conflicting goals of miniaturization, ever higher speed, performance and flexibility, and long life between battery recharges. Also, a desktop enclosure called a docking station receives the portable computer, and improvements in such portable-computer/docking-station systems are desirable. However, all these systems are generally CPU-centric in the sense that the selection of the CPU determines the system's processing capabilities, and add-in cards are added to the CPU to add specific applications or functions, such as modem or multimedia.

Software for computers and the processes and concepts for developing and understanding both hardware and software have spawned an intricate terminology. For an introduction, see references hereby incorporated herein by reference, and listed below:

1. The Computer Glossary, by A. Freedman, AMACOM, American Management Association, New York, in various editions up to 1991 and later.

2. Modern Operating Systems, by A. S. Tanenbaum, Prentice-Hall, Englewood Cliffs, N.J. 1992.

3. Peripheral Component Interconnect (PCI) Bus Specification 2.0, 1993, by PCISIG (Special Interest Group), and its updates.

4. PCI System Architecture, by T. Shanley, Mindshare Press.

5. Microsoft Corporation: publications

A. DirectSound Hardware Abstraction Layer

B. DirectSound Application Programming Interface (API)

C. Microsoft Windows: Guide to Programming, Software Development Kit.

6. Texas Instruments Incorporated: publications

A. TMS320C5x User's Guide, 1993.

B. TCM320ACXX Voice Band Audio Processor--Application Report

Improvements in circuits, integrated circuit devices, computer systems of all types, methods and processes of their operation, and software products, to address all the above-mentioned challenges, among others, are desirable, as described herein.

SUMMARY OF THE INVENTION

Generally, and in one form of the present invention, an integrated circuit provides on a single chip for use with a first processor off-chip, the following combination: first terminals for first processor-related signals and defining a first data width, second terminals for external bus-related signals, third terminals for memory-related signals, and a DRAM memory controller connected to the third terminals. Further on chip is provided an arbiter circuit, a bus bridge circuit coupled to the DRAM memory controller and to the second terminals, the bus bridge also coupled to the arbiter, a second processor having a second data width which differs from the first width, and a bus interface circuit coupling the second data width of said second processor to the first data width. The bus interface circuit further has bus master and bus slave circuitry coupled between the second processor and the arbiter circuit. The bus bridge, the bus interface and the first terminals and the DRAM memory controller have datapaths selectively interconnected in response to the arbiter circuit.
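As a rough illustration of what coupling a 16-bit data width to a 32-bit data width can involve, the following sketch packs pairs of 16-bit words into 32-bit words and unpacks them again. The function names and the low-half-first ordering are assumptions made only for this illustration; the circuit's actual ordering is not stated in this summary.

```python
def pack16_to_32(words16):
    """Pack 16-bit words into 32-bit words, low half first (assumed ordering)."""
    words16 = list(words16)
    if len(words16) % 2:
        words16.append(0)  # pad an odd-length stream with a zero word
    packed = []
    for i in range(0, len(words16), 2):
        lo = words16[i] & 0xFFFF
        hi = words16[i + 1] & 0xFFFF
        packed.append((hi << 16) | lo)
    return packed


def unpack32_to_16(words32):
    """Split each 32-bit word back into two 16-bit words, low half first."""
    unpacked = []
    for w in words32:
        unpacked.append(w & 0xFFFF)
        unpacked.append((w >> 16) & 0xFFFF)
    return unpacked
```

Under these assumptions, two words from the narrower side occupy one transfer on the wider side, and an odd trailing word is padded.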

Other devices, systems and methods are also disclosed.

BRIEF DESCRIPTION OF THE DRAWINGS

In the drawings:

FIG. 1 is a block diagram of improved integrated circuits and computer system embodiments for desktop and mobile computers, television sets, set-top boxes and appliances improved with asymmetrical multiprocessors;

FIG. 2 is a process diagram or method-of-operation diagram showing interrelated improved processes in a network videoconferencing and full-featured system of FIG. 1;

FIG. 3 is an electrical block diagram showing an improved computer system embodiment for telecom, audio, networking, and 3D graphics;

FIG. 4 is an electrical block diagram of another embodiment of an improved computer system for telecom, audio, networking, and 3D graphics;

FIG. 5 is an electrical block diagram of another embodiment of an improved computer system for telecom, audio, networking, and 3D graphics;

FIG. 6 is an electrical block diagram of another embodiment of an improved computer system for telecom, audio, networking, and 3D graphics;

FIG. 7 is an electrical block diagram of another embodiment of an improved computer system for hard disk drive control, telecom, 3D audio, networking, and graphics;

FIG. 8 is a block diagram of improved integrated circuits and computer system embodiments for desktop and mobile computers, television sets, set-top boxes and appliances improved with asymmetrical multiprocessors;

FIG. 9 is an electrical block diagram of another embodiment of an improved computer system for telecom, audio, networking, and graphics;

FIG. 10 is an electrical block diagram of another embodiment of an improved integrated circuit for use in computer system for telecom, audio, networking, and graphics;

FIG. 11 is an electrical block diagram of integrated circuits and buses in another embodiment of an improved computer system for telecom, audio, networking, and graphics;

FIG. 12 is an electrical block diagram of integrated circuits and buses in another embodiment of an improved computer system for telecom, audio, networking, and graphics;

FIG. 13 is an electrical block diagram of integrated circuits and buses in another embodiment of an improved computer system for telecom, audio, networking, and graphics;

FIG. 14 is a set of three bar charts comparing computer power (MIPS--millions of instructions per second) of three alternative systems: 1) a fixed function device, 2) media engine, and 3) the new architecture herein, where each bar chart has left-side bars for host CPU MIPS in given operations and right-side bars for a particular additional device in the given operations;

FIG. 15 is a set diagram with circles each representing a component of an improved system combination, the circles having overlapping regions indicating coupling elements and processes;

FIG. 16 is a process diagram or method-of-operation diagram showing interrelated improved processes and structure in a network videoconferencing and full-featured system of FIG. 1;

FIG. 17 is a block diagram and layout diagram of an improved DSP (digital signal processor) integrated circuit embodiment having a wrapper-and-DSP-core (called VSP herein) and a serial bus backend interface on-chip, the improved integrated circuit connected to busses for some system embodiments herein;

FIG. 18 is a process diagram or method-of-operation diagram showing interrelated improved processes called DirectDSP, DirectDSP HEL (host emulation), DirectDSP HAL (hardware abstraction layer), and VSP Kernel (DSP Real-Time Kernel) herein;

FIG. 19 is an electrical block diagram of an upgradable VSP with overdrive socket in another embodiment of an improved computer system for stereo, telecom, and voice;

FIG. 20 is an electrical block diagram of an upgraded VSP system in another embodiment of an improved computer system for stereo, telecom, and voice;

FIG. 21 is a block diagram and layout diagram of an improved VLIW (very long instruction word) DSP (digital signal processor) integrated circuit embodiment having a wrapper-and-DSP-core (called VSP herein), the improved integrated circuit connected to system embodiments herein;

FIG. 22 is a block diagram and layout diagram of an improved DSP (digital signal processor) integrated circuit north bridge embodiment having a wrapper-and-DSP-core (called VSP herein) and a serial bus backend interface on-chip, the improved integrated circuit connected to ports and busses for some system embodiments herein;

FIG. 23 is an electrical block diagram showing an improved computer system embodiment and its buses, couplings and interconnection for sound, disk, codec and other system components;

FIG. 24 is a process or method flow chart diagram of software product manufacture and use, including parallel compiles of granules, granule allocation process, selective execution of granules and DRAM common data structure;

FIG. 24A is a library of tables for software application programs respectively, each table for a given program having entries for corresponding granules in the program, each granule entry including granule ID, a set of system impact descriptors for the granule, and an associated default host/DSP entry and dynamic host/DSP entry;

FIG. 24B is a process or method flow chart diagram of a portion of a DirectDSP embodiment using the library of FIG. 24A and allocation logic operations for performing resource management and dynamic load balancing for systems herein;

FIG. 25 is a process diagram or method-of-operation diagram showing interrelated improved processes related to DirectX and 32-bit WDM operating system, the improved processes called DirectDSP WDM, DirectDSP HEL, DirectDSP HAL, and VSP Kernel herein;

FIG. 26 is a process or method flow chart diagram of a portion of a DirectDSP embodiment improved for loading audio and modem applications;

FIG. 27 is a process diagram or method-of-operation diagram showing interrelated improved processes related to operating system, DirectDSP HAL, and VSP Kernel herein;

FIG. 28 is a diagram of memory spaces representing a shared memory model utilized in embodiments of processes, devices and systems herein;

FIG. 28A is an electrical circuit diagram of interrupt-related registers and interrupt lines to the PCI bus and to the DSP, used in process, device and system embodiments;

FIG. 29 is a diagram of interrupt levels utilized in connection with hardware interrupts and deferred procedure calls (DPCs) in process, device and system embodiments;

FIG. 30 is a further diagram of interrupt levels over time utilized in connection with hardware interrupts and deferred procedure calls (DPCs) in process, device and system embodiments;

FIG. 31 is a classification diagram of interrupt levels in real-time and dynamic classes in connection with process, device and system embodiments;

FIG. 32 is a further diagram of interrupt priority levels over time in process, device and system embodiments;

FIG. 33 is an electrical block diagram combined with a process or method flow chart diagram depicting VSP Kernel operations on audio applications;

FIG. 34 is a further diagram of interrupts over time in process, device and system embodiments having a bus master interrupt service routine (ISR) and a transmit ISR during a sound task involving PCI request processing;

FIG. 35 is a further diagram of interrupts over time in process, device and system embodiments having multiple bus master ISRs during a sound task involving a PCI request with multiple PCI transactions;

FIG. 36 is a memory space diagram of host memory program and data spaces (at left) and DSP on-chip and off-chip memories (at right) representing an example of a shared memory model utilized in embodiments of processes, devices and systems herein;

FIG. 37 is an electrical block diagram combined with a process or method flow chart diagram depicting VSP Kernel operations on audio applications, similar to FIG. 33 and showing a DirectSound task in more detail;

FIG. 38 is a DSP memory space diagram supplementing FIG. 36-right and showing DSP program, data and I/O spaces, including on-chip and off-chip memories and registers utilized in embodiments of processes, devices and systems herein;

FIG. 39 is a memory space diagram of host memory program and data spaces (at top) and DSP memory space (at bottom) representing an example of handles and data structures in the shared memory model of FIG. 36 utilized in FIG. 33 sound-related embodiments of processes, devices and systems herein;

FIG. 40 is a process or method flow diagram depicting multiple stereo audio task operations and mixing of sources having different data rates;

FIG. 41 is a memory space diagram showing improved coupling between Host spaces, PCI spaces, and DSP spaces in system embodiments;

FIG. 42 is a more detailed process or method flow diagram depicting audio mixing and the audio output buffers in the lower part of FIG. 40;

FIG. 43 is a real-time-flow diagram of four processes (PCI Bus Master ISR, DSP Message Handler, Audio Out Task, Mixer ISR) in the audio process of FIG. 33 in an example of single-tasking VSP kernel execution;

FIG. 44 is a flow chart diagram of an example of message processing, combined with a memory space diagram of host memory (at top) and DSP memory (at bottom) representing an example of handles, objects and data structures in the shared memory model of FIG. 36 utilized in FIG. 33 wave-sound and other embodiments of processes, devices and systems herein;

FIG. 45 is an electrical block diagram of a VSP (wrapper/DSP) embodiment having DSP bypass, and coupled for both modem and audio in a system embodiment operated according to a method embodiment;

FIG. 46 is a process diagram or method-of-operation diagram showing interrelated improved processes related to DirectDSP improved modem operation under Windows95, Windows 3.1, and DOS of various system embodiments;

FIG. 47 is an electrical block diagram of a printed circuit add-in card reduced essentially to physical layer elements, and connected to a DSP-enhanced computer motherboard according to methods herein for various system embodiments;

FIG. 48 is an electrical block diagram of a system embodiment having a VSP-based combined audio controller and modem according to methods herein;

FIG. 49 is an electrical block diagram of interconnections between a wrapper ASIC, a DSP and a stereo codec in a system embodiment;

FIG. 50 is a more detailed electrical block diagram of the system of FIG. 49 including a block diagram of circuitry in the wrapper ASIC;

FIG. 51 is an electrical block diagram overview of the system of FIG. 50 such as a DSVD system;

FIG. 51A is an electrical block diagram showing address and control lines interconnecting the wrapper ASIC, a DSP and two SRAM chips in a system embodiment such as in FIG. 50;

FIG. 51B is a waveform diagram of DSP clock, address, data, and output enable control signaling in a system embodiment such as in FIG. 50;

FIG. 52 is a simplified electrical block diagram emphasizing a dual port memory DPRAM operated in part as a ping-pong buffer in the wrapper ASIC with a wrapper voice codec interface in a system embodiment such as in FIG. 50;

FIG. 52A is a state transition diagram describing a process of operation of a voice codec DMA state machine (SM) interface in the wrapper ASIC of FIGS. 50 and 52;

FIG. 53 is an electrical block diagram of a circuitry embodiment coupling a wrapper ASIC DPRAM to a PCI macro, or block, and showing ASIC control registers read/writeable by DSP in a portion of the wrapper ASIC embodiment of FIG. 50;

FIG. 54 is an electrical block diagram of wrapper ASIC DPRAM split into four byte-parts and used to describe a process or method of byte steering, operating address counters, and operating byte strobes in the wrapper ASIC for stream I/O between a host CPU and host memory operating on 32-bit nonaligned data and a DSP operating on 16-bit word-aligned data;

FIG. 54A is a partially-schematic, partially real-time process flow diagram of an eight-byte read with byte alignment in an example using 3 PCI data phases in the process of FIG. 54;

FIG. 54B is a partially-schematic, partially real-time process flow diagram of a nine-byte read with byte alignment in an example using 3 PCI data phases in the process of FIG. 54;

FIG. 54C is a partially-schematic, partially real-time process flow diagram of a five-byte read with byte alignment and byte padding in an example using 2 PCI data phases in the process of FIG. 54;
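The data-phase counts recited for FIGS. 54A, 54B and 54C are consistent with a nonaligned byte read over a 32-bit bus, where each data phase carries one 4-byte doubleword and only some byte lanes are valid in the first and last phases. The sketch below is a minimal model of that counting, not the wrapper ASIC logic itself. The function name, the starting address, and the active-high lane mask are assumptions for illustration (PCI byte enables are active-low on the bus).

```python
def pci_data_phases(start_addr, nbytes, bus_bytes=4):
    """Return one (dword_address, lane_mask) pair per data phase.

    lane_mask has bit k set when byte lane k carries a requested byte
    (active-high here for readability; an assumption of this sketch).
    """
    phases = []
    addr = start_addr
    end = start_addr + nbytes
    while addr < end:
        base = addr - (addr % bus_bytes)  # doubleword containing addr
        mask = 0
        for lane in range(bus_bytes):
            if addr <= base + lane < end:
                mask |= 1 << lane
        phases.append((base, mask))
        addr = base + bus_bytes  # next doubleword
    return phases
```

With a hypothetical start address one byte past a doubleword boundary (0x1001), this model gives three data phases for an eight-byte read, three for a nine-byte read, and two for a five-byte read, matching the counts in the three figure descriptions.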

FIG. 54D is an electrical block diagram of the PCI configuration registers in PCI configuration space of FIG. 128, their address decodes and read or read/write circuits associated with those configuration registers in the wrapper ASIC of VSP;

FIG. 54E is an electrical block diagram of PCI I/O space registers in PCI I/O space of FIG. 128, their address decodes and write or read/write circuits associated with those I/O space registers in the wrapper ASIC of VSP;

FIG. 54F is an electrical block diagram of an address translation circuit and its method of operation in the wrapper ASIC to translate DSP 16-bit word-oriented addresses from the various DSP address spaces of FIG. 38 to a PCI address, wherein the selected DSP address (can be shifted left by one place to multiply by 2 if 0x57 bit 8 413 calls for word transfer) is then added to an address offset, whereupon a cache line (16 bytes from host main memory having the resultant PCI address as lowest address) is transferred to the location defined by the DSP address in the particular one of the various DSP address spaces;
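The translation described for FIG. 54F (optional left shift by one place, then addition of an address offset, then a 16-byte cache line starting at the resulting address) can be sketched as follows. The function names, the 16-bit mask on the DSP address, and the 32-bit wrap of the PCI address are assumptions of this sketch; the selection of the word-transfer mode by a register bit is represented only by a boolean argument.

```python
CACHE_LINE_BYTES = 16  # cache line size stated in the FIG. 54F description


def dsp_to_pci_address(dsp_addr, offset, word_transfer):
    """Translate a DSP word address to a PCI byte address (illustrative)."""
    addr = dsp_addr & 0xFFFF          # assumed 16-bit DSP address
    if word_transfer:
        addr <<= 1                    # shift left one place: multiply by 2
    return (addr + offset) & 0xFFFFFFFF  # assumed 32-bit PCI address


def cache_line_addresses(pci_addr):
    """Byte addresses of the cache line whose lowest address is pci_addr."""
    return range(pci_addr, pci_addr + CACHE_LINE_BYTES)
```

For example, with a hypothetical offset of 0x00200000, DSP address 0x0100 maps to 0x00200200 in word-transfer mode and to 0x00200100 otherwise.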

FIG. 54G is a state transition diagram for a PCI transaction state machine for coupling a TI TMS320C52 DSP for FIFO reads and writes from/to PCI bus according to a read sequence and/or write sequence detailed in the incorporated U.S. patent application TI-Ser. No. 08/823/25;

FIG. 54H is a state transition diagram for a PCI transaction state machine for wrapper ASIC of VSP;

FIG. 54I is an electrical block diagram of PCI host accessible registers starting at base address BA0 in PCI I/O space and replicated and starting at base address BA1 in PCI memory space of FIG. 128, (BA0, BA1 defined in PCI configuration register 0x10, 0x14), and FIG. 54I further indicates address offset decodes and read or read/write circuits associated with those PCI host accessible registers in the wrapper ASIC of VSP;

FIG. 54J is a state transition diagram of a process or method of operation of a stereo audio codec state machine in the wrapper ASIC of FIG. 50;

FIG. 54K is an electrical schematic diagram of a D-latch representing any bit of PCI interrupt register 0x04 illustrated thereabove, and associated control circuitry to controllably OR a given interrupt with the one/zero in the D-latch;

FIG. 54L is a state transition diagram of a process or method of operation of an EEPROM state machine in the wrapper ASIC of FIG. 50;

FIG. 54M is a timing diagram of a process or method of operation of the EEPROM state machine EESM in the wrapper ASIC of FIG. 50;

FIG. 55 is an electrical schematic diagram of a D-latch (upper right) representing any bit which is shared between DSP and the host as in PCI voice codec register 0x16, and associated control circuitry and methods of operation;

FIG. 55A is a timing or waveform diagram of a process or method of operation of each shared register bit in the wrapper ASIC of FIG. 50;

FIG. 56 is a state transition diagram of a process or method of operation of a state machine in the wrapper ASIC of FIG. 50;

FIG. 57 is a timing or waveform diagram of a process or method of operation of the memory arbitration MARB in the wrapper ASIC of FIG. 50;

FIG. 57A is a memory space diagram of host main DRAM memory showing memory allocation and pages locked during initialization in a shared memory model method and system embodiment;

FIG. 57B is a memory space diagram of host main DRAM memory showing memory allocation and pages scatter-locked in a shared memory model method and system embodiment for source/destination data DMA transfers;

FIG. 57C is a memory space diagram of host main DRAM memory showing memory allocation and regions locked in a shared memory model method and system embodiment for source DMA transfer table;

FIG. 57D is a memory space diagram of host main DRAM memory showing a page list structure in a shared memory model method and system embodiment for stream I/O processing;

FIG. 57E is a memory space diagram of host main DRAM memory showing memory allocation and regions locked in a shared memory model method and system embodiment for destination DMA transfer table;

FIG. 57F is a memory space diagram of host main DRAM memory showing a DSP message queue and a host message queue with host manipulated head and tail pointers on the left side, and DSP manipulated head and tail pointers on the right side;
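A message queue with separately manipulated head and tail pointers, as in the FIG. 57F description, is commonly arranged as a ring buffer in which one side advances only the tail and the other side advances only the head. The sketch below shows that general arrangement. The class name, the capacity, and the assignment of pointers to producer and consumer are assumptions for illustration and are not taken from the figure.

```python
class MessageQueue:
    """Single-producer, single-consumer ring buffer (illustrative only)."""

    def __init__(self, capacity):
        self.slots = [None] * capacity
        self.head = 0  # advanced only by the consumer (assumed)
        self.tail = 0  # advanced only by the producer (assumed)

    def put(self, msg):
        """Producer side: append a message, or return False if full."""
        nxt = (self.tail + 1) % len(self.slots)
        if nxt == self.head:
            return False  # full; one slot stays empty to tell full from empty
        self.slots[self.tail] = msg
        self.tail = nxt
        return True

    def get(self):
        """Consumer side: remove the oldest message, or return None if empty."""
        if self.head == self.tail:
            return None
        msg = self.slots[self.head]
        self.head = (self.head + 1) % len(self.slots)
        return msg
```

Because each pointer has a single writer in this arrangement, two such queues (one per direction) let a host and a DSP exchange messages through shared memory without both sides writing the same pointer.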

FIG. 58 is a state transition diagram of a DMA Write portion of DSP DMA SM state machine hardware and its process shown in FIGS. 61, 58 and 59 for the wrapper ASIC of FIG. 50;

FIG. 59 is a state transition diagram of a DMA Read portion of DSP DMA SM state machine hardware and its process shown in FIGS. 61, 58 and 59 for the wrapper ASIC of FIG. 50;

FIG. 60 is a waveform diagram illustrating timing and method for read to local off-DSP SRAM external to wrapper ASIC in FIG. 62;

FIG. 61 is a state transition diagram of an entry portion of a DSP DMA SM state machine hardware and its process shown in FIGS. 61, 58 and 59 for the wrapper ASIC of FIG. 50;

FIG. 61A is a state transition diagram of a portion of a DMA channel steering SM state machine hardware and its process for the wrapper ASIC of FIG. 50;

FIG. 62 is an electrical block diagram of circuit blocks and control lines in the wrapper ASIC of FIG. 50 coupling to DSP and SRAM;

FIG. 63 is a waveform diagram illustrating timing and method for write to local off-DSP SRAM external to wrapper ASIC in FIG. 62;

FIG. 64 is a block diagram of a DSP Interrupt Register 0x51;

FIG. 65 is an electrical schematic diagram of a D-latch representing any bit of DSP Interrupt Register 0x51 illustrated in FIG. 64, and associated control circuitry to controllably OR a given interrupt with the one/zero in the D-latch;

FIG. 66 is an electrical schematic diagram of a D-latch representing any bit of DSP I/O Registers 0x50, 0x52-0x6F (FIG. 38) in the wrapper ASIC, and associated control circuitry to supply DSP Data In to put a one/zero in the D-latch;

FIG. 67 is a diagram of wrapper ASIC DPRAM memory space for DSP bootload purposes, the memory space pointed to by an SRC address of FIG. 70;

FIG. 68 is an address space comparison diagram showing host data in host address space, and corresponding data in DSP address space in a method embodiment;

FIG. 68A is an electrical block diagram of circuitry and method for DSP read of wrapper ASIC DPRAM via I/O space for C54x bootload, for instance;

FIG. 69 is an electrical schematic diagram of a circuitry and method embodiment for producing a READY signal for wrapper ASIC DPRAM read operations;

FIG. 70 is an electrical block diagram of a register used in the ASIC wrapper for DSP bootload purposes, and having an address SRC pointing to the data structure of FIG. 67, and the register also having a code for EPROM mode;

FIG. 71 is a waveform diagram illustrating a method of operating the DSP and circuitry of FIGS. 72-1 and 72-2 to interface a DSP to the wrapper ASIC DPRAM;

FIGS. 72-1 and 72-2 are both halves of an electrical schematic diagram of a zero-wait-state read interface circuit and method embodiment coupled between wrapper ASIC DPRAM and a DSP;

FIG. 73 is an electrical schematic diagram showing the SDA, SDL pin interface of wrapper ASIC to EEPROM;

FIG. 74 is an electrical block diagram showing how DSP registers, voice codec state machine, and interrupt generation logic have transmit/receive ping/pong lines connected in wrapper ASIC shared registers 0x16, 0x18, 0x5C, 0x5D;

FIGS. 75A and 75B are both halves of a pinout diagram for the VSP wrapper ASIC;

FIG. 76 is a process diagram or method-of-operation diagram showing interrelated processes in a Windows95 display driver interface for unified signal processing improvements herein;

FIG. 77 is a process diagram or method-of-operation diagram showing interrelated advanced graphics port (AGP) processes for unified signal processing improvements herein;

FIG. 78 is a process diagram or method-of-operation diagram showing interrelated DirectX processes, HAL display driver interfaces and hardware for unified signal processing improvements herein;

FIG. 79 is a process diagram or method-of-operation diagram more specifically showing interrelated processes in a 3D graphics process architecture and interface for unified signal processing improvements herein;

FIG. 80 is a process diagram or method-of-operation diagram more specifically showing interrelated processes in a DirectDraw driver interface for unified signal processing improvements herein;

FIG. 81 is a process diagram or method-of-operation diagram showing interrelated 16-bit and 32-bit processes in a DirectDraw driver interface for unified signal processing improvements herein;

FIG. 82 is an electrical block diagram of components and architecture of an improved USB universal serial bus-connected system embodiment improved by unified signal processing herein;

FIG. 83 is an electrical block diagram of a system embodiment with improved VSP south bridge and VSP integrated circuits interconnected by a serial bus as well as PCI bus;

FIG. 84 is an electrical block diagram of components and architecture of an improved real-time private bus-connected VSP-graphics/video chip and VSP-comm-audio-cardbus chip in a system embodiment improved by unified signal processing herein;

FIG. 85 is an electrical block diagram of components and architecture of an improved real-time private bus-connected graphics/video chip and VSP-comm-audio-cardbus in a further improved multimedia system embodiment improved by unified signal processing herein;

FIG. 86 is a process diagram or method-of-operation diagram showing interrelated improved processes in a USB serial bus-based system improved with unified signal processing;

FIG. 87 is a process diagram or method-of-operation diagram showing interrelated improved processes in a WDM accelerator with digital audio and embedded VSP serial bus hub with unified signal processing herein;

FIG. 88 is an electrical block diagram and/or method-of-operation diagram showing interrelated blocks and processes for coupling VSP to USB serial bus in system embodiments improved with unified signal processing herein;

FIG. 89 is an electrical block diagram and/or method-of-operation diagram showing interrelated blocks and processes for a serial bus hub in system embodiments such as in FIG. 82 improved with unified signal processing herein;

FIG. 90 is a process diagram or method-of-operation diagram showing interrelated improved processes in a DVD digital video disk for unified signal processing improvements herein;

FIG. 91 is a process diagram or method-of-operation diagram showing interrelated improved processes in sound-related driver and HAL interface technology using unified signal processing ActiveDSP, DirectDSP and VSP herein;

FIG. 92 is a process diagram or method-of-operation diagram emphasizing interrelated improved processes in an ActiveDSP level of FIG. 91 in system embodiments;

FIG. 93 is a process diagram or method-of-operation diagram emphasizing data streaming aspects of interrelated improved processes in an ActiveDSP level of FIG. 92 in system embodiments;

FIG. 94 is a process diagram or method-of-operation diagram emphasizing a shared memory model coupling interrelated improved processes of DirectDSP HAL and DSP Kernel in system embodiments;

FIG. 95 is a process diagram or method-of-operation diagram emphasizing DSP task object structure in the shared memory model of FIG. 94 in system embodiments;

FIG. 96 is a process diagram or method-of-operation diagram showing interrelated improved processes at ring 3 and ring 0 levels of privilege in sound-related driver processes using unified signal processing improvements herein;

FIG. 97 is a process diagram or method-of-operation diagram showing interrelated improved processes at ring 3 and ring 0 levels of privilege in sound-related driver processes using unified signal processing improvements herein, and showing a different way of handling kernel mode clients compared to FIG. 96;

FIG. 98 is a process diagram or method-of-operation diagram showing interrelated improved processes in MIDI multimedia driver interface using unified signal processing improvements herein;

FIG. 99 is another process diagram or method-of-operation diagram showing interrelated improved processes in MIDI multimedia driver interface with wave tables using unified signal processing improvements herein;

FIG. 100 is a process diagram or method-of-operation diagram showing interrelated improved processes in a WDM (32-bit Windows Driver Model) for data streaming using unified signal processing improvements herein;

FIG. 101 is an electrical block diagram and/or method-of-operation diagram showing a 2-channel MPEG audio decoder to run on VSP and have other unified signal processing improvements herein;

FIG. 102 is a process diagram or method-of-operation diagram showing interrelated processes and virtual sound blaster SB and a 16-bit and 32-bit WDM DirectSound multimedia (MM system) installable driver environment for unified signal processing improvements herein;

FIG. 103 is an electrical block diagram and/or process diagram showing combined audio and modem functions in a VSP system embodiment;

FIG. 104 is a process diagram or method-of-operation diagram showing interrelated processes and structures in a telephony driver (TAPI telephony API) and wave driver architecture for unified signal processing improvements herein;

FIG. 105 is a process diagram or method-of-operation diagram emphasizing (compared to FIG. 104) interrelated processes for interfaces to telephone line, NDIS WAN (network driver interface specification, wide area network), and serial buses in kernel mode for unified signal processing improvements herein;

FIG. 106 is a process diagram or method-of-operation diagram showing interrelated processes in a Windows95 virtual communications driver model for unified signal processing improvements herein;

FIG. 107 is a process diagram or method-of-operation diagram showing interrelated processes in Windows95 voice-line communications for unified signal processing improvements herein;

FIG. 108 is a process diagram or method-of-operation diagram showing interrelated processes in a Windows95 RAS (remote access service) and PPP (point-to-point protocol internet dialup) for unified signal processing improvements herein;

FIG. 109 is a process diagram or method-of-operation diagram showing interrelated improved processes in a Windows95 unimodem and driver interface for unified signal processing improvements herein;

FIG. 110 is a process diagram or method-of-operation diagram showing interrelated improved data flow-processes in a combined Windows95 unimodem, telephony, wave driver and pumpless modem model for unified signal processing herein;

FIG. 111 is a pictorial diagram of a VSP add-in card or printed wiring board with wrapper ASIC, DSP (C54x), two SRAMs, MAFE (modem analog front end) and connector jacks;

FIG. 111A is another pictorial diagram of a VSP add-in card or printed wiring board with wrapper ASIC, DSP (C54x), SRAMs, codecs, daughter card and connectors;

FIG. 111B is a detail diagram of a card connector for the VSP add-in card of FIG. 111A;

FIG. 112 is a process diagram or method-of-operation diagram including state transitions in a Windows95 Unimodem V interface for unified signal processing improvements herein;

FIG. 113 is a simplified process diagram or method-of-operation diagram showing interrelated improved processes for data and voice for unified signal processing improvements herein;

FIG. 114 is a process diagram or method-of-operation diagram showing interrelated improved processes in a PPP NDIS driver for unified signal processing improvements herein;

FIG. 115 is a process diagram or method-of-operation diagram showing interrelated improved processes for telephony and networking (including ISDN integrated services digital network, and xDSL digital subscriber line) in a driver interface using unified signal processing, with PPP NDIS driver shown in FIG. 114;

FIG. 116 is a process diagram or method-of-operation diagram summarizing interrelated improved TAPI, PPP and NDIS WAN processes for unified signal processing improvements herein;

FIG. 117 is an electrical block diagram and/or process diagram showing RAS client and RAS server coupled by DSL WAN for unified signal processing improvements herein;

FIG. 118 is a process diagram or method-of-operation diagram showing interrelated improved processes in MDSL WAN system for unified signal processing improvements herein;

FIG. 119 is a process diagram or method-of-operation diagram showing one process embodiment for dynamic balancing of a system embodiment herein;

FIG. 120 is a process flow diagram or method-of-operation diagram showing linking of a granule and launching of a software application according to improvements herein;

FIG. 121 is a process diagram or method-of-operation diagram showing improved operations loading a Host and/or loading a VSP subsequent to FIG. 120 operations;

FIG. 122 is a process diagram or method-of-operation diagram showing interrelated improved processes wherein multiple VSPs are coupled to and supply VSP MIPS-load information for the improved DirectDSP process to do unified signal processing;

FIG. 123 is a process diagram or method-of-operation diagram showing interrelated improved processes wherein multiple VSPs are coupled to improved DirectDSP process to do unified signal processing involving task allocation to the multiple VSPs;

FIG. 124 is a process diagram or method-of-operation diagram showing an improved process for speed scaling of VSP by host using unified signal processing improvements herein;

FIG. 125 is another process diagram or method-of-operation diagram emphasizing improved process coordination with DirectX showing improved operations loading a Host and/or loading a VSP subsequent to FIG. 120 operations;

FIG. 126 is an electrical block and/or process diagram showing a VSP-improved north bridge coupled to VSP bus, to Host CPU, to Main Memory, to AGP port and AGP chip, and to PCI bus with PCI agent(s) thereon in system embodiments; and

FIG. 127 is another electrical block and/or process diagram emphasizing data paths in a VSP-improved north bridge coupled to VSP bus, to host CPU, to Main Memory, to AGP port and AGP chip, and to PCI bus with PCI agent(s) thereon in system embodiments.

Corresponding numerals and symbols in the different figures refer to corresponding parts unless otherwise indicated.

Signal Processing in a Multimedia PC

Given an optimal way to deploy a "pool" of MIPS available in a computer system at any given time, a dynamically balanced system as described herein distributes and/or re-allocates its collective computational resources to satisfy a broad range of functional requirements on-the-fly. By comparison, a statically balanced system fails to perform some combinations of tasks even though there may be large "pools" of unused trapped MIPS in particular chip(s) in the system. This is actually a not uncommon occurrence. With a dynamic balance, computational resources within the system are linked at run-time and allocated by the operating system, providing a much greater flexibility for resource scheduling.

Scalability impacts balance herein. Scalability suggests that applications or media processing tasks adapt to instantaneous or long term change in the availability of system computational resources. Different types of functions or applications respond differently to upward and downward scaling.

Upward scaling is generally a positive phenomenon, though not all functions can take advantage of it. Either by upgrading the CPU, or accelerating a CPU-bound function, additional MIPS become available to the system. Performance down-scaling occurs when host MIPS are consumed by an increasing number of concurrently running tasks. Some functions handle downward scaling gracefully, while others catastrophically fail.

Down scaled performance is an annoyance in recalculating a spreadsheet. But for decoding a movie, using Internet telephony, or tele-gaming, downward scaling means losing real world data and compromising quality of service and accuracy. When real-time media streaming functions lack enough MIPS to run, catastrophic failure results.

A statically balanced system does not prevent non-scalable real time functions from failing and scalable operations do not scale upward even though unused MIPS exist in the system.
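The contrast between static and dynamic balance can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration only: the chip names, the MIPS figures, the task list, and the first-fit placement rule are hypothetical and are not taken from the embodiments, and the sketch ignores the fact that real tasks cannot run on every processor.

```python
# Hypothetical model: each chip offers a pool of MIPS, each task needs some MIPS.

def static_schedule(tasks, chips):
    """Static balance: every task is pinned to one chip and fails if that chip is full."""
    free = dict(chips)
    failed = []
    for name, mips, pinned_chip in tasks:
        if free[pinned_chip] >= mips:
            free[pinned_chip] -= mips
        else:
            failed.append(name)
    return failed, free


def dynamic_schedule(tasks, chips):
    """Dynamic balance: a task may use any chip with enough free MIPS (first fit)."""
    free = dict(chips)
    failed = []
    for name, mips, _pinned_chip in tasks:
        for chip in free:
            if free[chip] >= mips:
                free[chip] -= mips
                break
        else:
            failed.append(name)
    return failed, free


# Assumed figures, for illustration only.
chips = {"host": 200, "vsp": 100}
tasks = [
    ("video_decode", 150, "host"),
    ("modem", 50, "host"),
    ("ac3_audio", 30, "host"),
]

static_failed, static_free = static_schedule(tasks, chips)
dynamic_failed, dynamic_free = dynamic_schedule(tasks, chips)

print("static :", static_failed, static_free)
print("dynamic:", dynamic_failed, dynamic_free)
```

In the static case the audio task fails while 100 MIPS sit unused ("trapped") on the second chip; in the dynamic case the same task is placed there and nothing fails.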

The kernel of the evolving Windows operating system (OS) and device driver models and the Application Programming Interface (API) for multimedia peripherals and data types is embodied in DirectX, ActiveX and WDM. The OS is herein improved for balance and scalability by coordinating abstraction, virtualization and emulation.

Windows OS is device-independent. A variety of differentiated modular fixed-function physical hardware peripheral devices are adapted to Windows through abstraction via a thin layer of Hardware Abstraction Layer (HAL) software (also called device drivers) in Windows. Through abstraction, the OS and application need not care what brand of graphics accelerator, audio chip, modem or printer are resident in the system.

Once the system peripherals have been abstracted in software, the basic hardware peripherals in the system are virtualized for advanced multitasking. Some software utilized for virtualization herein is located in the core of the Windows OS--the Windows Virtual Machine Manager (VMM) and Virtual Device Drivers (VxD).

The Windows OS software creates a separate software instantiation (or abstraction) of a complete system, or Virtual Machine (VM), for each application running concurrently. Each application uses and changes the state of its own virtual machine (virtual peripherals, virtual memory, etc.) independently of other tasks.

Abstraction provides the OS with device independence, and device emulation delivers hardware independence. Windows APIs establish uniform program access to acceleration hardware, while host emulation allows the API to operate correctly even if acceleration hardware is absent.

Peripheral hardware emulation relies on CPU computational resources rather than fixed function resources. A powerful host CPU within the system, running the appropriate code, is functionally indistinguishable from a fixed function peripheral. Within the limits of the CPU computational resources, emulated functions are synthesized, suspended or eliminated at will.

When an emulated peripheral function is no longer required, it desirably ceases to consume host MIPS, while fixed function MIPS cannot be re-allocated.

Although host emulation is useful, flexibility is constrained, and the host CPU may stall due to system imbalance when the virtualization and emulation capabilities of the OS can only be directed to the host CPU.

A system which uses the host exclusively for emulation is not balanced. As each emulation task robs performance from the applications and OS which spawned them, host emulation of one or more complex media processing functions can quickly bring the system to its knees. Since device emulation code is mutually exclusive or non-concurrent with the execution of application or operating system code, host emulation forces downward scaling of all other active applications or functions.

A multimedia extension MMX single instruction multiple data (SIMD) unit inside the CPU can accelerate host emulation of some of the more real-time applications such as video and to some extent parallel pixel operations, using x86 emulation code ported to MMX code. However, issues include inefficient physical partitioning, integration, and concurrency of highly specialized processing elements. Since MMX is on-host and on-chip it competes directly with other x86 processing units for system resources.

In some of the embodiments herein called Unified Signal Processing (USP), the Windows OS is improved for OS directed device emulation, dynamic control, reconfiguration and allocation of system resources. Host emulation is augmented by distributed and asymmetrical device emulation acceleration. (Asymmetrical devices have different instruction sets or architecture.) Balanced system resources prevent or alleviate bus (CPU, memory and I/O) overloading, memory and I/O bottlenecks, and CPU stalls. By properly distributing computational resources in the system, device emulation tasks are directed by the OS to run on any appropriate processing elements to achieve balance.

In some improved system embodiments, the OS controls multiple modular, stackable, concurrent computational resources (processors or hardware accelerators), and the improved system supports a wider variety of multimedia device emulation tasks. Modularity adds processing MIPS or elements, and the improved system gracefully orchestrates their operation with the host CPU/MMX for audio, video, graphics, communication and other functions. These modular and distributed processing elements in the improved system can better control latency for real-time events.

VSP Hardware

VSP or VSP interface is a logic wrapper around a digital signal processor (DSP) core, that interfaces the DSP with the PC via the PCI/AGP Bus or PC system core logic.

Backend interface logic enables VSP to become an intelligent hub or bridge to universal serial bus (USB) and IEEE 1394.

Host-independent PCI I/F allows VSP to be integrated with other system functions or reside on an add-in card (PCI or PC card).

Advanced CPU Architecture

Advanced CPU architecture with multiple Processing Elements (PEs).

The main PE is the x86 CISC core. Other PEs are implemented as VSPs.

VSP1 is the MMX core (as in the Pentium and Pentium Pro designs). VSP2 is a very long instruction word (VLIW) core and VSP3 is a RISC core etc.

Coprocessor bus couples to VSPs.

Superscalar extension with VSP(s) on the coprocessor bus.

Shared memory architecture with Distributed AMP and out-of-order execution on memory transaction boundary.

All processors and bus suitably fabbed on single chip.

ARM With VSP Coprocessors

Coprocessor bus couples with VSPs.

Superscalar extension with VSP(s) on the coprocessor bus.

Shared memory architecture.

Distributed AMP.

VSP uses C54x Core and follow-on DSPs.

OS independent: Java or Windows CE.

All processors ARM+VSP suitably fabbed on single chip.

Implementation 1: Add-in card

The USP architecture suitably utilizes any bus interface. USP with a PCI interface is easily implemented as an external PCI adapter card or cardbus PC card. Functional integration with a PCI graphics video controller, card bus, IEEE 1394, communications (com) and/or audio controller is possible.

Implementation 2: Core logic integration (motherboard or planar)

The USP architecture integrates a VSP(s) into the PC such that a VSP is embedded into the north bridge, south bridge and super I/O core logic. Functional integration with a 3D graphics/video controller, comm block and/or cardbus controller is feasible too.

Implementation 3: CPU integration (motherboard or planar)

Like MMX, VSP(s) are integrated on-chip, e.g., a P7 with a VLIW VSP block.

Implementation 4: External to PC box

VSPs are suitably provided on IEEE 1394 link layers, USB hubs, xDSL (digital subscriber line) modems and Internet/Intranet.

USP is cost effective by intelligently distributing processing requirements between the host and VSP. Various USP improvements avoid overhead associated with a standalone DSP system and its inherently inefficient host-to-DSP (and vice versa) communication. Therefore, under the USP architecture, new media applications are performed more efficiently with fewer MIPS and less memory. This ultimately translates to lower system costs. This efficiency results from applying the most optimal processing architecture for various tasks of a new media application, and intelligently offloading the host to optimally use the host and VSP resources. In addition, VSP accesses host resources (e.g. virtual memory) while intelligent memory schemes are employed to address system cost. The VSP hardware is part of the host resources and can be directly integrated with I/O and pad-bound system core-logic for cost reduction.

USP provides full time functionality integrated to the PC architecture. USP buys back host MIPS where the host is in high demand, and provides reusability by helping with host functions when not processing multimedia tasks. USP permits true bi-directional scalability of system hardware (in either the host or DSP direction) when an application opens. The system can be rescaled when the application closes, whereby USP truly enables virtual hardware. Expandability through a distributed rescalable architecture with asymmetrical multi-processing leads to embodiments with multiple VSPs on multiple buses (PCI, AGP, IEEE 1394, USB etc.) or integrated with system core-logic to multi-process on task execution. USP's COM-based S/W allows gradual porting of baseline host code to VSP code such that complicated DSP algorithms may be developed in C and piecemeal ported to VSP code as DSP COM objects or threads.

SUMMARY OF SOME EMBODIMENTS

A conventional x86 PC having a bursty bus such as PCI has multimedia performance improved by adding application specific integrated circuit (ASIC) "wrapper" circuitry to smooth out the data transfers into a desired stream-like flow of multimedia data. The data transfers are from host (system) memory to ASIC "wrapper" buffer memory for VSP consumption and vice versa.

The smoothing-out function is accomplished by "wrapper" byte-channeling logic as follows. Dword (4 bytes) data transfers take place in bursts on the PCI bus. In multimedia data the first byte may be anywhere in the Dword (i.e. one out of 4 possible locations). From the address of the first byte in host memory and the "wrapper" memory address for storing the first byte, the shift factor (represented by two control bits) for mapping host bytes correctly into 16-bit VSP word format can be determined. The control bits along with the length of the transfer (in bytes or words) are used to perform data shifts according to the shift factor (implemented with data multiplexers) for unpacking the host Dwords into 16-bit VSP word format. In this way the VSP enjoys a transparent 32-bit to 16-bit data format conversion with the correct starting byte. This saves about 7 VSP instructions (minimum of 7 clocks with no wait states) per byte transfer and saves even more host clocks.
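The byte-channeling idea can be modeled in software as a rough sketch. The following is not the wrapper's multiplexer logic; it is a hypothetical functional model that assumes little-endian byte order, takes the two-bit shift factor as the difference of the two byte addresses modulo 4, assumes a Dword-aligned destination in wrapper memory (so the shift factor equals the starting byte lane), and zero-pads an odd-length transfer. The function names are illustrative only.

```python
def shift_factor(host_byte_addr, wrapper_byte_addr):
    """Two control bits: relative byte offset between source and destination (assumed rule)."""
    return (host_byte_addr - wrapper_byte_addr) & 0x3


def unpack_dwords(dwords, shift, length_bytes):
    """Unpack a burst of 32-bit Dwords into 16-bit words, starting at byte lane `shift`.

    Assumes little-endian Dwords and a Dword-aligned wrapper destination.
    """
    raw = b"".join(d.to_bytes(4, "little") for d in dwords)
    payload = raw[shift:shift + length_bytes]
    if len(payload) < length_bytes:
        raise ValueError("burst does not contain enough bytes for the transfer")
    if len(payload) % 2:
        payload += b"\x00"  # assumed: pad a trailing odd byte with zero
    return [int.from_bytes(payload[i:i + 2], "little")
            for i in range(0, len(payload), 2)]


# Example: the first useful byte sits in the last byte lane of the first Dword.
shift = shift_factor(0x1003, 0x0200)
words = unpack_dwords([0x44332211, 0x88776655], shift, 4)
print(shift, [hex(w) for w in words])
```

With these inputs the burst holds the bytes 11 22 33 44 55 66 77 88, the transfer starts at byte 44, and the model delivers the 16-bit words 0x5544 and 0x7766. The DSP is thereby spared the shifting and masking that the text counts at about 7 instructions per byte.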

Associated with the ASIC wrapper circuitry is a DSP which adds substantial computing power to the system, especially because the DSP is already architected for modem, voice, audio, and imaging/video processing. This VSP is the wrapper/DSP combination and this ASIC wrapper is known as the VSP wrapper ASIC. A VSP to be used as a graphic accelerator does not need a different wrapper ASIC circuit architecture compared to a modem/audio VSP except insofar as some fine tuning of memory size may be desired. A frame buffer is provided external to the wrapper either separate from or unified with host system memory. On the other hand, additional features can be added to an existing VSP wrapper to enhance its functionality to take advantage of unique system configurations/component features.

Legacy architecture and IEEE 1394 peripherals can require the PCI bus to carry video data. Where an IEEE 1394 camera is used for image/video capturing and the output of the camera is to be stored in the PC system, the VSP can first perform image/video data compression to prevent undue PCI bus congestion, then bus-master the data across the PCI bus to host memory, further relieving the host of the I/O chore. Conversely in a video/image playback function, the VSP can bus-master compressed MPEG/JPEG data from the host memory across the PCI bus to avoid congestion of the PCI bus. The VSP can then decompress the MPEG/JPEG data and pass the video/image data via a zoom video private bus directly to the frame buffer of the graphics/video adapter without congesting the PCI bus unduly.

The VSP interleaves processing with bursting of data and overcomes the PCI bus latency issue. A PCI agent may have to wait, for example, 2 microseconds on average because of PCI bus latency due to other PCI agents using the bus. The VSP can be advantageously processing data in this time interval while dovetailing or interleaving its processing with the PCI operations. This is not mere buffering because DSP processing is transforming data to useful outputs during the latency period.
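A simple timing model, offered only as a sketch, shows why interleaving helps. It assumes a fixed bus wait per data block and a fixed processing time per block; the 2 microsecond wait is the average figure quoted above, while the block count and the 3 microsecond processing time are hypothetical.

```python
def serial_time(blocks, bus_wait_us, process_us):
    """Wait for each block, then process it, with no overlap."""
    return blocks * (bus_wait_us + process_us)


def interleaved_time(blocks, bus_wait_us, process_us):
    """Process block i while waiting on the bus for block i+1.

    First wait is exposed, the last processing step is exposed, and each of the
    remaining (blocks - 1) stages costs the longer of the two overlapped activities.
    """
    if blocks < 1:
        return 0.0
    return bus_wait_us + (blocks - 1) * max(bus_wait_us, process_us) + process_us


blocks = 4            # hypothetical number of data blocks
bus_wait_us = 2.0     # average PCI wait quoted in the text
process_us = 3.0      # hypothetical DSP processing time per block

print(serial_time(blocks, bus_wait_us, process_us))
print(interleaved_time(blocks, bus_wait_us, process_us))
```

Under these assumed numbers the non-overlapped sequence takes 20 microseconds and the interleaved one 14 microseconds, because the bus wait is hidden behind useful DSP work rather than spent idle.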

In an architecture where no video is carried on the PCI bus, a VSP used as a graphic accelerator is still important because it is then advantageously provided either at the North Bridge or AGP graphics/video chip location so that advantageous MIPS are provided without substantially loading the PCI bus. For instance, DSP MIPS can be advantageously allocated to texture map decompression at either end of AGP. There need not be limitations on amount of texture stored in main memory, as suffered hitherto.

The VSP wrapper does not constitute a new bottleneck because the data conveyed to it from the PCI bus will generally be in some compressed form requiring DSP processing such that the wrapper is conveying data through a smaller bandwidth across the PCI bus thereby alleviating bus congestion. After processing, this data will then be passed out the back end with higher bandwidth. Advantageously, the VSP works in compressed data space. The VSP is situated in a place where no bottleneck is introduced because the VSP is located where the video, audio, or serial output is situated. By contrast, the host may be located too far away from the I/O peripherals and on the wrong side of the PCI bus to solve bottleneck problems that the VSP advantageously solves.

At first glance, it might appear that VSP modem/audio processing might relieve the host of only an inconsequential 0.5 Mbyte/s (48 kHz AC-3 × 6 channels × 2 bytes/channel) I/O function over the 32/64-bit 33/66 MHz PCI bus where the host can easily do the I/O processing. Actually, however, every application has compute, memory and I/O requirements. The memory and I/O bandwidth issues are indeed somewhat secondary in audio and modem. The burden is mostly in the compute area, especially in new media applications such as softmodem, AC-3 and 3D positional audio. Pentium needs 50 MHz for soft-modem and 20-30 MHz for AC-3, for example. While accessing video/audio files, opening zip files, and diverting modem data to LAN may not be extremely compute intensive, making/sustaining a modem connection, performing data pump code, computing head-related transfer functions and 3D positioning are all highly compute intensive. In the worst case, the video freezes up when the system is overloaded. And the memory and I/O requirements are not trivial. The host has to be fed with PCI bus raw audio data traffic and intermediate memory accesses (64-bit with padding to boot) before it can do the computing. Since these new media applications often entail non-cacheable data, the host L1 and L2 caches will frequently be thrashed, which is not an optimal way of using caches. This is simply an inefficient use of host MIPS when the VSP has specialized multimedia instructions and is better situated architecturally to handle the applications. The host as an expensive, centralized single chip simply cannot be distributed over outlying computing locations in the PC system architecture that a far more inexpensive VSP(s) can advantageously service at the I/O locations. Simply increasing host CPU computing power in successive generations only exacerbates the bottleneck problems to the point of stalling the host CPU, unless these bottlenecks are relieved by the appropriate VSP(s).
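The figures quoted in the preceding paragraph can be checked with a few lines of arithmetic. This is only a worked example using the numbers stated above; the function name is illustrative.

```python
def audio_rate_bytes_per_s(sample_rate_hz, channels, bytes_per_sample):
    """Raw multichannel audio data rate in bytes per second."""
    return sample_rate_hz * channels * bytes_per_sample


# 48 kHz, 6 channels, 2 bytes per channel sample, as quoted in the text.
rate = audio_rate_bytes_per_s(48_000, 6, 2)
print(rate, "bytes/s =", rate / 1e6, "Mbyte/s")

# Host clock budget quoted for a Pentium: 50 MHz soft-modem plus 20-30 MHz AC-3.
host_mhz_low = 50 + 20
host_mhz_high = 50 + 30
print(host_mhz_low, "to", host_mhz_high, "MHz of host clock")
```

The audio stream works out to 576,000 bytes/s, consistent with the roughly 0.5 Mbyte/s stated. The compute side, at 70 to 80 MHz of host clock for the two functions together, is the larger burden, which is the point the paragraph makes.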

The VSP wrapper is not redundant to the audio, MIDI or graphics interface because it replaces and permits virtualization of major hardware elements that have to be purchased today. The VSP wrapper (and even the VSP as a whole i.e. wrapper/DSP) offers modular circuitry available to integrate essentially for free on the spare die (or spare gates) real estate that hitherto has existed in the I/O bound and bond-pad-limited North Bridge and South Bridge chips.

VSPs provide plenty of DSP MIPS to differentiate new designs from those based only on the main microprocessor i.e. host CPU. For example, a 233 MHz Klamath processor with 2 instructions/cycle may offer 400-500 host MIPS and can do 30 frames/sec DVD decoding (AC-3 audio and MPEG-2 video) entirely in software. Hardware assists for Klamath (and other host CPUs) at I/O locations are, however, needed. The VSP approach not only provides these hardware assists but also leverages DSP MIPS to do more than the same number of host MIPS can do. This leveraging can be measured in raw MIPS, effective MIPS, and bandwidth reduction.

The DSP MIPS permit compressed data to travel on PCI bus, advantageously preventing congestion thereon and consequent host processor stalls. A TI DSP such as one of the TMS320C5x family provides up to 100 MIPS and future members of the C54x DSP family can go up to 500 MIPS. A C6x DSP provides up to 1600 MIPS. Even though any benchmark is a debatable comparison, the DSP computing power is clearly comparable to (if not greater than) host computing power for specific multimedia functions. No fixed CPU architecture is perfect for every application, and therefore the ability to optimally allocate MIPS over the host CPU and various VSPs in the proposed dynamic or transformable USP architecture helps it approach perfection more closely for a wider range of applications than existing architectures. The VSP approach further augments a general purpose DSP chip or core with the VSP wrapper ASIC circuitry for streamlined data operations.

MMX involves misalignment and data padding operation problems and lacks circular addressing and other DSP addressing modes and instruction features. While a VSP can enjoy mere Kbytes in program space with 16-bit instructions, the host may require megabytes in its program space with MMX variable length instructions. Therefore, code size compression in VSP objects is another advantage. The VSP alleviates congestion in memory accesses as well as on the PCI bus. Thus, a key advantage of the new architecture relates to the bandwidth problems that have hitherto affected new media applications. The host processor aggravates the problems before trying to solve them. The VSP of the present proposal alleviates the problems while the best features of host processor performance remain.

The amount of local VSP SRAM memory needed to run a whole application is about half a megabit, and in many cases much less especially when only granules (software objects) of the application are run on the VSP. A VSP with minimal amount of on-chip memory may have to be augmented with external local SRAM memory which occupies an acceptable amount of printed circuit board real estate because the VSP circuitry replaces modem and audio cards of today. Also, the VSP chip can be designed to have adequate SRAM on-chip thereby obviating the need of external local SRAM memory.

A common data structure is used for each respective host software object and the corresponding VSP software object. At times, PCI bus traffic of not only VSP code but also large amounts of data can occur between the VSP and the host system memory. This PCI bus traffic is quite acceptable because it is bursty due to VSP data processing or interleaving, and because the VSP can spread out the transaction over time, thereby reducing bus bandwidth demanded by the VSP. PCI bandwidth is ample: maximum is 66 MHz × 8 bytes = 528 Mbytes/sec. Moreover, in the proposed USP architecture, the data passing over PCI is compressed and not already inefficiently decompressed by host processing. VSP instruction code size is minuscule compared to host code size. The whole premise of today's high-performance host CPU is to have host extract data from memory for decompression by the host CPU. But then the host CPU has to send the decompressed data over the PCI bus to the peripherals precisely because PCI is the mezzanine bus. Therefore, for host to decompress data and send decompressed data over the PCI bus is a much greater burden than for compressed data to be sent to the VSP wherein it is decompressed and sent without PCI burden to the I/O ports.
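The peak PCI figure quoted above is simply clock rate times bus width. The following is a minimal sketch of that arithmetic in Python. The function names are illustrative, and the stream rate and 8:1 compression ratio are hypothetical inputs chosen only to show how sending compressed data reduces the share of the bus consumed; they are not figures from this disclosure.

```python
def peak_bandwidth_mbytes(clock_mhz, bus_width_bytes):
    """Peak rate in Mbytes/sec, assuming one full-width transfer per clock."""
    return clock_mhz * bus_width_bytes


def bus_fraction(stream_mbytes, peak_mbytes):
    """Fraction of the peak bus bandwidth consumed by one data stream."""
    return stream_mbytes / peak_mbytes


# Figures from the text: 66 MHz clock, 8-byte (64-bit) data path.
peak = peak_bandwidth_mbytes(66, 8)

# Hypothetical stream: 40 Mbytes/sec once decompressed, 8:1 compression.
decompressed = 40.0
compressed = decompressed / 8

print("peak Mbytes/sec:", peak)
print("bus share, decompressed:", bus_fraction(decompressed, peak))
print("bus share, compressed:", bus_fraction(compressed, peak))
```

Under these assumed inputs the compressed stream occupies one eighth of the bus share of the decompressed stream, which is the effect the paragraph describes.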

Multitasking operating systems such as Windows 95 and NT have multithreading capabilities on which the improvements piggyback. The operating system (OS) runs exclusively on the host, and not on the VSP. The OS is augmented with a DirectDSP API (application program interface) analogous to DirectX APIs under Windows through which applications can call VSP functionality. Further, the OS is endowed with a DirectDSP HAL (hardware abstraction layer) which interfaces to the DirectDSP software layer. To the system software is added software called a VSP Kernel which runs exclusively on the VSP and provides the software interface of the VSP to the DirectDSP HAL, DirectDSP software layer and ultimately the OS and the calling application.

Time-slicing operating system code prevents an application from monopolizing the host by allotting runtime for the application in time slices thereby allowing other applications to be time-division-multiplexed. A preemptive multitasking OS further introduces a priority scheme to allow preemption of one task by another of a high priority. The improved USP system software granulates, or breaks up, applications into software objects called granules. Time-slicing and granulation do not conflict or introduce complications in each other's presence. Time-slicing and prioritization are ways used for scheduling in Windows. Time-slicing comes below prioritization in the scheduling scheme. A granule can simply be a software thread scheduled and run under the Windows regime.
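The ordering described above, prioritization first and time-slicing beneath it, can be sketched as follows. This is a simplified static model in Python and not the Windows scheduler: all threads are assumed ready at the start, no new work arrives, and the names `run_schedule` and `slice_units` are hypothetical.

```python
from collections import deque


def run_schedule(threads, slice_units=1):
    """Return the order in which time slices are granted.

    threads: list of (name, priority, work_units); a higher priority value
    runs first. Threads of equal priority share the processor round-robin.
    """
    queues = {}
    for name, priority, work in threads:
        queues.setdefault(priority, deque()).append([name, work])

    trace = []
    for priority in sorted(queues, reverse=True):   # prioritization on top
        ready = queues[priority]
        while ready:                                # time-slicing beneath it
            entry = ready.popleft()
            trace.append(entry[0])
            entry[1] -= slice_units
            if entry[1] > 0:
                ready.append(entry)                 # back of the queue
    return trace


print(run_schedule([("decoder", 1, 2), ("vsp_event", 5, 1), ("ui", 1, 1)]))
```

A granule written as a thread would appear in this model as one more `(name, priority, work_units)` entry, which is why granulation and time-slicing do not interfere with each other.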

A software decoder, for example, has lower priority than that of a hardware event. The VSP by means of hardware interrupts can naturally preempt a host-based program and work to advantage in the Windows OS scheduler environment. The VSP briefly interrupts the host to raise its priority with the Windows OS scheduler. If the host were to lock out interrupts, it would simply become a single-tasking system, therefore the host should not do so. Thus, VSP is a "very good citizen" for the Windows OS.

Software tasks are each largely broken into fine granules that are easily modified and compiled not only on an x86 compiler but also a DSP compiler. ISVs (independent software vendors) can also download third-party granules of VSP code. VSP object code for a given origin source code of a granule is provided in a software object distinct from a software object containing x86 or other host processor object code compiled from the same origin source code.

The DirectDSP software schedules granules and responds when their execution completes. Indeed, the host CPU is multitasking and multithreading between granules which can be simply written as threads. Even though the host source code is granulated and recompiled, such recompiled source which has the OS active with multiple threads actually helps host performance on recompiled code compared to old code because the multitasking overhead of the OS is already taken for granted when a multitasking OS is chosen for the system. Even with "loose" or time-consuming OS code, which is sometimes encountered, the burden of OS multithreading is insignificant compared to the benefits gained when the old code is broken up into threads which will be run more optimally under Windows. When a thread which is waiting on resources is suspended, the rest of the task is still active. Alternatively when the old code is not broken up into multiple granules, it will bog down the host CPU while it is waiting on resources (akin to a single tasking environment).

If the DirectDSP software allocates two granules wherein one creates data and the other uses the data, a data dependency or synchronization issue is avoided by the system of "handles" by which pieces of software under Windows hand off from one task to another. Transactions under Windows OS are essentially file-based where source and destination handles are passed from one process/thread to another to facilitate program execution. Analogy with dataflow architecture applies except that software granules are linked between a host and DSP, rather than using close-coupled dataflow hardware. Analogy with link-list processing applies except that handles, not pointers, link the granules.

Advantageously, because of the judicious use of the system of handles as well as semaphores and interrupt preemption in a multitasking OS, no special synchronization flags are needed to resolve dependencies. Dataflow introduces overhead; Windows handles overhead already exist, and the granules introduce no extra overhead.

Consider an example: The handles help create the software analog of a hardware pipeline wherein operations overlap in different processes between the host CPU and the VSP. With granules and no DSP, MPEG (in FIG. 12 of incorporated U.S. patent application TI-Ser. No. 08/823,251) is executed by the host in frames each comprising a series of functions including Picture Reorder, Motion Estimation, DCT, Q, VLC, Q-inverse, DCT-inverse for each frame wherein each granule hands off to the next granule via the handles. With granules and with VSP, MPEG is further executed with a software application pipeline and is load balanced efficiently as follows: Do previous-frame (N-1) Motion Estimation on VSP while host does current-frame (N) Picture Reorder. The host Picture Reorder hands off to VSP current-frame (N) Motion Estimation. Concurrently, previous-frame (N-1) Motion Estimation on VSP hands off to previous-frame (N-1) DCT on host. Host executes granules to end of previous frame (N-1) and then does next-frame (N+1) Picture Reorder as VSP completes current-frame (N) Motion Estimation, whereupon the cycle repeats. All granules execute in the correct order, but with advantageous overlap of processing of two frames at once in the software pipelining approach under the proposal. The granules can be allocated differently between host and VSP without confusion provided the allocation algorithm detects sufficient available MIPs in either host or VSP to do the allocation differently.
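The overlap described above can be laid out as a schedule. The following Python sketch is one reading of the paragraph, under the assumption that Motion Estimation for frame N on the VSP spans the time the host needs to finish frame N-1 and reorder frame N+1. The function name `pipeline_schedule` and the cycle-based model are illustrative only.

```python
HOST_TAIL = ["DCT", "Q", "VLC", "Q-inverse", "DCT-inverse"]


def pipeline_schedule(num_frames):
    """Return a list of (host_work, vsp_work) pairs, one per pipeline cycle.

    host_work is a list of (stage, frame) tuples run in order on the host;
    vsp_work is a single (stage, frame) tuple or None.
    """
    if num_frames < 1:
        raise ValueError("need at least one frame")

    # Prologue: the host reorders frame 0 before the VSP has anything to do.
    schedule = [([("Picture Reorder", 0)], None)]

    for n in range(num_frames):
        host = []
        if n >= 1:
            # Finish the previous frame on the host.
            host += [(stage, n - 1) for stage in HOST_TAIL]
        if n + 1 < num_frames:
            # Then prepare the next frame while the VSP is still busy.
            host.append(("Picture Reorder", n + 1))
        schedule.append((host, ("Motion Estimation", n)))

    # Epilogue: the host finishes the last frame.
    schedule.append(([(stage, num_frames - 1) for stage in HOST_TAIL], None))
    return schedule


for cycle, (host, vsp) in enumerate(pipeline_schedule(3)):
    print(cycle, "host:", host, "vsp:", vsp)
```

In every cycle except the prologue and epilogue, the host and the VSP work on different frames, so each frame still passes through its stages in the original order.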

Both the x86 object and VSP object have the same data structure. Advantageously, the task either of them represents is executable by first selecting the host or the VSP, and then launching the corresponding software object for the task in the selected processor. The same data results either way.

Source code (e.g., C) leads to identically located data structures no matter which compiler flavor is used, because the header file in the DirectDSP API (application program interface) guarantees that the compiler will use the common data structure. The Windows OS manufacturer supplies a kit called the SDK to the ISVs and a kit called the DDK to the IHVs which they use in developing their software. If the software tasks are not revised into the granular form, the old application simply runs on the host as in the past. When the software tasks are rewritten into granular form for execution on the host and/or VSP(s) under Windows, the handles are already in the overhead. Therefore, calls to the DirectDSP API do not introduce new overhead. Furthermore, handoff transactions between granules occur within the thread and do not represent any call overhead to the OS.

To launch an object, the host runs the augmented Windows OS which determines relative loading of x86 and VSP MIPS at run-time. According to an allocation algorithm, the augmented Windows OS will either allocate the host software object to the host CPU or the corresponding VSP software object to the VSP. Meanwhile, data passes to and through system memory space according to the common data structure so that the processing site, as host or VSP, does not matter. This implies processor independence.
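The text does not spell out the allocation algorithm, so the following Python sketch uses a hypothetical policy purely for illustration: place the granule on the VSP when the VSP has enough free MIPS and would retain at least as much headroom as the host, otherwise run the host object. The class names, fields and MIPS figures are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Processor:
    name: str
    capacity_mips: float
    used_mips: float = 0.0

    def free_mips(self):
        return self.capacity_mips - self.used_mips


@dataclass
class Granule:
    name: str
    host_mips: float   # cost of the host software object
    vsp_mips: float    # cost of the corresponding VSP software object


def allocate(granule, host, vsp):
    """Choose a processing site and charge it; returns 'host' or 'vsp'."""
    host_left = host.free_mips() - granule.host_mips
    vsp_left = vsp.free_mips() - granule.vsp_mips
    if vsp_left >= 0 and vsp_left >= host_left:
        vsp.used_mips += granule.vsp_mips
        return "vsp"
    # Fallback (assumed): the host object always remains available.
    host.used_mips += granule.host_mips
    return "host"


host = Processor("host", capacity_mips=400, used_mips=350)
vsp = Processor("vsp", capacity_mips=100)
print(allocate(Granule("audio_decode", host_mips=60, vsp_mips=40), host, vsp))
print(allocate(Granule("video_scale", host_mips=30, vsp_mips=90), host, vsp))
```

Because both objects share the common data structure, the caller does not need to know which site was chosen; only the load accounting differs.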

The above technology is applied at any advantageous point in the PC system using one or more VSPs (wrappers and DSPs). Improvements or additions occur primarily at the location of the North Bridge, AGP Graphics (advanced graphics port), South Bridge, or elsewhere on the PCI bus as PCI agents.

The wrapper acts as a scatter-gather bus master and I/O accelerator by itself that boosts throughput of a multitasking system (even without a DSP chip or core) by relieving the host of I/O chores and providing byte channeling of 32-bit Dword host data into byte-aligned 16-bit VSP word format without host or VSP intervention. The wrapper also has a memory buffer for modem, voice/telephony and audio data. With a DSP, the VSP wrapper can "walk" the entire virtual memory space of the host memory system without host intervention thereby making the VSP a super bus master with virtual memory addressing capability beyond simple scatter-gather bus mastering. With a DSP, the VSP wrapper can further create ping-pong and circular buffers to advantageously unify the buffers currently used in modem, voice and audio applications by replacing modem, voice/telephony and audio add-in cards with the VSP circuitry.

In one system approach, the original equipment manufacturer (OEM) sells the PC with the wrapper chip on the motherboard. In FIG. 19, a small DSP socket on the motherboard is provided but left empty for an overdrive DSP retrofit. The overdrive DSP is sold by retailers to users who wish to upgrade with VSP capability. Or the OEM itself fills the DSP socket in a differentiated computer system product. Alternatively, for added power, the VSP wrapper can be upgraded into a full-blown VSP as in FIG. 20 with an embedded DSP core leaving the external overdrive socket for the second DSP upgrade to the system.

Much of the OEM business cost derives from product support activity. The VSP (wrapper-DSP chip) approach advantageously adds substantial computing power and fits well into the existing PC business model. This added power allows the OEM to install software that virtualizes some of today's hardware. Accordingly, the field-support cost of fixing real hardware is reduced. Moreover, bugs in the software that virtualizes the hardware can be fixed by the OEM directly, by downloading diagnostics and patches over the Internet.

Each OEM can customize the software that virtualizes the hardware, thereby allowing differentiation of its products from those of other OEMs, even those products of other OEMs who adopt the wrapper and DSP improvements too. Also, OEMs can differentiate their products by adding the VSP wrapper and/or DSP on their own OEM-determined schedule between introductions of various generations of the host microprocessor. But suppose a next-generation host microprocessor will add capabilities that may make that next-generation host able to do much of the work that a current host-plus-VSP would do. In such case, the OEM advantageously adds differentiation by combining the VSP into its next-generation host system too.

To leverage software value via the above improved technology, vendors advantageously write software tasks in a popular source code such as C code. They compile the application with an x86 compiler into x86 code, and compile it again, but with a DSP compiler, into DSP code. They purchase the DSP compiler from the DSP manufacturer for purposes of the second compile.

By using C code, vendors are free of any need to actually write in DSP native (assembly) code itself, if DSP code is unfamiliar to them. The compilation from C code to DSP object code is not burdensome. Vendors may want to recompile their software anyway such as to accommodate host microprocessor MMX multimedia instruction extensions. Embedding the DSP software objects into the software product is as convenient as embedding MMX video graphics in applications.

In this way, software vendors supply user-attractive code which not only runs adequately on conventional x86 machines lacking a VSP, but also later provides a substantial performance improvement on machines having or updated to have a VSP. Since the applications, such as DirectX games software, check for presence of all relevant hardware capability in a given system anyway, the presence of the VSP wrapper alone or with VSP is detected by the application. Therefore the improvements provided by this embodiment are totally transparent to the applications.

An example of a prior art system, from which more hardware is removed than which the wrapper/DSP adds, has a modem add-in card and an audio add-in card, among other add-in cards. These add-in cards are replaced by a single wrapper/DSP add-in card (or PCMCIA Cardbus dongle) which costs less, largely virtualizes application hardware, and more readily accommodates field testing remotely. Even greater savings occur when the wrapper/DSP is put on the motherboard.

IMPROVED SOFTWARE OPERATIONS AND PROCESSES

USP provides flexible digital signal processing MIPs for the PC and/or the Internet/Intranet platform. Various USP embodiments include improved methods, circuits and systems for combining and operating asymmetrical processors as VSPs (Virtual Signal Processors) as flexible, scalable, dynamically linked multi-processing virtual hardware for dynamically balancing MIPs among various processors (VSPs) in a system or a distributed/networked computing environment. In FIG. 7, VSPs are coupled to the system resources via internal (e.g. PCI/AGP, CPU) and external (e.g. IEEE 1394, USB) buses, LAN and WAN (e.g. ethernet, ATM). All VSPs are coupled to the computer main processor via software, the operating system, and shared main (host) memory.

FIGS. 17 and 50 show a VSP wrapper ASIC as logic coupled to a DSP. DSP backend interface logic couples the VSP to serial buses such as USB and IEEE 1394 to external peripherals.

In FIG. 92, improved software, herein called DirectDSP, DirectDSP HAL, DirectDSP WDM and ActiveDSP, runs on the host CPU/MMX. Further, software embodiments called VSP kernel and application granules (sub-tasks) run on the VSP core(s). FIG. 27 shows the relative software layers from the host OS to the VSP Kernel and VSP application granules below it (host application granules via emulation not shown). With multiple VSPs and kernels, multi-VSP resource management code is included in the DirectDSP HAL.

DirectDSP extends DirectX to intelligently distribute processing MIPs between the host CPU/MMX and the VSP(s) by parsing tasks into sub-tasks (granules) which then are run by either the host or VSP(s) in a dynamic and balanced fashion. Both host and VSP application granules are called by DirectDSP/DirectDSP HAL using multitasking and multithreaded Windows OS, COM-based (Component Object Model) DirectX and ActiveX as well as the host CPU/MMX and PC core logic. Direct DSP runs on top of the DirectDSP HAL or the DirectDSP WDM stack.

ActiveDSP is a name for same process embodiments for hardware accelerated multimedia services to ActiveX PC and Web applications. ActiveDSP is a software layer running on top of DirectDSP just as ActiveX is a layer on top of DirectX. ActiveDSP alternatively uses WDM Data Streaming provided by DirectDSP WDM or DirectDSP HAL to access VSP hardware.

The VSP Kernel and VSP application granules are DSP (digital signal processing) software modules running on a DSP core or DSP chip. DSP cores or chips from Texas Instruments range from the simple single instruction single data (SISD) type to the advanced VLIW type and the choice should be both application and cost driven.

Computations burn up CPU MIPs. Memory transactions include program execution and data manipulation; I/O transactions include busmaster or slave data transfers to and from system peripherals.

Because Windows is multi-tasking and multi-threaded, several tasks can use system memory simultaneously, wherein Windows manages the available memory and schedules multiple tasks. Blocks of memory called memory objects are allocated for run-time requirements. Allocated memory can also be movable and discardable wherein the memory objects are scattered around in the system memory map. A physically contiguous block of memory is allocated by gathering movable objects together into one contiguous object.

When a memory object is allocated, a handle, rather than a pointer, is generated to identify and to refer to the memory object. The handle is used to retrieve the current address of the allocated memory object. For example, a source handle references a source memory buffer. Processing puts data in a destination memory buffer which is referenced by a destination handle. When a task needs to access the memory object, the handle for that memory object is preferably locked down. The action of locking down a memory handle temporarily fixes the address of the memory object and provides a pointer to its beginning. While a memory handle is locked, Windows cannot move or discard the memory object. After the object is accessed or the object is not in use, the object handle is then unlocked to facilitate Windows memory management.
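The lock and unlock behavior described above can be modeled with a small handle table. This Python sketch is a toy model and not the Windows memory manager: the class `HandleTable`, its methods, and the relocation routine are hypothetical names introduced only to show why a handle, unlike a raw pointer, stays valid when the object moves.

```python
class HandleTable:
    """Toy model: handles identify movable memory objects."""

    def __init__(self, base=0x1000):
        self._entries = {}
        self._next_handle = 1
        self._next_free = base

    def alloc(self, size):
        """Allocate an object and return its handle (not its address)."""
        handle = self._next_handle
        self._next_handle += 1
        self._entries[handle] = {"addr": self._next_free, "size": size, "locks": 0}
        self._next_free += size
        return handle

    def lock(self, handle):
        """Fix the object in place and return a pointer to its beginning."""
        entry = self._entries[handle]
        entry["locks"] += 1
        return entry["addr"]

    def unlock(self, handle):
        entry = self._entries[handle]
        if entry["locks"] == 0:
            raise RuntimeError("handle is not locked")
        entry["locks"] -= 1

    def relocate_unlocked(self):
        """Stand-in for the memory manager moving objects around.

        Only objects with no outstanding locks may move.
        """
        for entry in self._entries.values():
            if entry["locks"] == 0:
                entry["addr"] = self._next_free
                self._next_free += entry["size"]


table = HandleTable()
src = table.alloc(64)
pointer = table.lock(src)        # address is stable while locked
table.relocate_unlocked()        # the locked object does not move
table.unlock(src)
table.relocate_unlocked()        # now it may move; the handle is still valid
print(hex(pointer), hex(table.lock(src)))
```

A task that kept only the first pointer would be reading stale memory after the second relocation, whereas locking the handle again always yields the current address.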

USP utilizes this fundamental memory management scheme to make a VSP an extension of the host CPU and to share host system memory and resources. USP provides a method for the VSP to grab memory object handles. Since Windows provides OS services for ascertaining the physical addresses of memory objects when they are locked down, the VSP grabs these handles by Direct DSP software operations that obtain the physical addresses of these handles through Windows and pass them on to the VSP. With these physical addresses, the VSP accesses memory objects (e.g. via the PCI bus) with VSP acting as a super busmaster for scatter-gather DMA transactions within the entire host accessible virtual memory space. The host CPU/MMX has elaborate paging hardware on-chip for accessing 64T bytes of virtual memory. VSP conveniently traverses the host virtual memory space as a super busmaster by using these handles (translated to physical addresses) provided by host and OS enhanced with DirectDSP operations.
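Once a memory object is locked and its physical pages are known, a scatter-gather transfer reduces to a list of (physical address, length) descriptors. The following Python sketch shows how such a list could be built. The 4096-byte page size, the dictionary used as a page table, and the function name `build_sg_list` are assumptions for illustration, not details taken from this disclosure.

```python
PAGE_SIZE = 4096  # assumed page size


def build_sg_list(virt_addr, length, page_table):
    """Translate a virtually contiguous buffer into physical segments.

    page_table maps virtual page number -> physical page number.
    Returns a list of (physical_address, byte_count) descriptors, with
    physically adjacent segments merged into one.
    """
    segments = []
    while length > 0:
        vpn, offset = divmod(virt_addr, PAGE_SIZE)
        chunk = min(length, PAGE_SIZE - offset)
        phys = page_table[vpn] * PAGE_SIZE + offset
        if segments and segments[-1][0] + segments[-1][1] == phys:
            # Physically contiguous with the previous segment: extend it.
            segments[-1] = (segments[-1][0], segments[-1][1] + chunk)
        else:
            segments.append((phys, chunk))
        virt_addr += chunk
        length -= chunk
    return segments


# Virtual pages 0 and 1 happen to be physically adjacent; page 2 is not.
table = {0: 7, 1: 8, 2: 3}
print(build_sg_list(100, 9000, table))
```

A bus master given this descriptor list can move the whole buffer without the host touching each page, which is the role the text assigns to the VSP as a super busmaster.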

In the hierarchy of a preemptive multi-threaded multi-tasking software system, each task (running state of a program) includes processes, threads (execution paths of a process) and procedures or function calls. In Windows, tasks are known as processes and the scheduler manages multiple threads on a preemptive basis. Improvements involve breaking down application tasks or processes into manageable threads and sub-tasks (granules) with fine granularity. A USP thread is written in host code which calls embedded application granules either written in host code or VSP code. Each granule can be as fine in granularity as a function call and uses memory transactions and VSP or host MIPS. The granule may also do I/O transactions which are regarded as memory transactions to and from system peripherals.

With the above handle mechanism, USP via DirectDSP dynamically allocates VSP MIPs and/or host CPU/MMX MIPs for computational loads and memory and I/O transactions. USP threads are written so that either host CPU/MMX or the VSP can perform computations and memory or I/O transactions by grabbing the suitable source handles and returning the results to the appropriate destination handles or peripherals (the VSP grabs these handles with the help of DirectDSP). This scheme allows MIPs distribution between the host and VSP.

If DirectDSP/DirectDSP HAL allocates two application granules wherein one creates data and the other uses the data, a data dependency or synchronization issue is avoided with this system of handles by which granules hand off from each to the next. Transactions under Windows OS are essentially file-based where source and destination handles are passed from one process/thread to another to facilitate program execution. Since Windows is a multi-tasking, multi-threaded OS, USP threads are synchronized with host operations (tasks or threads) with semaphores and mutexes which are synchronization objects in Windows for controlling process entry and exit of critical sections. Since Windows is also preemptive, a VSP application granule (embedded in a USP thread) suitably preempts a host thread for Windows OS attention. This preemption is achieved through the hardware interrupt mechanism of the host CPU/MMX.

FIG. 47 (of the incorporated U.S. patent application Ser. No. 08/823,251) shows a 32-bit Windows preemptive multi-tasking multi-threaded software environment wherein a 32-bit USP driver thread (which either calls host granule(s) or is called by a client host granule for services) executes in full synchronization with a VSP application granule(s) running on the VSP hardware. The VSP granule as code embedded in a VSP thread is called from the DirectDSP HAL. In general, a VSP thread (vertical rectangle under DSP32.DLL) is a USP thread that either calls VSP granule(s) or is called by VSP granule(s) for services via a VSP hardware interrupt to the host CPU/MMX. A synchronization mechanism in the Windows OS is the event signaling semaphore mechanism and its associated event, as well as hardware interrupt preemption. In the above example, the synchronization mechanism comprises a WaitForSingleObject semaphore for the USP driver thread, the SignalObject semaphore processed by the DirectDSP HAL, and VSP hardware interrupt preemption. The sequence of events is as follows:

USP driver thread (host Granule) calls DirectDSP HAL and waits on processing results from the VSP granule by synchronizing its operation with that of the VSP granule.

At this point, the USP host granule thread is actually suspended by waiting on the semaphore WaitForSingleObject i.e. waiting on resources that it needs from the VSP granule.

The VSP has finished processing and issues a hardware interrupt to the host.

The DirectDSP HAL sees this interrupt and services it while scheduling an Event (part of the signaling mechanism) which is associated with a SignalObject semaphore.

The signaling mechanism completes by processing the Event, in which the SignalObject semaphore is called to signal the WaitForSingleObject semaphore on which the host granule thread is suspended.

Processing now returns to the Virtual Machine (VM) where the host granule thread resides.

The host granule thread is now signaled by the SignalObject semaphore and comes out of suspension to grab the VSP processing results.

The host granule thread now continues its processing to completion with the VSP processing results i.e. resources it needed to complete its processing.
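The sequence above follows a wait-then-signal pattern. The Python sketch below imitates it with a worker thread standing in for the VSP and a `threading.Event` standing in for the event that the DirectDSP HAL signals after the interrupt. It is an analogy only: it does not call any Windows API, and the function names are hypothetical.

```python
import threading


def run_granule_with_vsp(samples):
    """Host granule: start VSP work, suspend until signaled, use the result."""
    result = {}
    done = threading.Event()          # analog of the signaled event

    def vsp_granule():
        result["sum"] = sum(samples)  # placeholder for the DSP computation
        done.set()                    # analog of interrupt plus HAL signaling

    worker = threading.Thread(target=vsp_granule)
    worker.start()                    # host granule calls the HAL
    done.wait()                       # host granule thread is suspended here
    worker.join()
    return result["sum"]              # host continues with the VSP results


print(run_granule_with_vsp([1, 2, 3]))
```

While the host granule thread is blocked in `done.wait()`, other threads remain schedulable, which mirrors the point that only the waiting thread is suspended and not the whole task.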

FIG. 48 (of the incorporated U.S. patent application Ser. No. 08/823,251) shows the 16-bit Windows software environment wherein a 16-bit USP driver process (vertical rectangle under DLL, which either calls host granule(s) or is called by a client host granule for services) executes in full synchronization with a VSP application granule(s) running on the VSP hardware. Again, the VSP granule is called from the DirectDSP HAL. Synchronization mechanism used is a callback notification mechanism and its associated event as well as hardware interrupt preemption.

The sequence of events is as follows:

The application registers a callback function with the USP driver process via Windows. This callback function is now tied to the VSP hardware interrupt.

At this point, the USP driver process (DirectDSP DLL) calls the DirectDSP HAL to signal processing of the VSP granule(s).

The VSP has finished processing and issues a hardware interrupt to the host.

The DirectDSP HAL sees this interrupt and services it while scheduling an Event (part of the signaling mechanism).

The signaling mechanism is complete by processing the Event in which the Callback function (small vertical rectangle) is called to signal the application that the VSP has done processing.

Processing now returns to the Virtual Machine (VM) where the host application resides.
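The callback variant of the sequence can be sketched in the same spirit. In this Python model the class `DriverProcess` and its methods are hypothetical; the "VSP" work is computed immediately for simplicity, but its result is delivered to the application only when the simulated interrupt is serviced.

```python
class DriverProcess:
    """Toy model of a driver process that notifies the application by callback."""

    def __init__(self):
        self._callback = None
        self._pending = None

    def register_callback(self, callback):
        """Application registers the function to call when the VSP is done."""
        self._callback = callback

    def start_granule(self, granule, data):
        """Stand-in for calling the HAL to start VSP processing."""
        self._pending = granule(data)

    def on_vsp_interrupt(self):
        """Stand-in for servicing the interrupt and processing the Event."""
        if self._callback is None:
            raise RuntimeError("no callback registered")
        result, self._pending = self._pending, None
        self._callback(result)


received = []
driver = DriverProcess()
driver.register_callback(received.append)
driver.start_granule(max, [3, 9, 4])
driver.on_vsp_interrupt()
print(received)
```

The application never polls for completion; it learns of the result only through the registered function, as in the steps above.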

With the VSP in the PC, the host suitably also performs parallel processing and application pipelining using the handle mechanism. Tasks are set up to masquerade as I/O transactions using I/O busmasters to offload the host and avoid overtaxing the OS. As a super I/O busmaster, the VSP offloads the host using scatter-gather DMA capability for I/O transactions.

The VSP is tightly coupled to the host processor in task execution through the Windows OS and DirectDSP and yet physically decoupled (i.e. distributed) from the host to avoid a host-centric processing bottleneck, which causes system imbalance wherein a very powerful host CPU hogs bus and memory bandwidth.

In FIG. 6 of the incorporated U.S. patent application Ser. No. 08/823,251, USP enhances the basic superscalar Pentium CPU by providing a third processing or execution pipe with out-of-order execution of DSPops (DSP macro operations comprised of DSP instructions) running on the VSP. An application program comprises processes (tasks) and/or threads with a series of Memory and/or I/O transactions. If the memory handles were pointers, this execution scheme resembles a processing link-list for the granules of each application. With each granule executing on a combination of the U, V pipes or the DSP pipe, the VSP constitutes a superscalar extension of the CPU/MMX with DSPops scheduled and dispatched to it via DirectDSP. The VSP can be programmed as a Scalar (SISD), Vector (SIMD), or VLIW macrostore for DSPops.

In the Pentium CPU/MMX, instructions are dispatched to the U and V Pipes and execution is complete on instruction boundary. In the Pentium Pro, instructions are further executed out-of-order and results are only committed as the execution of a group of instructions is complete with branch predictions correctly made. In the VSP Pipe, DSPops are dispatched in groups (granules) by DirectDSP and executed out-of-order with the instructions of the Pentium (or Pentium Pro). Executions of DSPops complete on I/O and memory transaction rather than CPU/MMX instruction boundary. Both the host CPU/MMX and the VSP application granules use the same data structures as defined by DirectDSP.

Porting applications to the USP platform is suitably a very gradual process and begins by replacing a small part of existing host code with a VSP application granule. For example, such host granules written to perform USP sub-tasks as function are recompiled to run on the VSP as application granules with little or no change necessary. This allows a gradual migration but with a quick-time-to-market productization approach for acceleration with VSP(s).

Some methods herein utilize file-based transactions under Windows OS where source and destination handles are passed from one process/thread to another to facilitate task execution. Handles resemble pointers, but they are distinct in this technology. In FIG. 7 of the incorporated U.S. patent application Ser. No. 08/823,251, CPU/MMX works on source data in source memory space by obtaining a source handle. The results are then passed to destination memory space via a handle for further processing by the VSP which grabs a destination handle via DirectDSP. The VSP processing results in destination space are forwarded with a handle to the next processing stage, perhaps by the CPU/MMX and so on. If handles are thought of as pointers (once the memory objects are locked down), some embodiments create a link-list of transactions and a task is broken up into a series of system memory transactions and/or I/O transactions performed with CPU/MMX or VSP MIPs where the CPU/MMX and VSP are essentially coupled together via shared host system memory.

In FIG. 28, the VSP program and data memory required for DSPops reside in the host system memory accessible via the VSP memory handle. USP utilizes system memory to reduce the VSP implementation cost. For example, for downloadable Wavetable Synthesis using an instantiation of the USP architecture that supports DLS Wavetable (32 voices), the host system (main) memory utilization is about 512K bytes. For a typical application, the average amount of main memory required is less than 100K bytes. For fine granules such as DSP functions (e.g. DCT or FFT), the code size is only a few K bytes and for filtering operations it would be as negligible as a few bytes (the VSP has a single 16-bit instruction for filtering).

USP implements a software caching scheme to insert the VSP memory spaces into the host virtual memory space thereby utilizing the host's caching mechanism as well as its own for memory accesses. The program code and data for the VSP are continually cached into the DSP core or chip from the VSP wrapper program and data space in host (system) virtual memory for execution as shown in the VSP software caching model, FIG. 9 of the incorporated U.S. patent application Ser. No. 08/823,251. Since the data processed by the VSP are real-time digital signals or non-cacheable data, a software (paging) caching scheme rather than a traditional Pentium CPU caching scheme is used for the VSP. A traditional L1, L2 type of write-back or write-through cache might have the undesirable effect of cache thrashing when used with non-cacheable data. The VSP software or paging cache acts as macrostore for DSPops executed in parallel with host CPU/MMX instructions.

Only portions of program and/or data are cached in local VSP memory at any given time. This means that little or no VSP local memory is needed for applications, compared to dedicated-function DSP cards. Caching is performed on a host cache line basis and VSP application granules are dynamically replaced in VSP local memory, obviating burdens on host system operations for VSP download transactions.
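The dynamic replacement of granules in VSP local memory can be illustrated with a small software cache. The replacement policy is not stated in the text, so this Python sketch assumes least-recently-used eviction and caches whole granules rather than individual cache lines. The class `GranuleCache`, the `fetch` callable and the byte sizes are all hypothetical.

```python
from collections import OrderedDict


class GranuleCache:
    """Toy model of VSP local memory holding a few granules at a time."""

    def __init__(self, capacity_bytes, fetch):
        self.capacity = capacity_bytes
        self.fetch = fetch              # callable: name -> bytes from host memory
        self.resident = OrderedDict()   # name -> code, oldest use first
        self.used = 0

    def load(self, name):
        """Return the granule's code, fetching and evicting as needed."""
        if name in self.resident:
            self.resident.move_to_end(name)      # mark as most recently used
            return self.resident[name]
        code = self.fetch(name)
        if len(code) > self.capacity:
            raise ValueError("granule larger than local memory")
        while self.used + len(code) > self.capacity:
            _, evicted = self.resident.popitem(last=False)   # evict oldest
            self.used -= len(evicted)
        self.resident[name] = code
        self.used += len(code)
        return code


host_memory = {"dct": bytes(400), "fft": bytes(500), "fir": bytes(300)}
cache = GranuleCache(1024, host_memory.__getitem__)
for name in ["dct", "fir", "dct", "fft"]:
    cache.load(name)
print(list(cache.resident), cache.used)
```

In this run the filter granule is evicted to make room for the FFT granule, so local memory never needs to hold the whole application at once.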

In the host, most application data is byte oriented and stream I/O in nature. DirectDSP sets up streaming buffers avoiding the overhead of static buffers. The host application cannot guarantee that the data in its main memory is byte aligned or aligned to doubleword boundary. The VSP, however, uses data aligned as 16-bit words. In VSP implementation, the VSP wrapper logic utilizes a hardware channel Steering technique to speed up data transfers between host system memory and VSP over the PCI bus. Basically, the VSP can access any byte in random order out of a 32-bit double word within a cache line during a PCI transfer. No valuable VSP MIPs are lost to re-ordering data bytes or formatting bytes for VSP consumption.

In FIGS. 54, 54A, 54B, 54C, byte channeling refers to ordering bytes into word aligned boundaries. The hardware logic looks at the address in host system memory from which to start a transfer, and the destination address in VSP wrapper DPRAM (dual-ported memory organized as four byte columns). From these addresses, a variable shift count is determined as 0, 1, 2 or 3 bytes. As part of the FIFO I/F to the DPRAM, a counter is provided for each byte column in the memory. By incrementing these counters when the memory is enabled, the desired bytes are entered into the DPRAM in the correct position by the shifter (implemented as a data multiplexer).
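A software model may help in following the byte channeling just described. The sketch below is an illustration only, not the circuit of FIGS. 54-54C: it assumes the shift count is the difference of the source and destination byte lanes modulo 4, and it models the DPRAM as a list of four-byte rows. The function names shift_count and channel_bytes are hypothetical.

```python
def shift_count(src_addr, dst_addr):
    """Byte shift (0, 1, 2 or 3) from the source byte lane to the destination lane."""
    return (dst_addr - src_addr) % 4


def channel_bytes(data, src_addr, dst_addr, rows):
    """Steer bytes starting at src_addr into a DPRAM modeled as rows of four byte columns."""
    shift = shift_count(src_addr, dst_addr)
    dpram = [[None] * 4 for _ in range(rows)]
    # One counter per byte column: the next row to be written in that column.
    counters = [(dst_addr + (col - dst_addr) % 4) // 4 for col in range(4)]
    for i, byte in enumerate(data):
        src_lane = (src_addr + i) % 4
        col = (src_lane + shift) % 4   # shifter modeled as a lane multiplexer
        dpram[counters[col]][col] = byte
        counters[col] += 1             # counter advances when its column is written
    return dpram
```

For example, five bytes starting at source address 1 and destined for DPRAM address 2 use a shift count of 1; the first two bytes land in columns 2 and 3 of row 0, and the remaining three in columns 0 to 2 of row 1.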

Because the majority of VSP instructions are single word type, automatic DSP code compression and data size compression advantageously result, compared to 64-bit CPU/MMX application codes that require 64-bit program and data alignment to avoid a speed penalty. Since VSP program and data widths are only 16-bit wide and VSP instructions and addressing modes are more powerful than those of the host, VSP threads are much more compact in size than a host thread. This built-in program and data compression is very attractive for very memory intensive multimedia applications.

With a linked list of memory transactions and asymmetrical VSP multi-processing, an application software pipeline is established wherein the CPU/MMX and asymmetrical VSP collaborate on task execution with pipeline stages as shown in FIGS. 12, 13, 14 and 17 of the incorporated U.S. patent application Ser. No. 08/823,251. Each pipeline stage can be executed by either the host CPU or the VSP to speed up the system throughput. If the host CPU is more efficient in writing to the screen, it suitably performs the pipeline stage for displaying graphics by a granule allocation from DirectDSP. The VSP, due to its efficiency in signal decompression, should perhaps work with compressed data upstream in the application pipeline to conserve system bus bandwidth. Accordingly, a decompression granule is allocated to the VSP by DirectDSP. On the other hand, if the Graphics/Video controller has a Zoom Video Port, the granules are allocated to the VSP to write directly to the frame buffer. This shows the flexibility afforded by the USP architecture.

Some process embodiments advantageously redirect data to where it needs to be processed, thereby redistributing system MIPs and bandwidth utilized for compression and decompression tasks. For example, DirectDSP granule allocation dispatches compressed MPEG video/audio or AC3 audio to the VSP for processing, so that compressed audio transfers across the system bus instead of host-decompressed video/audio. In addition, both bus bandwidth and memory utilization are less burdened if the video/audio output is further sent to a codec coupled to the VSP back-end. If the host CPU were to decompress MPEG or AC3 audio, it would have to send decompressed audio output across the system bus to the codec, thereby causing more bus bandwidth utilization. Also, because of program and data alignment issues of the host CPU/MMX architecture, more memory bandwidth/utilization is required. By contrast, the VSP decompression utilizes very compact DSP program code and efficiently handles non-cacheable audio/video data. Not only does host processing use up more data and code memory bandwidth, but multimedia non-cacheable data will also thrash the host L1 and L2 caches, with excessive uncontrollable latency detrimental to real-time signal processing.

In FIG. 12 of the incorporated U.S. patent application Ser. No. 08/823,251, with host CPU only, MPEG tasks are sequentially executed and the CPU only devotes a portion of the real-time to each task. Therefore, the time slots outside of each task are devoted to other tasks and can be considered as "dead time" as far as the current task is concerned.

In the lower two bands of this FIG. 12, the system has VSP and host processing the tasks in parallel. For example, DirectDSP may allocate the motion estimation task to the VSP which can devote most of the frame time to Motion Estimation alone for higher system throughput. In this way, the VSP advantageously uses system "dead time" inaccessible to the host CPU. Also, MIPs demand on the VSP is less than that of the host CPU since it has effectively borrowed more time (a whole frame interval) for executing the Motion Estimation task. In other words, lower bandwidth VSP can perform tasks previously requiring a high bandwidth CPU to perform.

In the MPEG example, at the end of Frame N-1, the host picture re-order processing block sets up the memory buffers for motion estimation for the VSP to perform in frame N so that the results are used by the host in frame N+1. This parallel pipelining method generally enhances other algorithms that use multiple frames for decode purposes, wherein the method "steals" system dead time across a frame boundary, achieving a time dilation unavailable to a single CPU system. The VSP is fully integrated into the host architecture by operating as a second CPU which directly accesses or shares host resources. Advantageously, task partitioning into sub-tasks (granules like re-order, motion estimation, DCT and Q) fully utilizes the very different architectures of the host CPU/MMX and the VSP(s), resulting in compact code and efficient task execution.
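The three-interval hand-off described above (host setup, then VSP motion estimation, then host use of the results) can be laid out as a simple schedule. The sketch below is illustrative only; the function name pipeline_schedule and the text labels are hypothetical, and each interval is idealized as exactly one frame time.

```python
def pipeline_schedule(num_frames):
    """List, per frame interval, the work assumed for the host and for the VSP.

    Buffers set up by the host in interval t are processed by the VSP in
    interval t + 1, and the results are consumed by the host in interval t + 2.
    """
    schedule = []
    for t in range(num_frames + 2):
        host = []
        if t < num_frames:
            host.append(f"setup buffers for frame {t}")
        if t >= 2:
            host.append(f"consume motion vectors of frame {t - 2}")
        vsp = f"motion estimation for frame {t - 1}" if 1 <= t <= num_frames else None
        schedule.append({"interval": t, "host": host, "vsp": vsp})
    return schedule
```

In steady state each interval has the host doing two short steps while the VSP spends the whole interval on motion estimation for a single frame.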

In FIG. 28, all VSP memory and I/O transfers to and from the host system virtual memory are cacheline-based stream I/O operations. The VSP virtualizes the DMA controller and interrupt handler as follows:

Super busmaster with scatter-gather DMA to access all (e.g., 64 Tbytes) of virtual memory space in host memory. This entails "walking" individually scattered 4K pages under Windows 9x.

I/O re-direction of data for bus-independent output or re-targeting of data to different output devices.

Stream I/O facilitates I/O and memory transfers at byte boundaries for host applications reducing the data alignment issues in the x86 architecture. Byte Steering is used to pick out the correct byte in a doubleword for the VSP word-based operations.

DSPops interleaved with memory and I/O transactions to minimize latency issues on the PCI bus and to maximize throughput. When there is another PCI agent on the bus, the VSP processes data instead of performing I/O or memory transfers thereby avoiding PCI bus latency.

Real-time multimedia interrupts are effectively virtualized and handled by the VSP instead of the host CPU/MMX to avoid host context switching overhead for external interrupts under Windows. Another implementation slows down an external high-frequency interrupt by splitting interrupt processing into two stages wherein the high-frequency stage is handled by the VSP with a guaranteed response time and the processed interrupt is passed on to the host if necessary as a low-frequency interrupt from the VSP. The host CPU/MMX then processes the low-frequency interrupt with a short interrupt service routine (ISR) which schedules a deferred procedural call (DPC) to finish off the processing for the external event. DPC does not interfere with the processing of other Windows threads, since the ISR is extremely short (i.e. small fixed overhead). Advantageously, other events, threads or processes are minimally locked out, thereby streamlining operations in a multi-tasking multi-threaded system and/or multiprocessor system.

Deterministic response time for real-time applications is afforded when the VSP is used to guarantee processing time to the external events/interrupts and control latency due to its processing for the most critical (high-frequency portion) part of the real-time event processing. The VSP operations blend into the Windows OS operations for optimum execution. In real time systems, latency refers to the total time that it takes the host CPU to acknowledge and handle an interrupt. Consider a time interval occupied by high-frequency VSP interrupt handling followed by low-frequency host ISR and then non-time-critical Windows thread execution with a DPC. That time interval encompasses all operations that handle an external real-time multimedia interrupt, and can be substantially determined and controlled according to the processes of operation and architectural embodiments disclosed herein.
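The two-stage interrupt split described above can be modeled in a few lines. This is a sketch under stated assumptions, not the disclosed implementation: the decimation factor block_size, the class name TwoStageInterrupt, and the use of a simple sum as the deferred processing are all hypothetical placeholders.

```python
from collections import deque


class TwoStageInterrupt:
    """Toy model: VSP handles every event, host is interrupted once per block."""

    def __init__(self, block_size):
        self.block_size = block_size  # assumed number of events per host interrupt
        self.pending = []             # samples gathered by the high-frequency stage
        self.dpc_queue = deque()      # deferred work scheduled by the host ISR
        self.host_interrupts = 0

    def vsp_handle(self, sample):
        """High-frequency stage: runs on every external interrupt."""
        self.pending.append(sample)
        if len(self.pending) == self.block_size:
            block, self.pending = self.pending, []
            self.host_isr(block)      # pass on as one low-frequency interrupt

    def host_isr(self, block):
        """Low-frequency stage: short ISR that only schedules deferred work."""
        self.host_interrupts += 1
        self.dpc_queue.append(block)

    def run_dpcs(self):
        """Deferred processing, run later outside the ISR (placeholder: sum each block)."""
        results = []
        while self.dpc_queue:
            results.append(sum(self.dpc_queue.popleft()))
        return results
```

With a block size of 4, ten external events produce only two host interrupts, and the host ISR does a fixed, small amount of work regardless of how much deferred processing follows.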

In general, a multi-tasking, multi-threaded OS schedules tasks more efficiently if they appear to the OS as asynchronous I/O tasks which require minimal host intervention and less "thrashing" of the host cache(s). The DirectDSP, HAL, DSP kernel and VSP arrange multimedia tasks into this form. In this way, the system is more balanced and its throughput accelerates. Asynchronous I/O is a very powerful mechanism for real-time applications where each task can queue I/O loads (tasks) and continue processing without having to either wait or respond immediately to some end-of-I/O event. Apart from minimal host intervention and less cache "thrashing", this pays enormous dividends on multi-processor systems and reduces I/O overhead on single processor systems.

The VSP acting as a super busmaster becomes an asynchronous I/O controller which not only comprehends, spans and traverses the entire host virtual memory space but also provides processing MIPs with each transfer. The VSP acts as a powerful I/O "traffic cop" that streamlines host operations and increases system throughput.
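To make the traversal of scattered host pages concrete, the following sketch builds a scatter-gather list for a virtually contiguous buffer. It is an illustration only: the page_table dictionary (virtual page number to physical frame number) and the function name build_scatter_gather are hypothetical, and a 4K page size is assumed as mentioned earlier for Windows 9x.

```python
PAGE_SIZE = 4096  # 4K pages, as assumed in the description


def build_scatter_gather(virt_addr, length, page_table):
    """Split a virtually contiguous buffer into (physical address, size) entries.

    page_table is a hypothetical mapping: virtual page number -> physical frame number.
    Each entry stays within one page, so physically scattered pages are walked one by one.
    """
    entries = []
    while length > 0:
        page, offset = divmod(virt_addr, PAGE_SIZE)
        chunk = min(length, PAGE_SIZE - offset)   # never cross a page boundary
        entries.append((page_table[page] * PAGE_SIZE + offset, chunk))
        virt_addr += chunk
        length -= chunk
    return entries
```

A 5000-byte buffer starting 96 bytes before a page boundary yields three entries (96, 4096 and 808 bytes), each pointing at whichever physical frame backs that page.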

USP can advantageously operate even with the wrapper only, in host-based signal processing. The wrapper ASIC acts as a standalone chip with a pass-through mode for I/O devices such as the audio, voice and modem codecs (or AC97 codec). In this pass-through mode, the VSP wrapper is either a slave or busmaster. As a busmaster, the VSP wrapper relieves the host of I/O chores.

Advantageously, USP does not need an OS of its own. See FIG. 8 of the incorporated U.S. patent application Ser. No. 08/823,251. Instead, USP uses Windows OS as its own OS via DirectDSP and the real-time VSP Kernel software (USP resource management is built into DirectDSP and the VSP kernel). This software architecture is both complementary and non-competing with the Windows OS. In the preemptive, multi-threaded, multi-tasking Windows OS, processes and threads are normally running at S/W IRQLs with lower priorities than the H/W IRQLs. Although threads can be raised to real-time high priority via software, they are still at or below IRQ2 (dispatch). In FIGS. 29, 30, 31 and 32, by tying a process or thread to a H/W event/interrupt (IRQ12-IRQ27), USP raises the process or thread priority to above other software (host-based) processes or threads.

Short interrupt service routines (ISRs) are used along with deferred procedural calls (DPCs) as well as I/O request packets (IRPs) to improve system latency and turnaround time. DirectDSP WDM (or DirectDSP HAL) operates at ring 0 to reduce ring transitions to ring 3 for resources. This provides software latency control for real-time applications.

Not only do VSPs efficiently handle real-time events and multimedia, they further enhance the Windows OS by virtualizing real-time Interrupts and DMAs. A VSP can even act as an MMX emulator/accelerator or a WDM accelerator accelerating the Windows OS.

Balancing system resources with USP to prevent or alleviate bus (CPU, memory and I/O) overloading, memory and I/O bottlenecks as well as the undesirable CPU-bound MIPs (i.e. stalled CPU) involves carefully analyzing resource (MIPs, memory and bus I/O) utilization of each application against run-time resources available. Along with each computational load, comes the associated memory and I/O loads to sustain its MIPs requirements. Load balancing options depend on remaining or available system resources.

USP architecture handles acceleration for multimedia tasks using in-line and multi-pass models and achieves dynamic load balancing in systems such as FIG. 1 of the incorporated U.S. patent application Ser. No. 08/823,251 and FIG. 1 herein. Improvements herein are provided in:

In-line acceleration model (Source to I/O & I/O to Destination)

Host memory data to be processed for output to I/O devices

In-place processing of real-time stream I/O data for input to host memory

Multi-pass acceleration model (Source & Destination Handles)

File-format conversion where files in host memory have to be converted and then returned to host memory

Frame-based compression & decompression algorithms in conjunction with Host CPU for parallel processing

In dynamic load balancing, the DirectDSP software uses Microsoft's multitasking and multithreaded Windows OS and COM-based software to dynamically sense the system hardware capabilities when an application opens, and when it loads/unloads hardware resources for plug and play. COM-based objects are controlled by the COM-interface which allows an application to change the characteristics of the available hardware platform when interrogated by the application. Thus, the USP architecture achieves system scalability and flexibility through dynamic hardware linking.

The DirectX COM-based API has an application query a system for hardware description and capabilities at run-time while substituting the absent hardware features with host emulation where possible. Unlike DirectX, which merely substitutes host emulation for absent hardware features, the improved process herein uses available VSP MIPS for emulation as well, and dynamically balances application loads.

Moreover, the improved process does not limit host emulation to absent hardware only. Instead, the process also uses host emulation when the host is best suited to perform the application granules, for load balancing purposes. An important difference is task allocation based on fine granularity.
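Fine-grained task allocation of this kind can be sketched as a greedy assignment of granules to whichever processor handles each one most cheaply while capacity remains. The sketch below is a hedged illustration, not the DirectDSP algorithm: the function name allocate_granules, the per-granule MIPs costs and the available-MIPs budgets are hypothetical values chosen for the example.

```python
def allocate_granules(granules, available_mips):
    """Greedy granule allocation (illustrative assumption, not the disclosed method).

    granules: list of (name, {"host": cost, "vsp": cost}) with hypothetical MIPs costs.
    available_mips: MIPs still free on each processor.
    Returns the plan (granule -> processor or None) and the MIPs left over.
    """
    remaining = dict(available_mips)
    plan = {}
    for name, cost in granules:
        # Processors that still have enough capacity for this granule.
        fits = [p for p in sorted(cost) if cost[p] <= remaining.get(p, 0)]
        if not fits:
            plan[name] = None          # no processor can take the granule
            continue
        best = min(fits, key=cost.get)  # cheapest processor for this granule
        remaining[best] -= cost[best]
        plan[name] = best
    return plan, remaining
```

In such a model a display granule may go to the host and a decompression granule to the VSP, while a granule the VSP would normally take falls back to the host once the VSP budget is exhausted.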

For example, an application queries the DirectDSP API embodiment for system device configuration at run-time. DirectDSP in turn queries the DirectDSP HAL embodiment regarding the H/W device capabilities. In other words, DirectDSP dynamically interrogates the DirectDSP HAL for hardware availability and reports available VSP MIPs to the application in terms of hardware description and capabilities supported for a balanced system. Applications, however, cannot access DirectDSP HAL directly. They have to go through the DirectDSP layer