United States Patent
5598514
Purcell; et al.
January 28, 1997
Title
Structure and method for a multistandard video encoder/decoder
Abstract
A structure and a format provide a video signal encoder under the MPEG (Motion Picture Experts Group) standard. In one embodiment, the video signal interface is provided with a decimator for providing input filtering for the incoming signals. In one embodiment, the central processing unit (CPU) and multiple coprocessors implement discrete cosine transform (DCT), inverse discrete cosine transform (IDCT) and other signal processing functions, generate variable length codes, and provide motion estimation and memory management. The instruction set of the central processing unit provides numerous features in support of such functions as alpha filtering, eliminating redundancies in video signals derived from motion pictures, and scene analysis. In one embodiment, a matcher evaluates 16 absolute differences to evaluate a "patch" of eight motion vectors at a time.
Inventors:
Purcell; Stephen C. (Mountain View, CA)
Le Gall; Didier J. (Los Altos, CA)
Bose; Subroto (Santa Clara, CA)
Assignee:
C-Cube Microsystems (Milpitas)
Appl. No.:
105253
Filed:
August 9, 1993
Current U.S. Class:
345/418
345/474
345/501
715/719
Field of Search:
395/118,114,128,133,152-154,162-166 358/133-138,140,141,142 382/232-240,244-253
U.S. Patent Documents
3252148    May 1966          Mitchell
3812467    May 1974          Batcher
4489395    December 1984     Sato
4514808    April 1985        Murayama et al.
4559608    December 1985     Young et al.
4591976    May 1986          Webber et al.
4779190    October 1988      O'Dell et al.
4816914    March 1989        Ericsson
4838685    June 1989         Martinez et al.
4870563    September 1989    Oguchi
4935942    June 1990         Hwang et al.
4973860    November 1990     Ludwig
5014187    May 1984          Debize et al.
5049993    September 1991    LeGall et al.
5099322    March 1992        Gove
5231484    July 1993         Gonzales et al.
5335321    August 1994       Harney et al.
Foreign Patent Documents
0192292      Aug., 1986    EP
0267578      May, 1988     EP
0287891      Oct., 1988    EP
0292943      Nov., 1988    EP
0325856      Aug., 1989    EP
0395271      Oct., 1990    EP
0444660      Sep., 1991    EP
0446001      Sep., 1991    EP
0447203      Sep., 1991    EP
0447234      Sep., 1991    EP
0453653      Oct., 1991    EP
0456394      Nov., 1991    EP
0466981      Jan., 1992    EP
0478132      Apr., 1992    EP
0500174      Aug., 1992    EP
0503956      Sep., 1992    EP
0528366      Feb., 1993    EP
0572263      Dec., 1993    EP
0574748      Dec., 1993    EP
0588668      Mar., 1994    EP
0637894      Feb., 1995    EP
2037117      Jul., 1980    GB
2236449      Apr., 1991    GB
63-245716    Feb., 1989    JP
9103123      Mar., 1991    WO
9106182      May, 1991     WO
9312486      Jun., 1993    WO
9321733      Oct., 1993    WO
Other References
Proceedings Of The IEEE 1991 Custom Integrated Circuits, May 1991, San Diego, CA. pp. 12.2.1-12.2.4, XP 000295731, Bolton et al., "A Complete Single-Chip Implementation of the JPEG Image Compression Standard".
Article entitled "Image-Processing Chip Set Handles Full Motion Video", David Bursky, Electronic Design, Cleveland, Ohio pp. 117-120, May, 1993.
IEEE 1988 Custom Integrated Circuits Conference, May 1988 Rochester, N.Y., US, pp. 24.5-1-24.5.4, XP 000011005 Keshlear et al. "A High Speed 16-Bit Cascadable Alu Using An Aspect Standard Cell Approach".
IBM Technical Disclosure Bulletin, vol. 30, No. 11, Apr. 1988 New York US pp. 288-290, `Improved Zero Result Detection When Using A Carry Look-Ahead Adder`.
Patent Abstracts of Japan vol. 8 No. 248 (P-313), 14 Nov. 1984 & JP-A-59 121539 (Fujitsu Kabushiki Kaisha).
Xerox Disclosure Journal, vol. 13, No. 4, Aug. 1988 Stamford, Conn, US, pp. 229-234, XP 000098156, Marshall, `Fast NMOS Adder`.
Patterson, et al. `Computer Architecture A Quantitative Approach` 1990, Morgan Kaufmann, San Mateo, CA US, XP 000407059.
IBM Technical Disclosure Bulletin, vol. 25, No. 11A, Apr. 1983 New York US, pp. 5613-5620, `Phase Selector and Synchronizer for Two Phase Clock Circuits`.
IEEE Transactions on Circuits and Systems, vol. 36, No. 10, Oct. 1989 New York US, pp. 1275-1280, XP 000085314 Schmidt "A Memory Control Chip For Formatting Data Into Blocks Suitable For Video Coding Applications" p. 1277, right column, paragraph 2--p. 1278, left column, paragraph 3.
IEEE Transactions on Consumer Electronics, vol. 38, No. 3, Aug. 1992 New York US, pp. 325-340, XP 00011862 Netravali, et al. `A Codec for HDTV`
Primary Examiner:
Jankus; Almis R.
Attorney, Agent or Firm:
Skjerven, Morrill, MacPherson, Franklin & Friel Kwok; Edward C.
Claims
We claim:
1. A structure for encoding digitized video signals representing a series of frames of images, said digitized video signals being stored in an external memory system, said structure comprising:
a first and a second video ports, each video port being configurable to be either an input port or an output port for video signals;
a host bus interface circuit for interfacing with an external host computer;
a scratch-pad memory for storing a portion of said series of frames of images;
a processor for arithmetic and logic operations, wherein said processor computing coefficients of a discrete cosine transform of said portion of said series of frames of images, and for applying a quantization step for said coefficients to obtained quantized coefficients under a lossy compression algorithm;
a motion estimation unit for matching objects in motion between said frames of images, said motion estimation unit providing as data output motion vectors representing said motion of said objects in motion between said frames of images;
a variable-length coding unit for applying an entropy coding scheme on said quantized coefficients and said motion vectors to represent said video signals;
a global bus accessible by said first and second video port, said host bus interface, said scratch-pad memory, said processor, said motion estimation unit, and said variable-length coding unit, said global bus providing data transfer among said first and second video port, said host bus interface, said scratch-pad memory, said processor, said motion estimation unit, and said variable-length coding unit;
a processor bus having a higher bandwidth than said global bus for providing data transfer among said processor, said scratch-pad memory, and said variable-length coding unit; and
a memory controller for (a) controlling data transfers between said external memory and said structure, and (b) for controlling the uses of said global bus and said processor bus.
2. A structure as in claim 1, wherein said processor comprises:
an instruction memory for storing instructions executable by said processor;
a register file including a predetermined number of registers for storing operands;
an arithmetic and logic unit for providing arithmetic and logic operations for operands in said register file; and
a multiplication unit for performing multiplication operations among said operands and a result of said arithmetic and logic operations.
3. A structure as in claim 1, for encoding digitized video signals representing a series of frames of images, said digitized video signals being stored in an external memory system, said structure comprising:
a first and a second video ports, each video port being configurable to be either an input port or an output port for video signals;
a host bus interface circuit for interfacing with an external host computer;
a scratch-pad memory for storing a portion of said series of frames of images;
a processor for arithmetic and logic operations, wherein said processor computing coefficients of a discrete cosine transform of said portion of said series of frames of images, and for applying a quantization step for said coefficients to obtained quantized coefficients under a lossy compression algorithm;
a motion estimation unit for matching objects in motion between said frames of images, said motion estimation unit providing as data output motion vectors representing said motion of said objects in motion between said frames of images;
a variable-length coding unit for applying an entropy coding scheme on said quantized coefficients and said motion vectors to represent said video signals;
a global bus accessible by said first and second video port, said host bus interface, said scratch-pad memory, said processor, said motion estimation unit, and said variable-length coding unit, said global bus providing data transfer among said first and second video port, said host bus interface, said scratch-pad memory, said processor, said motion estimation unit, and said variable-length coding unit;
a processor bus having a higher bandwidth than said global bus for providing data transfer among said processor, said scratch-pad memory, and said variable-length coding unit; and
a memory controller for (a) controlling data transfers between said external memory and said structure, and (b) for controlling the uses of said global bus and said processor bus;
wherein said motion estimation unit comprises:
a window memory for storing a second portion of said series of frames of images, said second portion being a subset of said portion of said series of frames of images stored in said scratch-pad memory, said second portion of said series of frames of images including video data from a current frame and video data from a reference frame; and
a matcher for matching said video data from said current frame and said video data from said reference frame to evaluate a predetermined number of motion vectors.
4. A structure for encoding digitized video signals representing a series of frames of images, said digitized video signals being stored in an external memory system, said structure comprising:
a first and a second video ports, each video port being configurable to be either an input port or an output port for video signals;
a host bus interface circuit for interfacing with an external host computer;
a scratch-pad memory for storing a portion of said series of frames of images;
a processor for arithmetic and logic operations, wherein said processor computing coefficients of a discrete cosine transform of said portion of said series of frames of images, and for applying a quantization step for said coefficients to obtained quantized coefficients under a lossy compression algorithm;
a motion estimation unit for matching objects in motion between said frames of images, said motion estimation unit providing as data output motion vectors representing said motion of said objects in motion between said frames of images;
a variable-length coding unit for applying an entropy coding scheme on said quantized coefficients and said motion vectors to represent said video signals;
a global bus accessible by said first and second video port, said host bus interface, said scratch-pad memory, said processor, said motion estimation unit, and said variable-length coding unit, said global bus providing data transfer among said first and second video port, said host bus interface, said scratch-pad memory, said processor, said motion estimation unit, and said variable-length coding unit;
a processor bus having a higher bandwidth than said global bus for providing data transfer among said processor, said scratch-pad memory, and said variable-length coding unit; and
a memory controller for (a) controlling data transfers between said external memory and said structure, and (b) for controlling the uses of said global bus and said processor bus; wherein said first video port comprises a decimation filter for reducing the resolution of said video signals.
5. A system comprising a first and a second structures, each structure encoding digitized video signals representing a series of frames of images, said digitized video signals being stored in an external memory system, each structure comprising:
a first and a second video ports, each video port being configurable to be either an input port or an output port for video signals;
a host bus interface circuit for interfacing with an external host computer;
a scratch-pad memory for storing a portion of said series of frames of images;
a processor for arithmetic and logic operations, wherein said processor computing coefficients of a discrete cosine transform of said portion of said series of frames of images, and for applying a quantization step for said coefficients to obtained quantized coefficients under a lossy compression algorithm;
a motion estimation unit for matching objects in motion between said frames of images, said motion estimation unit providing as data output motion vectors representing said motion of said objects in motion between said frames of images;
a variable-length coding unit for applying an entropy coding scheme on said quantized coefficients and said motion vectors to represent said video signals;
a global bus accessible by said first and second video port, said host bus interface, said scratch-pad memory, said processor, said motion estimation unit, and said variable-length coding unit, said global bus providing data transfer among said first and second video port, said host bus interface, said scratch-pad memory, said processor, said motion estimation unit, and said variable-length coding unit;
a processor bus having a higher bandwidth than said global bus for providing data transfer among said processor, said scratch-pad memory, and said variable-length coding unit; and
a memory controller for (a) controlling data transfers between said external memory and said structure, and (b) for controlling the uses of said global bus and said processor bus;
wherein said first video port of said first structure and said first video port of said second structure are connected to receive said video signals, and said second video port of said first structure and said second video port of said second structure are connected to pass said video data between said first structure and said second structure.
6. A method for encoding digitized video signals representing a series of frames of images, said digitized video signals being stored in an external memory system, said method comprising the steps of:
providing a first and a second video ports, each video port being configurable to be either an input port or an output port for video signals;
using a host bus interface circuit to interface with an external host computer;
storing a portion of said series of frames of images in a scratch-pad memory;
providing a processor for arithmetic and logic operations, wherein said processor computing coefficients of a discrete cosine transform of said portion of said series of frames of images, and for applying a quantization step for said coefficients to obtained quantized coefficients under a lossy compression algorithm;
matching objects in motion between said frames of images using a motion estimation unit, said motion estimation unit providing as data output motion vectors representing said motion of said objects in motion between said frames of images;
applying in a variable-length coding unit an entropy coding scheme on said quantized coefficients and said motion vectors to represent said video signals;
providing a global bus accessible by said first and second video port, said host bus interface, said scratch-pad memory, said processor, said motion estimation unit, and said variable-length coding unit, said global bus providing data transfer among said first and second video port, said host bus interface, said scratch-pad memory, said processor, said motion estimation unit, and said variable-length coding unit;
providing a processor bus having a higher bandwidth than said global bus for providing data transfer among said processor, said scratch-pad memory, and said variable-length coding unit; and
providing a memory controller for (a) controlling data transfers between said external memory and said structure, and (b) for controlling the uses of said global bus and said processor bus.
7. A method as in claim 6, wherein said step of providing a processor comprises the steps of:
storing instructions executable by said processor in an instruction memory;
storing operands in a register file including a predetermined number of registers;
providing arithmetic and logic operations in an arithmetic and logic unit for operands in said register file; and
performing multiplication operations among said operands in a multiplication unit and a result of said arithmetic and logic operations.
8. A method for encoding digitized video signals representing a series of frames of images, said digitized video signals being stored in an external memory system, said method comprising the steps of:
providing a first and a second video ports, each video port being configurable to be either an input port or an output port for video signals;
using a host bus interface circuit to interface with an external host computer;
storing a portion of said series of frames of images in a scratch-pad memory;
providing a processor for arithmetic and logic operations, wherein said processor computing coefficients of a discrete cosine transform of said portion of said series of frames of images, and for applying a quantization step for said coefficients to obtained quantized coefficients under a lossy compression algorithm;
matching objects in motion between said frames of images using a motion estimation unit, said motion estimation unit providing as data output motion vectors representing said motion of said objects in motion between said frames of images;
applying in a variable-length coding unit an entropy coding scheme on said quantized coefficients and said motion vectors to represent said video signals;
providing a global bus accessible by said first and second video port, said host bus interface, said scratch-pad memory, said processor, said motion estimation unit, and said variable-length coding unit, said global bus providing data transfer among said first and second video port, said host bus interface, said scratch-pad memory, said processor, said motion estimation unit, and said variable-length coding unit;
providing a processor bus having a higher bandwidth than said global bus for providing data transfer among said processor, said scratch-pad memory, and said variable-length coding unit; and
providing a memory controller for (a) controlling data transfers between said external memory and said structure, and (b) for controlling the uses of said global bus and said processor bus;
wherein said matching step comprises the steps of:
storing in a window memory a second portion of said series of frames of images, said second portion being a subset of said portion of said series of frames of images stored in said scratch-pad memory, said second portion of said series of frames of images including video data from a current frame and video data from a reference frame; and
matching in a matcher said video data from said current frame and said video data from said reference frame to evaluate a predetermined number of motion vectors.
9. A method for encoding digitized video signals representing a series of frames of images, said digitized video signals being stored in an external memory system, said method comprising the steps of:
providing a first and a second video ports, each video port being configurable to be either an input port or an output port for video signals;
using a host bus interface circuit to interface with an external host computer;
storing a portion of said series of frames of images in a scratch-pad memory;
providing a processor for arithmetic and logic operations, wherein said processor computing coefficients of a discrete cosine transform of said portion of said series of frames of images, and for applying a quantization step for said coefficients to obtained quantized coefficients under a lossy compression algorithm;
matching objects in motion between said frames of images using a motion estimation unit, said motion estimation unit providing as data output motion vectors representing said motion of said objects in motion between said frames of images;
applying in a variable-length coding unit an entropy coding scheme on said quantized coefficients and said motion vectors to represent said video signals;
providing a global bus accessible by said first and second video port, said host bus interface, said scratch-pad memory, said processor, said motion estimation unit, and said variable-length coding unit, said global bus providing data transfer among said first and second video port, said host bus interface, said scratch-pad memory, said processor, said motion estimation unit, and said variable-length coding unit;
providing a processor bus having a higher bandwidth than said global bus for providing data transfer among said processor, said scratch-pad memory, and said variable-length coding unit; and
providing a memory controller for (a) controlling data transfers between said external memory and said structure, and (b) for controlling the uses of said global bus and said processor bus;
wherein said step of providing provides a first video port comprising a decimation filter for reducing the resolution of said video signals.
Description
BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention relates to integrated circuit designs; and, in particular, the present invention relates to integrated circuit designs for image processing.
2. Discussion of the Related Art
The Motion Picture Experts Group (MPEG) is an international committee charged with providing a standard (hereinbelow "MPEG standard") for achieving compatibility between image compression and decompression equipment. This standard specifies both the coded digital representation of video signal for the storage media, and the method for decoding. The representation supports normal speed playback, as well as other playback modes of color motion pictures, and reproduction of still pictures. The MPEG standard covers the common 525- and 625-line television, personal computer and workstation display formats. The MPEG standard is intended for equipment supporting continuous transfer rate of up to 1.5 Mbits per second, such as compact disks, digital audio tapes, or magnetic hard disks. The MPEG standard is intended to support picture frames of approximately 288×352 pixels each at a rate between 24 Hz and 30 Hz. A publication by MPEG entitled "Coding for Moving Pictures and Associated Audio for digital storage medium at 1.5 Mbit/s," included herein as Appendix A, provides in draft form the proposed MPEG standard, which is hereby incorporated by reference in its entirety to provide detailed information about the MPEG standard.
Under the MPEG standard, the picture is divided into a matrix of "Macroblock slices" (MBS), each MBS containing a number of picture areas (called "macroblocks") each covering an area of 16×16 pixels. Each of these picture areas is further represented by one or more 8×8 matrices whose elements are the spatial luminance and chrominance values. In one representation (4:2:2) of the macroblock, a luminance value (Y type) is provided for every pixel in the 16×16-pixel picture area (i.e. in four 8×8 "Y" matrices), and chrominance values of the U and V (i.e., blue and red chrominance) types, each covering the same 16×16 picture area, are respectively provided in two 8×8 "U" and two 8×8 "V" matrices. That is, each 8×8 U or V matrix has a lower resolution than its luminance counterpart and covers an area of 8×16 pixels. In another representation (4:2:0), a luminance value is provided for every pixel in the 16×16-pixel picture area, and one 8×8 matrix for each of the U and V types is provided to represent the chrominance values of the 16×16-pixel picture area. A group of four contiguous pixels in a 2×2 configuration is called a "quad pixel"; hence, the macroblock can also be thought of as comprising 64 quad pixels in an 8×8 configuration.
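The block counts described above can be tallied in a short sketch. This is only an illustration of the arithmetic in the text; the names `macroblock_layout` and `quad_pixels_per_macroblock` are hypothetical and do not come from the specification.

```python
# Hypothetical helpers (not from the specification): tally the 8x8 matrices
# that make up one 16x16 macroblock in the two representations described.
BLOCK = 8          # each matrix holds 8x8 samples
MACROBLOCK = 16    # a macroblock covers 16x16 pixels

# Number of 8x8 matrices per chrominance type (U or V), as stated in the text.
CHROMA_BLOCKS = {"4:2:2": 2, "4:2:0": 1}


def macroblock_layout(fmt):
    """Return (Y blocks, U blocks, V blocks, pixels covered per chroma sample)."""
    y_blocks = (MACROBLOCK // BLOCK) ** 2        # four 8x8 "Y" matrices
    c_blocks = CHROMA_BLOCKS[fmt]
    pixels = MACROBLOCK * MACROBLOCK             # 256 pixels per macroblock
    chroma_samples = c_blocks * BLOCK * BLOCK    # samples per chroma type
    return y_blocks, c_blocks, c_blocks, pixels // chroma_samples


def quad_pixels_per_macroblock():
    """A quad pixel is a 2x2 group, so a macroblock holds 8x8 of them."""
    return (MACROBLOCK // 2) ** 2
```

Under these assumptions, the 4:2:2 representation carries one U and one V sample for every two pixels, and the 4:2:0 representation carries one for every four.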
The MPEG standard adopts a model of compression and decompression based on lossy compression of both interframe and intraframe information. To compress interframe information, each frame is encoded in one of the following formats: "intra", "predicted", or "interpolated". Intra encoded frames are least frequently provided, the predicted frames are provided more frequently than the intra frames, and all the remaining frames are interpolated frames. In a prediction frame ("P-picture"), only the incremental changes in pixel values from the last I-picture or P-picture are coded. In an interpolation frame ("B-picture"), the pixel values are encoded with respect to both an earlier frame and a later frame. By encoding frames incrementally, using predicted and interpolated frames, the redundancy between frames can be eliminated, resulting in a high efficiency in data storage. Under the MPEG standard, the motion of an object moving from one screen position to another screen position can be represented by motion vectors. A motion vector provides a shorthand for encoding a spatial translation of a group of pixels, typically a macroblock.
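As a rough illustration of how a motion vector for a group of pixels might be found, the following sketch performs an exhaustive block search. It assumes a sum-of-absolute-differences cost (absolute differences are mentioned in the Abstract) and a sign convention in which the vector points from the block's position in the current frame to its best match in the reference frame. The function names, the search range parameter, and the full-search strategy are illustrative assumptions, not the matcher of the present invention.

```python
def sad(cur, ref, bx, by, dx, dy, n):
    """Sum of absolute differences between an n x n block of `cur` at (bx, by)
    and the block of `ref` displaced by (dx, dy)."""
    total = 0
    for y in range(n):
        for x in range(n):
            total += abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
    return total


def best_motion_vector(cur, ref, bx, by, n, search):
    """Try every displacement within +/- `search` pixels (assumed full search)
    and return (dx, dy, cost) for the lowest-cost candidate."""
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            # Skip candidates that would read outside the reference frame.
            if by + dy < 0 or bx + dx < 0:
                continue
            if by + dy + n > height or bx + dx + n > width:
                continue
            cost = sad(cur, ref, bx, by, dx, dy, n)
            if best is None or cost < best[0]:
                best = (cost, dx, dy)
    return best[1], best[2], best[0]
```

A cost of zero means the displaced reference block reproduces the current block exactly, so only the vector needs to be coded for that block.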
The next steps in compression under the MPEG standard provide lossy compression of intraframe information. In the first step, a 2-dimensional discrete cosine transform (DCT) is performed on each of the 8×8 pixel matrices to map the spatial luminance or chrominance values into the frequency domain.
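A minimal sketch of a 2-dimensional DCT on an 8×8 matrix follows, computed as a pass over the rows followed by a pass over the columns. It assumes the orthonormal DCT-II scaling and uses floating point for clarity; the names `dct_1d` and `dct_2d` are hypothetical, and a hardware implementation would use a fast fixed-point algorithm instead.

```python
import math

N = 8  # matrices are 8x8


def dct_1d(v):
    """Orthonormal DCT-II of a length-8 sequence (scaling is an assumption)."""
    out = []
    for k in range(N):
        scale = math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)
        acc = sum(v[n] * math.cos(math.pi * (2 * n + 1) * k / (2 * N))
                  for n in range(N))
        out.append(scale * acc)
    return out


def dct_2d(block):
    """Apply the 1-D transform to every row, then to every column."""
    rows = [dct_1d(r) for r in block]
    cols = [dct_1d([rows[y][x] for y in range(N)]) for x in range(N)]
    # cols[x][y] holds coefficient (y, x); transpose back to row-major order.
    return [[cols[x][y] for x in range(N)] for y in range(N)]
```

For a matrix of identical pixel values, only the element at position (0, 0) of the result is non-zero, which is the "DC" value referred to below.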
Next, a process called "quantization" weights each element of the 8×8 transformed matrix, consisting of 1 "DC" value and sixty-three "AC" values, according to whether the pixel matrix is of the chrominance or the luminance type, and the frequency represented by each element of the transformed matrix. In an I-picture, the quantization weights are intended to reduce to zero many high frequency components to which the human eye is not sensitive. In P- and B-pictures, which contain mostly higher frequency components, the weights are not related to visual perception. Having created many zero elements in the 8×8 transformed matrix, each matrix can be represented without further information loss as an ordered list consisting of the "DC" value, and alternating pairs of a non-zero "AC" value and a length of zero elements following the non-zero value. The values on the list are ordered such that the elements of the matrix are presented as if the matrix is read in a zig-zag manner (i.e., the elements of a matrix A are read in the order A00, A01, A10, A02, A11, A20 etc.). This representation is space efficient because zero elements are not represented individually.
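The quantization and ordered-list steps can be sketched as follows. The sketch assumes the coefficients have already been placed in scan order, that quantization divides each coefficient by its weight and rounds to the nearest integer, and that any zeros immediately following the "DC" value are recorded as a run attached to it. The weights shown in any usage and the names `quantize`, `run_length` and `expand` are hypothetical.

```python
def quantize(scanned_coeffs, weights):
    """Divide each coefficient by its weight and round to the nearest integer
    (Python's round(); the rounding rule is an assumption)."""
    return [int(round(c / w)) for c, w in zip(scanned_coeffs, weights)]


def run_length(scanned):
    """Encode a scanned list as (value, zeros_following) pairs.

    The first pair carries the DC value; every later pair carries a
    non-zero AC value and the number of zero elements that follow it.
    """
    pairs = []
    value, run = scanned[0], 0
    for v in scanned[1:]:
        if v == 0:
            run += 1
        else:
            pairs.append((value, run))
            value, run = v, 0
    pairs.append((value, run))
    return pairs


def expand(pairs):
    """Inverse of run_length: rebuild the scanned list with no loss."""
    out = []
    for value, run in pairs:
        out.append(value)
        out.extend([0] * run)
    return out
```

Because `expand(run_length(x))` returns `x` unchanged, the only information loss in this stage comes from the rounding in `quantize`.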
Finally, an entropy encoding scheme is used to further compress, using variable-length codes, the representations of the DC coefficient and the AC value-run length pairs. Under the entropy encoding scheme, the more frequently occurring symbols are represented by shorter codes. Further efficiency in storage is thereby achieved.
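The principle that more frequent symbols receive shorter codes can be illustrated with a Huffman code built from observed symbol counts. This is a swapped-in textbook construction offered only as an illustration; it is not necessarily the set of variable-length code tables used under the MPEG standard, and the name `huffman_code` is hypothetical.

```python
import heapq
from collections import Counter


def huffman_code(symbols):
    """Build a prefix-free code (symbol -> bit string) from symbol frequencies."""
    freq = Counter(symbols)
    if len(freq) == 1:
        # A single distinct symbol still needs one bit.
        return {next(iter(freq)): "0"}
    # Heap entries: (count, unique tie-breaker, {symbol: partial code}).
    heap = [(count, i, {sym: ""}) for i, (sym, count) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        c0, _, t0 = heapq.heappop(heap)   # two least frequent subtrees
        c1, _, t1 = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in t0.items()}
        merged.update({s: "1" + code for s, code in t1.items()})
        heapq.heappush(heap, (c0 + c1, tie, merged))
        tie += 1
    return heap[0][2]
```

In an encoder of the kind described, the symbols would be the DC coefficients and the AC value-run length pairs; any hashable Python value can serve as a symbol in this sketch.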
The steps involved in compression under the MPEG standard are computationally intensive. For such a compression scheme to be practical and widely accepted, however, a high speed processor at an economical cost is desired. Such a processor is preferably provided in an integrated circuit.
Other standards for image processing exist. These standards include JPEG ("Joint Photographic Expert Group") and CCITT H.261 (also known as "P×64"). These standards are available from the respective committees, which are international bodies well-known to those skilled in the art.
SUMMARY OF THE INVENTION
In accordance with the present invention, a structure and a method for encoding digitized video signals are provided. In one embodiment, the video signals are stored in an external memory system, and the present embodiment provides (a) two video ports each configurable to become either an input port or an output port for video signals; (b) a host bus interface circuit for interfacing with an external host computer; (c) a scratch-pad memory for storing a portion of the video image; (d) a processor for arithmetic and logic operations, which computes discrete cosine transforms and quantization on the video signals to obtain coefficients for compression under a lossy compression algorithm; (e) a motion estimation unit for matching objects in motion between frames of images of the video signals, and outputting motion vectors representing the motion of objects between frames; and (f) a variable-length coding unit for applying an entropy coding scheme on the quantized coefficients and motion vectors.
In one embodiment, a global bus is provided to be accessed by the video ports, the host bus interface, the scratch-pad memory, the processor, the motion estimation unit, and the variable-length coding unit. The global bus provides data transfer among the functional units. In addition, in that embodiment, a processor bus having a higher bandwidth than the global bus is provided to allow higher-bandwidth data transfer among the processor, the scratch-pad memory, and the variable-length coding unit. A memory controller controls data transfers to and from the external memory while at the same time arbitrating the uses of the global bus and the processor bus.
Multiple copies of the structure of the present invention can be provided to form a multiprocessor of video signals. Under such configuration, one of the video ports in each structure would be used to receive the incoming video signal, and the other video port would be used for communication between the structure and one or more of its neighboring structures.
In accordance with another aspect of the present invention, one of the two video ports in one embodiment comprises a decimation filter for reducing the resolution of incoming video signals. In one embodiment, one of the video ports includes an interpolator for restoring the reduced resolution video into a higher resolution upon video signal output.
In accordance with another aspect of the present invention, a memory with a novel address mechanism is provided to sort video signals arriving at the structure of the present invention in pixel interleaved order into several regions of the memory, such that the data in the several regions of this memory can be read in block interleaved order, which is used in subsequent signal processing steps used under various video processing standards, including MPEG.
In accordance with another aspect of the present invention, a synchronizer circuit synchronizes the system clock of one embodiment with an external video clock to which the incoming video signals are synchronized. The synchronization circuit provides for accurate detection of an edge transition in the external clock within a time period which is comparable with a flip-flop's metastable period, without requiring an extension of the system clock period.
In one embodiment of the present invention, a "corner turn" memory is provided. In this corner-turn memory, a selected region is mapped to two sets of addresses. Using an address in the first set of addresses, a row of memory cells is accessed. Using an address in the second set of addresses, a column of memory cells is accessed. The corner-turn memory is particularly useful for DCT and IDCT operations where each macroblock of pixels is accessed in two passes, one pass in column order, and the other pass in row order.
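The two-address-set behavior described above can be modeled in software as follows. The address map used here (addresses 0 to n-1 select a row, addresses n to 2n-1 select a column) and the class name `CornerTurnMemory` are assumptions made for illustration; the actual address assignment is not given in this passage.

```python
class CornerTurnMemory:
    """n x n array of cells reachable through two sets of addresses."""

    def __init__(self, n=8):
        self.n = n
        self.cells = [[0] * n for _ in range(n)]

    def _decode(self, address):
        # Assumed map: 0..n-1 select a row, n..2n-1 select a column.
        if not 0 <= address < 2 * self.n:
            raise IndexError("address outside the corner-turn region")
        return address >= self.n, address % self.n

    def write(self, address, values):
        """Store n values along the row or column selected by `address`."""
        column_mode, index = self._decode(address)
        if len(values) != self.n:
            raise ValueError("expected exactly n values")
        for i, v in enumerate(values):
            if column_mode:
                self.cells[i][index] = v
            else:
                self.cells[index][i] = v

    def read(self, address):
        """Return the n values of the row or column selected by `address`."""
        column_mode, index = self._decode(address)
        if column_mode:
            return [self.cells[r][index] for r in range(self.n)]
        return list(self.cells[index])
```

Writing the results of a row pass through the first address set and reading them back through the second yields the data in column order without a separate transpose step.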
In accordance with another aspect of the present invention, a scratch pad memory having a width four times the data path of the processor is provided. In addition, two sets of buffer registers, each set including registers of the width of the data path, are provided as buffers between the processor and the scratch pad memory. The buffer registers operate at the clock rate of the processor, while the scratch pad memory can operate at a lower clock rate. In this manner, the bandwidths of the processor and the scratch pad memory are matched without the use of expensive memory circuitry. Each set of buffer registers is either loaded from, or stored into, the scratch pad as one register having the width of the scratch pad memory, but accessed by the processor individually as registers having the width of the data path. In one set of the buffer registers, each register is provided with two addresses. Using one address, the four data words (each having the width of the data path) are stored into the register in the order presented. Using the other address, prior to storing into the buffer register, a transpose is performed on the four halfwords of the higher order two data words. A similar transpose is performed on the four halfwords of the lower order two data words. The latter mode, together with the corner turn memory, allows pixels of a macroblock to be read from, or stored into, the scratch pad memory either in row order or in column order.
In accordance with another aspect of the present invention, the pixels of a macroblock are stored in one of two arrangements in the external dynamic random access memory. Under one arrangement, called the "scan-line" mode, four horizontally adjacent pixels are accessed at a time. Under the other arrangement, which is suitable for fetching reference pixels in motion estimation, pixels are fetched in tiles (4 by 4 pixels) in column order. A novel address generation scheme is provided to access the memory either for scan-line elements or for quad pels. Since most filtering involves quad pels (2.times.2 pixels), the quad pel mode arrangement is efficient in access time and storage, and avoids rearrangement and complex address decoding.
In accordance with another aspect of the present invention, the operand input terminals of the arithmetic and logic unit in the processor are provided with a set of "byte multiplexors" for rearranging the four 9-bit bytes in each operand in any order. Each 9-bit byte can be used to store the value of a pixel, so that the arithmetic and logic unit can operate simultaneously on the pixels of a quad pel stored in a 36-bit operand. Because the byte multiplexors allow rearranging the relative positions of the pixels within the 36-bit operands, numerous filtering operations can be achieved by simply setting the correct pixel configuration. In one embodiment, in accordance with the present invention, filters for performing pixel offsets and decimations, in either the horizontal or the vertical direction, or both, are provided using the byte multiplexors. In addition, the present invention provides higher compression ratios, using novel functions for (a) activities analysis, used in applying adaptive control of quantization, and (b) scene analysis, used in reduction of interframe redundancy.
In accordance with another aspect of the present invention, a fast detector of a zero result in an adder is provided. The fast zero detector includes a number of "zero generator" circuits and a number of "zero propagator" circuits. The fast detector signals the presence of a zero result in a time that grows logarithmically, rather than linearly, with the length of the adder's operands.
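By way of illustration only, a zero sum can be detected without waiting for a carry to ripple through the adder. The sketch below uses a well-known carry-free formulation and is not necessarily the generator and propagator logic of the circuit described here; the function name `sum_is_zero` and its parameters are hypothetical. Each bit position is checked using only its own operand bits and those of its lower neighbor, so in hardware the per-bit results could be combined by an AND tree whose depth is logarithmic in the operand width.

```python
def sum_is_zero(a, b, width=32, carry_in=0):
    """Return True when (a + b + carry_in) is zero modulo 2**width."""
    mask = (1 << width) - 1
    a &= mask
    b &= mask
    # For sum bit i to be 0, the carry into bit i must equal a_i XOR b_i.
    required_carry = a ^ b
    # If sum bit i-1 is 0, the carry out of bit i-1 equals a_(i-1) OR b_(i-1).
    # The carry into bit 0 is simply carry_in.
    implied_carry = (((a | b) << 1) | carry_in) & mask
    # All bit positions agree exactly when the whole sum is zero.
    return required_carry == implied_carry
```

The comparison on the last line stands in for the reduction tree: every bit of `required_carry ^ implied_carry` is available after a constant number of gate delays.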
In accordance with another aspect of the present invention, a structure and a method are provided for a non-linear "alpha" filter. Under this non-linear filter, thresholds T.sub.1 and T.sub.2 are set by the two parameters m and n. If the absolute difference between the two input values of the non-linear filter is less than T.sub.1 or greater than T.sub.2, a fixed relative weight is accorded the input values; otherwise, a relative weight proportional to the absolute difference is accorded the input values. This non-linear filter finds numerous applications in signal processing. In one embodiment, the non-linear filter is used in deinterlacing and temporal noise reduction applications.
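By way of illustration only, the piecewise behavior of such a filter can be sketched as follows. This passage does not state how m and n determine T.sub.1 and T.sub.2, nor the values of the fixed weights, so the sketch takes the thresholds directly as arguments and assumes fixed weights of 0 and 1 and a slope of 1/T.sub.2 in the proportional region. The names `alpha_for` and `alpha_filter` are hypothetical.

```python
def alpha_for(diff, t1, t2, alpha_low=0.0, alpha_high=1.0):
    """Weight for a given input difference (all constants are assumptions)."""
    if not 0 <= t1 <= t2 or t2 == 0:
        raise ValueError("expected 0 <= t1 <= t2 and t2 > 0")
    d = abs(diff)
    if d < t1:
        return alpha_low          # fixed weight below the first threshold
    if d > t2:
        return alpha_high         # fixed weight above the second threshold
    # Between the thresholds the weight is proportional to the difference.
    return alpha_high * d / t2


def alpha_filter(x, y, t1, t2):
    """Mix two input values using a difference-dependent weight."""
    a = alpha_for(x - y, t1, t2)
    return a * x + (1 - a) * y
```

With these assumed constants, small differences are resolved entirely in favor of `y`, large differences entirely in favor of `x`, and intermediate differences are blended.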
In accordance with another aspect of the present invention, a structure for performing motion estimation is provided, including: (a) a memory for storing said macroblocks of a current frame and macroblocks of a reference frame; (b) a filter receiving a first group of pixels from the memory for resampling; and (c) a matcher receiving the resampled first group of pixels and a second group of pixels from a current macroblock, for evaluation of a number of motion vectors. The matcher provides a score representing the difference between the second group of pixels and the first group of pixels for each of the motion vectors evaluated. In this embodiment, the best score over a macroblock is selected as the motion vector for the macroblock. In one embodiment, the matcher evaluates 8 motion vectors at a time using a 2.times.8 "slice" of current pixels and a 4.times.12 pixel reference area.
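By way of illustration only, the scoring of candidate motion vectors can be sketched in software. The sketch assumes that the score is a sum of absolute differences and that the lowest score is the best match; it does not model the resampling filter or the slice-by-slice evaluation of the matcher. The function names and the list-of-lists pixel layout are hypothetical.

```python
def sad(current, reference, dx, dy):
    """Sum of absolute differences between a current block and the
    reference area displaced by (dx, dy)."""
    return sum(
        abs(current[y][x] - reference[y + dy][x + dx])
        for y in range(len(current))
        for x in range(len(current[0]))
    )


def best_motion_vector(current, reference, candidates):
    """Return the candidate (dx, dy) with the lowest score."""
    return min(candidates, key=lambda mv: sad(current, reference, *mv))
```

Each candidate must keep the displaced block inside the reference area; the caller is responsible for supplying only such candidates.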
In accordance with another aspect of the present invention, a structure is provided for encoding by motion vectors a current frame of video data, using a reference frame of video data. The structure includes a memory circuit for storing (a) adjacent current macroblocks from a row j of current macroblocks, designated C.sub.j,p, C.sub.j,p+1, . . . , C.sub.j,p+n-1 in the order along one direction of the row of macroblocks; and (b) adjacent reference macroblocks from a first column i of reference macroblocks, designated R.sub.q,i, R.sub.q+1,i, . . . , R.sub.q+m-1,i and a second column C.sub.j+1,p C.sub.p+1,p+1, . . . , C.sub.j+1,p+n+1. The adjacent reference macroblocks are reference macroblocks within the range of the motion vectors, with each of said current macroblocks being substantially equidistant from the R.sub.q,i and R.sub.q+m-1,i reference macroblocks. The structure of the present invention evaluates each of the adjacent current macroblocks against each of the adjacent reference macroblocks under the motion vectors, so as to select a motion vector representing the best match between each of said current macroblocks and a corresponding one of said reference macroblocks. When evaluation of the current macroblock against the set of reference frame macroblocks in the memory circuit is completed, the current macroblock C.sub.j,p is removed from the memory circuit and replaced by a current macroblock C.sub.j,p+n, said current macroblock C.sub.j,p+n being the current macroblock adjacent said macroblock C.sub.j,p+n-1. At the same time, the column of adjacent reference macroblocks R.sub.q,i, R.sub.q+1,i, . . . , R.sub.q+m-1,i is removed from the memory circuit and replaced by the next column of adjacent reference macroblocks R.sub.q,i+1, R.sub.q+1,i+1, . . . , R.sub.q+m-1,i+1.
In this manner, each current macroblock, while in memory, is evaluated against the largest number of reference macroblocks which can be held in the memory circuit, thereby minimizing the number of times current and reference macroblocks have to be loaded into memory. Of course, the terms "rows" and "columns" are used purely for convenience to describe the relationship between current and reference macroblocks. It is understood that a column of current macroblocks can be evaluated against a row of reference macroblocks, within the scope of the present invention.
In accordance with the present invention, the control structure for controlling evaluation of motion vectors is provided by a counter which includes first and second fields representing, respectively, the current macroblock and the reference macroblock being evaluated. Under the controlling scheme of one embodiment, each of the first and second fields is individually counted, such that when the first field reaches a maximum, a carry is generated to increment the count in the second field. The numbers of counts in the first and second fields are, respectively, the numbers of current and reference macroblocks. In this manner, each current macroblock is evaluated completely against the reference macroblocks in the memory circuit.
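By way of illustration only, the two-field counting scheme can be sketched as a generator that yields every pairing of a current macroblock with a reference macroblock. This is a simplified model under the assumption that both fields start at zero; the name `state_counter` and its parameters are hypothetical.

```python
def state_counter(num_current, num_reference):
    """Yield (current_index, reference_index) pairs.

    The first field counts current macroblocks; when it reaches its
    maximum it wraps to zero and carries into the second field, which
    counts reference macroblocks.
    """
    if num_current < 1 or num_reference < 1:
        raise ValueError("both fields need at least one count")
    cur = ref = 0
    while True:
        yield cur, ref
        cur += 1
        if cur == num_current:      # first field at maximum: wrap and carry
            cur = 0
            ref += 1
            if ref == num_reference:
                return              # every pairing has been evaluated
```

For example, `list(state_counter(2, 3))` visits both current macroblocks against each of the three reference macroblocks in turn.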
In accordance with another aspect of the present invention, an adaptive thresholding circuit is provided in the zero-packing circuit prior to entropy encoding of the DCT coefficients into variable length code. In this adaptive thresholding circuit, a current DCT coefficient is set to zero if the immediately preceding and the immediately following DCT coefficients are both zero, and the current DCT coefficient is less than a programmable threshold. This thresholding circuit allows an even higher compression ratio by extending a zero runlength.
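By way of illustration only, the thresholding rule can be sketched as follows. The sketch assumes that the comparison is made on the magnitude of the coefficient and that the neighbor tests use the coefficients as received, before any of them has been zeroed; the function name `adaptive_threshold` is hypothetical.

```python
def adaptive_threshold(coeffs, threshold):
    """Zero any isolated small coefficient that sits between two zeros.

    `coeffs` is a sequence of DCT coefficients in the order in which
    they are presented for run-length coding.
    """
    out = list(coeffs)
    for i in range(1, len(coeffs) - 1):
        isolated = coeffs[i - 1] == 0 and coeffs[i + 1] == 0
        if isolated and abs(coeffs[i]) < threshold:
            out[i] = 0  # merges the zero runs on either side
    return out
```

With a threshold of 2, the input `[12, 0, 1, 0, 0, 5, 0, -1, 0]` becomes `[12, 0, 0, 0, 0, 5, 0, 0, 0]`: the isolated values 1 and -1 are removed, while the value 5 is retained.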
The present invention is better understood upon consideration of the detailed description below and the accompanying drawings.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1a is a block diagram of an embodiment of the present invention provided in an MPEG encoder chip 100.
FIG. 1b shows a multi-chip configuration in which two copies of chip 100, chips 100a and 100b, are used.
FIG. 1c is a map of chip 100's address space.
FIG. 2 is a block diagram of video port 107 of chip 100 shown in FIG. 1.
FIG. 3a shows a synchronization circuit 300 for synchronizing video data arrival at port 107 with an external video source, which provides video at 13.5 Mhz under 16-bit mode, and 27 Mhz under 8-bit mode.
FIG. 3b shows the times at which the samples of video clock signal Vclk indicated in FIG. 3a are obtained.
FIG. 4a is a timing diagram of video port 107 for latching video data provided at 13.5 Mhz on video bus 190a under 16-bit mode.
FIG. 4b is a timing diagram of video port 107 for latching video data provided at 27 Mhz on video bus 190a under 8-bit mode.
FIG. 5a shows the sequence in which 4:2:2 video data arrives at port 107.
FIG. 5b is a block diagram of decimator 204 of video port 107.
FIG. 5c is a table showing, at each phase of the CIF decimation, the data output R.sub.out of register 201, the operand inputs A.sub.in and B.sub.in of 14-bit adder 504, the carry-in input C.sub.in, and the data output Dec of decimator 204.
FIG. 5d is a table showing, at each phase of the CCR 601 decimation, the data output R.sub.out of register 201, the operand inputs A.sub.in and B.sub.in of 14-bit adder 504, the carry-in input C.sub.in, and the data output Dec of decimator 204.
FIG. 6a is a block diagram of interpolator 206.
FIG. 6b is an address map of video FIFO 205, showing the partition of video FIFO 205 into Y region 651, U region 652 and V region 653, and the storage locations of data in a data stream 654 received from decimator 204.
FIG. 6c illustrates the generation of addresses for accessing video FIFO 205 from the contents of address counter 207, during YUV separation, or during video output.
FIG. 6d illustrates the sequence in which stored and interpolated luminance and chrominance pixels are output under interpolation mode.
FIG. 6e shows two block interleaved groups 630 and 631 in video FIFO 205.
FIG. 7a is an overview of data flow between memory blocks relating to CPU 150.
FIG. 7b illustrates in further detail the data flow between P memory 702, QMEM 701, registers R0-R23, and scratch memory 159.
FIG. 7c shows the mappings of registers P4-P7 into the four physical registers corresponding to registers P0-P3.
FIG. 7d shows the mappings between direct and alias addresses of the higher 64 36-bit locations in SMEM 159.
FIG. 8a is a block diagram of memory controller 104, in accordance with the present invention.
FIG. 8b shows a bit assignment diagram for the channel memory entries of channel 1.
FIG. 8c shows a bit assignment diagram for the channel memory entries of channels 0, and 3-7.
FIG. 8d shows a bit assignment diagram for the channel memory entry of channel 2.
FIG. 9a shows chip 100 interfaced with an external 4-bank memory system 103 in a configuration 900.
FIG. 9b is a timing diagram for an interleaved access under "reference" mode of the memory system of configuration 900.
FIG. 9c is a timing diagram for an interleaved access under "scan-line" mode of the memory system of configuration 900.
FIGS. 10a and 10b show pixel arrangements 1000a and 1000b, which are respectively provided to support scan-line mode operation and reference frame fetching during motion estimation.
FIG. 10c shows the logical addresses for scan-line mode access.
FIG. 10d shows the logical addresses for reference frame fetching.
FIG. 10e shows a reference frame fetch in which the reference frame crosses a memory page boundary.
FIGS. 11a and 11b are timing diagrams showing respectively data transfers between external memory 103 and SMEM 159 via QG register 810.
FIG. 12 illustrates the pipeline stages of CPU 150.
FIG. 13a shows a 32-bit zero-lookahead circuit 1300, comprising 32 generator circuits 1301 and propagator circuits.
FIG. 13b shows the logic circuits for generator circuit 1301 and propagator circuit 1302.
FIGS. 14a and 14b show schematically the byte multiplexors 1451 and 1452 of ALU 156.
FIG. 15a is a block diagram of arithmetic unit 750.
FIG. 15b is a schematic diagram of MAC 158.
FIG. 15c(i) illustrates an example of "alpha filtering" in the mixing filter for combining chroma during a deinterlacing operation.
FIG. 15c(ii) is a block diagram of a circuit 1550 for computing the value of alpha.
FIGS. 15c-3 to 15c-6 show the values of alpha obtainable from the various values of parameters m and n.
FIGS. 15d(i)-15d(iv) illustrate instructions using the byte multiplexors of arithmetic unit 750, using one mode selected from each of the HOFF, VOFF, HSHRINK and VSHRINK instructions, respectively.
FIG. 15e shows the pixels involved in computing activities of quad pels A and B as input to a STAT1 or STAT2 instruction.
FIG. 15f shows a macroblock of luminance data for which a measure of activity is computed using repeated calls to a STAT1 or a STAT2 instruction.
FIGS. 16a and 16b are respectively a block diagram and a data and control flow diagram of motion estimator 111.
FIG. 16c is a block diagram of window memory 705, showing odd and even banks 705a and 705b.
FIG. 16d shows how, in the present invention, vertical half-tiles of a macroblock are stored in odd and even memory banks of window memory 705.
FIG. 17 illustrates a 2-stage motion estimation algorithm which can be executed by motion estimator 111.
FIGS. 18a and 18b show, with respect to reference macroblocks, a decimated current macroblock and the range of a motion vector having an origin at the upper right corner of the current macroblock for the first stage of a B frame motion estimation and a P frame motion estimation respectively.
FIG. 18c shows, with respect to reference macroblocks, a full resolution current macroblock and the range of a motion vector having an origin at the upper right corner of the current macroblock for the second stage of motion estimation in both P-frame and B-frame motion estimations.
FIG. 18d shows the respective locations of current and reference macroblocks in the first stage of a B frame motion estimation.
FIG. 18e shows the respective locations of current and reference macroblocks in the first stage of a P frame motion estimation.
FIG. 18f shows both a 4.times.4 tile current macroblock 1840 and a 5.times.5 tile reference region 1841 in the second stage of motion estimation.
FIG. 18g shows the fields of a state counter 1890 having programmable fields for control of motion estimation.
FIG. 18h shows the four possibilities by which a patch of motion vectors crosses a reference frame boundary.
FIG. 18i shows the twelve possible ways the reference frame boundary can intersect the reference and current macroblocks in window memory 705 under the first stage motion estimation for B-frames.
FIG. 18j shows, for each of the twelve cases shown in FIG. 18i, the INIT and WRAP values for each of the fields in state counter 1890.
FIG. 18k shows the twenty possible ways the reference frame boundary can intersect the current and reference macroblocks in window memory 705.
FIG. 18l shows, for each of the twenty cases shown in FIG. 18k, the corresponding INIT and WRAP values for each of the fields of state counter 1890.
FIGS. 18m-1 to 18m-3 show the clipping of motion estimation with respect to the reference frame boundary for either the second stage of a 2-stage motion estimation, or the third stage of a 3-stage motion estimation.
FIG. 18n provides the INIT and WRAP values for state counter 1890 corresponding to the reference frame boundary clipping shown in FIGS. 18m-1 and 18m-2.
FIG. 19a illustrates the algorithm used in matcher 1608 for evaluating eight motion vectors over eight cycles.
FIG. 19b shows the locations of the "patch" of eight motion vectors evaluated for each slice of current pixels.
FIG. 19c shows the structure of matcher 1608.
FIG. 19d shows the pipeline in the motion estimator 111 formed by the registers in subpel filter 1606.
FIGS. 20a and 20b together form a block diagram of VLC 109.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
1. Overview
FIG. 1a is a block diagram of an embodiment of the present invention provided in an encoder/decoder integrated circuit 100 ("chip 100"). In this embodiment, chip 100 encodes or decodes bit streams compatible with MPEG, JPEG and CCITT H.64. As shown in FIG. 1a, chip 100 communicates through host bus interface 102 with a host computer (not shown) over 32-bit host bus 101. Host bus interface 102 implements the IEEE 1196 NuBus standard. In addition, chip 100 communicates with an external memory 103 (not shown) over 32-bit memory bus 105. Chip 100's access to external memory 103 is controlled by a memory controller 104, which includes dynamic random access memory (DRAM) controller 104a and direct memory access (DMA) controller 106. Chip 100 has two independent 16-bit bidirectional video ports 107 and 108 receiving and sending data on video busses 190a and 190b respectively. Video ports 107 and 108 are substantially identical, except that port 107 is provided with a decimation filter, and port 108 is provided with an interpolator. Both the decimator and the interpolator circuits of ports 107 and 108 are described in further detail below.
The functional units of chip 100 communicate over an internal global bus 120. These units include the central processing unit (CPU) 150, the variable-length code coder (VLC) 109, variable-length code decoder (VLD) 110, and motion estimator 111. Central processing unit 150 includes the processor status word register 151, which stores the state of CPU 150, instruction memory ("I mem") 152, instruction register 153, register file ("RMEM") 154, which includes 31 general purpose registers R1-R31, byte multiplexor 155, arithmetic logic unit ("ALU") 156, memory controller 104, multiplier-accumulator (MAC) 158, and scratch memory ("SMEM") 159, which includes address generation unit 160. Memory controller 104 provides access to external memory 103, including direct memory access (DMA) modes.
Global bus 120 is accessed by SMEM 159, motion estimator 111, VLC 109 and VLD 110, memory controller 104, instruction memory 152, host interface 102 and bidirectional video ports 107 and 108. A processor bus 180 is used for data transfer between SMEM 159, VLC 109 and VLD 110, and CPU 150.
During video operations, the host computer initializes chip 100 by loading the configuration registers in the functional units of chip 100, and maintains the bit streams sent to or received from video ports 107 and 108.
Chip 100 has a memory address space of 16 megabytes. A map of chip 100's address space is provided in FIG. 1c. As shown in FIG. 1c, chip 100 is assigned a base address. The memory space between the base address and the location (base address+7FFFFF.sup.1) is reserved for an external dynamic random access memory (DRAM). The memory space between location (base address+800000) and location (base address+9FFFFF) is reserved for registers addressable over global bus 120. The memory space between location (base address+A00000) and location (base address+BFFFFF) is reserved for registers addressable over a processor bus or write-back bus ("W bus") 180a. A scratch or cache memory, i.e. memory 159, is allocated the memory space between location (base address+C00000) and location (base address+FFFFFF).
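By way of illustration only, the address map can be expressed as a lookup table. The offsets, which are hexadecimal, are those stated above; the table name `REGIONS`, the function name `decode`, and the region labels are hypothetical.

```python
# (first offset, last offset, region) relative to the chip's base address
REGIONS = [
    (0x000000, 0x7FFFFF, "external DRAM"),
    (0x800000, 0x9FFFFF, "global bus registers"),
    (0xA00000, 0xBFFFFF, "processor bus / W bus registers"),
    (0xC00000, 0xFFFFFF, "scratch memory (SMEM 159)"),
]


def decode(address, base):
    """Return the name of the region that `address` falls in."""
    offset = address - base
    for first, last, name in REGIONS:
        if first <= offset <= last:
            return name
    raise ValueError("address outside this chip's 16-megabyte space")
```

The four regions are contiguous and together span offsets 000000 through FFFFFF, which is exactly 16 megabytes.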
A multi-chip system can be built using multiple copies of chip 100. FIG. 1b shows a two-chip configuration 170, in which two copies of chip 100, chips 100a and 100b are provided. Up to 16 copies of chip 100 can be provided in a multi-chip system. In such a system, video port 108 of each chip is connected to a reference video bus, such as bus 171, which is provided for passing both video data and non-video data between chips. Each chip receives video input at port 107. In FIG. 1b, the video input port 107 of each chip receives input data from external video bus 172. Each chip is provided a separate 16-megabyte address space which is not overlapping with other chips in the multi-chip configuration.
2. Video Ports 107 and 108
Video ports 107 and 108 can each be configured for input or output functions. When configured as an input port, video port 107 has a decimator for reducing the resolution of incoming video data. When configured as an output port, video port 108 has an interpolator to output data at a higher resolution than chip 100's internal representation. FIG. 2 is a block diagram of video port 107. Video port 107 can operate in either a 16-bit mode or an 8-bit mode. When the video port is configured as an input port, video data is read from video bus 190a into 16.times.8 register file 201, which is used as a first-in-first-out (FIFO) memory under the control of read counter 202 and write counter 203. Under 8-bit input mode, read counter 202 receives an external signal V.sub.-- active, which indicates the arrival of video data. Decimation filter or decimator 204, which receives video data from register file 201, can be programmed to allow the data received to pass through without modification, to perform CCR 601 filtering, or to perform CIF decimation. In video port 108, where decimator 204 is absent, only YC.sub.b C.sub.r separation is performed.
The results from decimator 204 are provided to a 32.times.4-byte video FIFO (VFIFO) 205. The contents of video FIFO 205 are transferred by DMA, under the control of memory controller 104, to external memory 103. Because various downstream processing functions, e.g. DCT, IDCT operations or motion estimation, operate on chrominance and luminance data separately, chrominance and luminance data are separately stored in external memory 103 and moved into and out of video FIFO 205 in blocks of the same chrominance or luminance type. Typically, the blocks of chrominance and luminance data covering the same screen area are retrieved from external memory 103 in an interleaved manner ("block interleaved" order). By contrast, input and output of video data on video busses 190a and 190b are provided sample by sample, interleaving chrominance and luminance types ("pixel interleaved" order). To facilitate the sorting of data from pixel interleaved order to block interleaved order ("YUV separation") during data input, and in the other direction during data output, a special address generation mechanism is provided. This address generation mechanism, which is discussed in further detail below, stores the pixel interleaved data arriving at video port 107 or 108 into video FIFO 205 in block interleaved order. During output, the address generation mechanism reads block interleaved order data from video FIFO 205 in pixel interleaved order for output.
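By way of illustration only, the sorting between pixel interleaved order and separated luminance and chrominance data can be sketched in software. The sketch assumes a 4:2:2 arrival order of C.sub.b, Y, C.sub.r, Y, which is an assumption and is not stated in this passage, and it does not model the address generation mechanism of video FIFO 205. The function names are hypothetical.

```python
def separate_422(stream):
    """Split a pixel interleaved 4:2:2 stream into Y, Cb and Cr lists.

    Assumed arrival order: Cb0, Y0, Cr0, Y1, Cb1, Y2, Cr1, Y3, ...
    """
    if len(stream) % 4:
        raise ValueError("expected whole groups of four samples")
    y = list(stream[1::2])    # every second sample is luminance
    cb = list(stream[0::4])   # first sample of each group of four
    cr = list(stream[2::4])   # third sample of each group of four
    return y, cb, cr


def interleave_422(y, cb, cr):
    """Inverse operation: rebuild the pixel interleaved stream for output."""
    out = []
    for i in range(len(cb)):
        out += [cb[i], y[2 * i], cr[i], y[2 * i + 1]]
    return out
```

Applying `interleave_422` to the result of `separate_422` reproduces the original stream, which mirrors the input and output directions described above.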
Address counters 207 and 208 are provided to generate the addresses necessary for reading and writing data streaming into or out of video FIFO 205. Address counter 207 is a 9-bit byte counter, and address counter 208 is a 7-bit word counter. In this embodiment, two extra bits are provided in each of counters 207 and 208, to allow video FIFO 205 to overflow without losing synchronization with the external video data stream, in the event that a DMA transfer to and from external memory 103 cannot take place in time.
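By way of illustration only, the role of the two extra counter bits can be sketched for the byte counter. A 32.times.4-byte FIFO holds 128 bytes and therefore needs 7 address bits; a 9-bit counter can continue counting past the end of the FIFO while its low 7 bits still select a FIFO location. The interpretation of the counter difference as a backlog, and all names below, are assumptions made for this sketch.

```python
FIFO_BYTES = 32 * 4        # 128 bytes, addressed by 7 bits
COUNTER_BITS = 9           # 7 address bits plus 2 extra bits
COUNTER_MOD = 1 << COUNTER_BITS


def advance(counter, nbytes=1):
    """Advance a 9-bit byte counter, wrapping at 512."""
    return (counter + nbytes) % COUNTER_MOD


def fifo_index(counter):
    """The low 7 bits of the counter select the FIFO location."""
    return counter % FIFO_BYTES


def backlog(write_counter, read_counter):
    """Bytes written but not yet read, valid up to 511 bytes."""
    return (write_counter - read_counter) % COUNTER_MOD
```

A backlog greater than `FIFO_BYTES` indicates that data has been overwritten, yet both counters still agree with the external stream on which sample comes next.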
When the video port is configured for video output, video data is retrieved from external memory 103 and provided to interpolator 206, which can be programmed to allow the data to pass through without modification or to provide a (1,1) interpolation. The output data of interpolator 206 is provided as output of chip 100 on video bus 190a.
a. The Synchronizer
Chip 100 operates under an internal clock ("system clock") of chip 100 at a rate of 60 Mhz. However, incoming video data are synchronized with an external clock ("video clock"). Under 8-bit mode, video data arrive at video port 107 at 27 Mhz. Under 16-bit mode, video data arrive at video port 107 at 13.5 Mhz. The system and video clocks are asynchronous with respect to each other. Consequently, for the video data to be properly received, a synchronization circuit 300, which is shown in FIG. 3a, is provided to synchronize the video data arriving at video port 107.
FIG. 4a shows a timing diagram of video port 107 under 16-bit input mode. As shown in FIG. 4a, 16-bit video data arrives at port 107 synchronous with an external video clock signal Vclk 404a, i.e. the video clock, at 13.5 Mhz. Internally, the synchronization circuit generates a write signal 401, which is derived from detecting the transitions of video clock 404a, to latch the 16-bit video data into register file 201 as two 8-bit data. FIG. 4a shows the data stream 403a representing the 8-bit data stream. In FIG. 4a, 16-bit video data are ready at video port 107 at times t.sub.0 and t.sub.2, and 8-bit video data are latched at times t.sub.0, t.sub.1, t.sub.2, and t.sub.3.
FIG. 4b shows a timing diagram of video port 107 operating under 8-bit input mode. Under the 8-bit input mode, the write signal 401, which is derived from detecting the transitions of video clock 404b, latches into register file 201 each 8-bit data word of video data stream 403a at times t.sub.0, t.sub.1, t.sub.2, and t.sub.3.
Since the external video clock is asynchronous to the internal system clock, valid data can be latched only within a window of time after a rising edge of the video clock. Thus, valid data are latched only when the rising edges of the video clock are properly detected. In the prior art, such rising edges are detected by sampling the video clock using a flip-flop. However, if the rising edge of the video clock occurs at a time so close to the sampling point that it violates the set-up or the hold time of the flip-flop, the flip-flop can enter a metastable state for an indefinite period of time. During this period of metastability, another sampling by the flip-flop on the input video clock signal cannot take place without risking the loss of data. In chip 100, where the usual time for the output data of a flip-flop to settle is approximately 3 nanoseconds, this metastable period can exceed 12 nanoseconds.
Under the 8-bit input mode, a rising edge in the external video clock occurs every 37 nanoseconds. To detect this rising edge, the sampling frequency is required to be at least twice the frequency of the video clock Vclk, which translates to a sampling period of no more than approximately 18.5 nanoseconds. As mentioned above, if a rising edge occurs too close in time to a sampling point, the sampling flip-flop enters into a metastable state. Because a metastable flip-flop may require in excess of 12 nanoseconds to resolve, i.e. more than half of the available time between arrivals of the clock edges of the video clock, the detections of rising edges in the video clock signal occur in an unpredictable manner. In certain circumstances, some rising edges would be missed. (In the 16-bit mode, however, because the input data arrives approximately every 74 nanoseconds, there is ample time for the metastable flip-flop to resolve before the arrival of the next rising edge of the video clock).
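By way of illustration only, the timing figures can be checked with simple arithmetic. The clock rates are those stated in this description (27 Mhz and 13.5 Mhz video clocks, 60 Mhz system clock); the helper name `period_ns` and the variable names are hypothetical.

```python
def period_ns(freq_mhz):
    """Clock period in nanoseconds for a frequency given in megahertz."""
    return 1000.0 / freq_mhz


video_8bit = period_ns(27.0)     # about 37.0 ns between rising edges
video_16bit = period_ns(13.5)    # about 74.1 ns between rising edges
max_sample = video_8bit / 2      # about 18.5 ns: at least two samples per period
system = period_ns(60.0)         # about 16.7 ns system clock period
both_edges = system / 2          # about 8.3 ns when both clock edges sample
```

Sampling once per system clock period already satisfies the two-samples-per-period requirement under 8-bit mode, but only narrowly, so a sampling flip-flop that needs more than 12 nanoseconds to resolve leaves little margin.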
To ensure that a rising edge of the external video clock is always caught, the external video clock is sampled at both the rising edges and the falling edges of the system clock. By contrast, the video data at video port 107 or 108 are only sampled at the rising edges of the system clock. A synchronization circuit 300, shown in FIG. 3a, is provided to detect the edges on the video clock.
As shown in FIG. 3a, the video clock (Vclk) is provided to the data inputs of two 2-bit shift registers 301 and 302. Shift register 301 comprises D flip-flops 301a and 301b, and shift register 302 comprises D flip-flops 302a and 302b. Shift registers 301 and 302 are clocked by the rising and the falling edges of system clock Sclk, respectively. In addition, the output data of shift register 301 is provided to a data input terminal of D flip-flop 305, which is also clocked by the falling edge of system clock Sclk. Preferably, D flip-flop 301a is skewed to have a rapid response to a rising edge at its data input terminal. Likewise, D flip-flop 302a is skewed to have a rapid response to a falling edge at its data input terminal. Such response skewing can be achieved by many techniques known in the art, such as the use of ratio logic and the use of a high gain in the master stage of a master-slave flip-flop.
NAND gates 310-313 are provided in an AND-OR configuration. NAND gates 310 and 311 each detect a rising edge transition, and NAND gate 312 detects a falling edge transition. An edge transition detected in any of NAND gates 310-312 results in a logic `1` at the output of NAND gate 313. NAND gate 312 is used in the 16-bit mode to detect a falling edge of the video clock. This falling edge is used in the 16-bit mode to confirm latching of the second 8-bit data of the 16-bit data word on video port 107.
The operation of synchronization circuit 300 can be described with the aid of the timing diagram shown in FIG. 3b and the time annotations indicated on the signal lines of FIG. 3a. FIG. 3b shows the states of system clock signal (Sclk) at times t.sub.1 to t.sub.4. The time annotation on each signal line in FIG. 3a indicates, at time t.sub.4, the sample of the video clock held by the signal line. For example, since the sample of the video clock at time t.sub.1 propagates to the output terminal of D flip-flop 301b after two rising edges of the system clock, the output terminal of D flip-flop 301b at time t.sub.4 is annotated "t.sub.1 " to indicate the value of D flip-flop 301b's output data. Similarly, at time t.sub.4, which is immediately after a falling edge of the system clock, the output datum of D flip-flop 305 is also labelled "t.sub.1 ", since it holds the sample of the video clock at time t.sub.1.
At time t.sub.4, therefore, NAND gate 310 compares an inverted sample of the video clock at time t.sub.1 with a sample of the video clock at time t.sub.2. If a rising edge transition occurs between times t.sub.1 and t.sub.2, a zero is generated at the output terminal of NAND gate 310. NAND gate 310, therefore, detects a rising edge arriving after the sampling edge of the system clock. At the same time, NAND gate 311 compares an inverted sample of the video clock at time t.sub.2 with a sample of the video clock at time t.sub.3. Specifically, if a rising edge occurs between times t.sub.2 and t.sub.3, a zero is generated at the output terminal of NAND gate 311. Thus, NAND gate 311 detects a rising edge of the video clock arriving before the sampling edge of the system clock.
The output datum of NAND gate 313 is latched into register 314 at time t.sub.5. The value in register 314 indicates whether a rising edge of Vclk is detected between times t.sub.1 and t.sub.3. This value is reliable because, even if D flip-flop 301a enters into a metastable state as a result of a rising edge of video clock signal Vclk arriving close to time t.sub.3, the metastable state would have been resolved by time t.sub.5.
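By way of illustration only, the combinational part of the rising edge detection can be modeled at the logic level. The sketch takes the three samples of Vclk at times t.sub.1, t.sub.2 and t.sub.3 as inputs and ignores flip-flop timing and metastability. The output of NAND gate 312 is modeled as an input held inactive at 1, which is an assumption; the function names are hypothetical.

```python
def nand(*inputs):
    """Logic NAND of any number of 0/1 inputs."""
    return 0 if all(inputs) else 1


def rising_edge_detected(s1, s2, s3, nand312_out=1):
    """Model of NAND gates 310, 311 and 313 for samples at t1, t2, t3."""
    n310 = nand(1 - s1, s2)   # 0 when Vclk rose between t1 and t2
    n311 = nand(1 - s2, s3)   # 0 when Vclk rose between t2 and t3
    # NAND gate 313 outputs 1 when any of its inputs signals an edge.
    return nand(n310, n311, nand312_out)
```

The value returned corresponds to the datum latched into register 314: it is 1 for the sample patterns 0,1,1 and 0,0,1 and 0 when the three samples show no low-to-high transition.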
In video port 107, NAND gate 312 is provided to detect a falling edge of the video clock under the 16-bit mode of operation.
b. The Decimator
Video port 107 processes video signals of resolutions between CCR 601 (i.e. 4:2:2, 720.times.480) and QCIF (176.times.144). In one application, CCR 601 video signals are decimated by decimator 204 to CIF (352.times.288) resolution. FIG. 5a shows the sequence in which CCR 601 Y (luminance), C.sub.b and C.sub.r (chrominance) data arrive at port 107.
Decimation is performed by passing the input video through digital filters. In CCR 601 filtering, the chrominance data are not filtered, but the digital filter for luminance data provides filtered pixels, each denoted Y*, according to the equation: ##EQU1## where Y.sub.0 is the luminance data at the center tap, and Y.sub.-1 and Y.sub.1 are luminance data of the pixels on either side of pixel Y.sub.0.
In this digital filter, after providing as output the filtered luminance pixel Y*.sub.0, the center tap moves to input luminance sample Y.sub.1.
For CIF decimation, the digital filter for luminance samples has the equation, ##EQU2## where Y.sub.-3, Y.sub.-2, Y.sub.-1, Y.sub.0, Y.sub.1, Y.sub.2, Y.sub.3 are consecutive input luminance data (Y.sub.-2 and Y.sub.2 are multiplied with a zero coefficient in this embodiment).
Unlike the CCIR 601 filtering, the center tap then moves to Y.sub.2, so that the total number of filtered output samples is half the total number of input luminance samples, achieving a 50% decimation. Under CIF decimation, C.sub.r and C.sub.b type chrominance data are also filtered and decimated. The decimation equations are: ##EQU3## where Cr.sub.0 and Cr.sub.-1, and Cb.sub.0 and Cb.sub.-1, are consecutive samples of the C.sub.r and C.sub.b types. The C.sub.r and C.sub.b filters then operate on the samples Cr.sub.1 and Cr.sub.2, and Cb.sub.1 and Cb.sub.2, respectively. Consequently, under CIF decimation, the number of filtered output samples in each of the C.sub.b and C.sub.r chrominance types is half the number of the corresponding chrominance type input pixels.
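The effect of advancing the center tap by two samples can be illustrated with a short sketch. The function name, the edge handling and, in particular, the default kernel are assumptions made for illustration: the kernel is not taken from the equations above, and was chosen only because it has seven taps with zero coefficients at taps -2 and +2 and sums to 32.

```python
def decimate_by_two(samples, kernel=(-1, 0, 9, 16, 9, 0, -1), shift=5):
    """Apply a 7-tap filter centered on every other sample.

    The default kernel is illustrative only (NOT the source's equation);
    it sums to 2**shift so that a constant input is preserved.
    """
    n = len(samples)
    half = len(kernel) // 2
    out = []
    for center in range(0, n, 2):  # center tap advances by two samples
        acc = 0
        for k, coeff in enumerate(kernel):
            # Assumed edge handling: repeat the first or last sample.
            idx = min(max(center + k - half, 0), n - 1)
            acc += coeff * samples[idx]
        value = (acc + (1 << (shift - 1))) >> shift  # round, then scale down
        out.append(min(max(value, 0), 255))  # keep the result in 8 bits
    return out
```

With an even number of input samples, the output holds exactly half as many samples, which is the 50% decimation described above.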
FIG. 5b is a block diagram of decimator 204. As shown in FIG. 5b, decimator 204 comprises phase decoder 501, multiplexors 502 and 503, a 14-bit adder 504, latch 505 and limiter 506. Phase decoder 501 is a state machine that keeps track of the input data into decimator 204, so as to properly sequence the input samples for digital filtering. FIG. 5c is a table showing, at each phase of CIF decimation, the data output R.sub.out of register 201, the operand inputs A.sub.in and B.sub.in and the carry-in input C.sub.in of adder 504, and the data output Dec of decimator 204 after limiting at limiter 506. Similarly, FIG. 5d is a table showing, at each phase of the CCIR 601 decimation, the data output R.sub.out of register 201, the operand inputs A.sub.in and B.sub.in and the carry-in input C.sub.in of adder 504, and the data output Dec of decimator 204 after limiting at limiter 506.
During a decimation operation, a data sample is retrieved from register file 201. The bits of this data sample are shifted left an appropriate number of bit positions, or inverted, to scale the data sample by a factor of 4, 8, 16 or -1, before being provided as input data to multiplexor 502. When scaling by 16 is required, 15 is added to the input datum to multiplexor 502 to compensate for precision loss due to an integer division performed in limiter 506. Multiplexor 502 also receives as an input datum the latched 14-bit result of adder 504 right-shifted by three bits. Under the control of phase decoder 501, multiplexor 502 selects one of its input data as an input datum to adder 504, at adder 504's A.sub.in input terminal. Multiplexor 503 selects the data sample (left-shifted by four bits) from register 201, a constant zero, or the latched result of 14-bit adder 504. The output datum of multiplexor 503 is provided as data input to 14-bit adder 504, at the B.sub.in input terminal.
The output datum of 14-bit adder 504 is latched at the system clock rate (60 MHz) into register 505. Limiter 506 right-shifts the output datum of register 505 by 5 bits, so as to limit the output datum to a value between 0 and 255. The output datum of limiter 506 is provided as the data output of decimator 204.
As mentioned above, video port 108 can alternatively be configured as an output port. When configured as an output port, port 108 provides, at the user's option, a (1, 1) interpolation between every two consecutive samples of same type chrominance or luminance data.
FIG. 6a shows interpolator 206 of chip 100. As shown in FIG. 6a, during video output mode, an address generator 601, which includes address counters 207 and 208, is provided to read samples of video data from video FIFO 205. Consecutive samples of video data of the same type are latched into 8-bit registers 602 and 603. Data contained in registers 602 and 603 are provided as input operands to adder 604. Each result of adder 604 is divided by 2, i.e. right-shifted by one bit, and latched into register 605. In this embodiment, registers 602 and 603 are clocked at 60 MHz, and register 605 is clocked at 30 MHz.
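As an illustration of the (1, 1) interpolation, the sketch below inserts the truncated average of each pair of consecutive same-type samples between them. The function name is hypothetical, and the treatment of the final sample (passed through unchanged) is an assumption; only the add and one-bit right shift come from the description of adder 604.

```python
def interpolate_1_1(samples):
    """Insert the truncated average between consecutive same-type samples."""
    out = []
    for a, b in zip(samples, samples[1:]):
        out.append(a)
        out.append((a + b) >> 1)  # sum of the two operands, right-shifted by one bit
    if samples:
        out.append(samples[-1])  # assumed: last sample has no successor to average with
    return out
```

Passing the same value as both operands returns that value unchanged, which is how a non-interpolated sample can travel through the same adder path.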
When video bus 109a is configured as an input bus, video FIFO 205 receives from decimator 204 the decimated video data, which is then transferred to external memory 103. Alternatively, when video bus 109a is configured as an output bus, video data are received from external memory 103 and provided in a proper sequence to interpolator 206 for output to video bus 109a. The operation of the video FIFO in video port 107 is similar to that of video FIFO 205.
When YUV separation is performed during input mode, or when interpolation is performed during output mode, video FIFO 205 is divided into four groups of locations ("block interleaved groups"). Each block interleaved group comprises a 16-byte "Y-region", an 8-byte "U-region", and an 8-byte "V-region". Data transfers between video FIFO 205 and external memory 103 occur as DMA accesses under memory controller 104's control. Address counters 207 and 208 generate the addresses required to access video FIFO 205.
FIG. 6b is an address map 650 of a block interleaved group in video FIFO 205, showing the block interleaved group partitioned into Y-region 651, U-region 652 and V-region 653. A data stream 654 arriving from decimator 204 is shown at the top of address map 650. Shown in each of the regions are the locations of data from data stream 654.
Address map 650 also represents the data storage location for performing interpolation, when video port 107 is configured as an output port. As shown in FIG. 6b, the Y-region 651 is offset from the U-region 652 by sixteen bytes, and the U-region 652 is further offset from the V-region 653 by eight bytes. In addition, adjacent groups of block interleaved locations are offset by 32 bytes.
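The layout of the block interleaved groups can be summarized as a small address calculation. This is a sketch under stated assumptions: the names are hypothetical, and placing the Y-region at the base of each group (with the U-region and V-region following at offsets 16 and 24) is inferred from the offsets given above rather than read from FIG. 6b.

```python
GROUP_SIZE = 32                             # bytes between adjacent groups
REGION_OFFSET = {"Y": 0, "U": 16, "V": 24}  # assumed: Y-region at the group base
REGION_SIZE = {"Y": 16, "U": 8, "V": 8}     # bytes per region


def fifo_byte_address(group: int, region: str, index: int) -> int:
    """Byte address of sample `index` of `region` in block interleaved `group`."""
    if not 0 <= group < 4:
        raise ValueError("the video FIFO holds four block interleaved groups")
    if not 0 <= index < REGION_SIZE[region]:
        raise ValueError("index lies outside the region")
    return group * GROUP_SIZE + REGION_OFFSET[region] + index
```

Under these assumptions the four groups occupy byte addresses 0 through 127.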
Address counter 207 generates the addresses of video FIFO 205 for YUV separation during input mode, and the addresses for interpolation during output mode. FIG. 6c illustrates address generation by address counter 207 for accessing video FIFO 205. As shown in FIG. 6c, address counter 207 comprises an 11-bit counter 620 counting at 60 MHz. Embedded fields in counter 620 include a 9-bit value C[8:0], and bits "p" and "ex". The positions of these bits in counter 620 are shown in FIG. 6c. The "p" bit, which is the least significant bit of counter 620, represents the two phases of an interpolation operation. These two phases correspond to the operand loadings into registers 602 and 603 (FIG. 6a) during the (1, 1) interpolation.
During interpolation, every other luminance sample, every other red type chrominance sample (C.sub.r), and every other blue chrominance sample (C.sub.b) are interpolated. FIG. 6d shows, under interpolation mode, the sequence in which stored and interpolated luminance and chrominance samples are output.
Bit C[0] of binary counter 620 counts at 30 MHz. Since video data samples are received or output at video ports 107 and 108 in pixel interleaved order at 30 MHz, bit C[0] of binary counter 620 indicates whether a luminance sample or a chrominance sample is received or output. Since bit C[1] counts at half the rate of bit C[0], for chrominance samples, bit C[1] indicates whether a C.sub.b or a C.sub.r type chrominance sample is output.
Bits C[8:0] are used to construct the byte address B[8:0] (register 625) for accessing video FIFO 205. Bits B[6:5] indicate which of the four block interleaved groups in video FIFO 205 is addressed. Thus, bits B[8:5] form a "group address". Incrementer 621 receives bits C[8:2] and, during interpolation, increments the number represented by these bits. Bits C[8:2] are incremented whenever the following expression evaluates to a logical true value:
The operators in this expression are the logical "and" and the logical "or". Bit "ex" of binary counter 620 indicates an interpolation output. Thus, according to this expression, incrementer 621 increments C[8:2] at one of the two phases of the interpolation operation, every other luminance output, or every other blue or red chrominance output. In this embodiment, when the output sample is not an interpolated output sample, incrementer 621 is disabled. Consequently, both registers 602 and 603 (FIG. 6a) obtain their values from the same byte address. In effect, the same sample is fetched twice, so that each non-interpolated sample is really obtained by performing a (1, 1) interpolation using two identical values.
The data output of incrementer 621 is referenced as D[6:0]. As shown in FIG. 6c, the group address B[6:5] is provided by bits D[4:3]. Since a toggle of bit B[4] indicates a jump of 16 byte addresses, bit B[4] can be used to switch, within a block interleaved group, between the luminance and the chrominance regions. Accordingly, bit B[4] adopts the value of negated bit C[0]. In addition, since a toggle of bit B[3] indicates a jump of eight byte addresses, bit B[3] can be used to switch, when a chrominance sample is fetched, between the U and V regions of a block interleaved group. Thus, as shown in FIG. 6c, bit B[3] has the value of bit C[1].
The unregistered value 624 contains a value E[4:0] formed by the ordered combination of bit C[1], bits D[2:0] and the bit whose value is provided by the expression ((C[1]p)ex), which uses the "exclusive-or" operator. Bits E[4:1] provide the byte address bits B[3:0] during output of a chrominance sample, and bits E[3:0] provide byte address bits B[3:0] during output of a luminance sample. Bit E[0] ensures that the correct byte address is output when an "odd" interpolated luminance sample is output. (U and V refer to chrominance pixel types C.sub.b and C.sub.r, respectively.)
FIG. 6e shows two adjacent block interleaved groups 630 and 631. Group 630 comprises Y-region 630a, U-region 630b and V-region 630c and group 631 comprises Y-region 631a, U-region 631b and V-region 631c. In FIG. 6e, the labels 1-31 in group 630 represent the positions, in pixel interleaved order, of the pixels stored at the indicated locations of video FIFO 205. Likewise, the labels 32-63 in group 631 represent the positions, in pixel interleaved order, of the pixels stored at the indicated locations. The control structure of FIG. 6c ensures that the proper group addresses are generated when the output sequence crosses over from output samples obtained or interpolated from pixels in group 630 to samples obtained or interpolated from pixels in group 631.
3. The Memory Structure
Internally, chip 100 has six major blocks of memory circuits relating to CPU 150. These memory circuits, which are shown in FIG. 7a, include instruction memory 152, register file 154, Q memory 701 ("QMEM"), SMEM 159, address memory ("AMEM") 706, and P memory 702 ("PMEM"). In addition, a FIFO memory ("VLC FIFO") 703 (not shown) is provided for use by VLC 109 and VLD 110 during the coding and decoding of variable-length codes. A "zig-zag" memory 704 ("Z mem", not shown) is provided for accessing DCT coefficients in either zigzag or binary order. Finally, a window memory 705 ("WMEM", not shown) is provided in motion estimator 111 for storing the current and reference blocks used in motion estimation.
In FIG. 7a, an arithmetic unit 750 represents both ALU 156 and MAC 158 (FIG. 1). Instructions for arithmetic unit 750 are fetched from instruction memory 152. Instruction memory 152 is implemented in chip 100 as two banks of 512.times.32 bit single port SRAMs. Each bank of instruction memory 152 is accessed during alternate cycles of the 60 MHz system clock. Instruction memory 152 is loaded from global bus 120.
The two 36-bit input operands and the 36-bit result of arithmetic unit 750 are read from and written into the 32 general purpose registers R0-R31 of register file 154. The input operands are provided to arithmetic unit 750 over 36-bit input busses 751a and 751b. The result of arithmetic unit 750 is provided on 36-bit output bus 752. (In this embodiment, register R0 is a pseudo-register used to provide the constant zero).
QMEM 701, which is organized as eight 36-bit registers Q0-Q7, shares the same addresses as registers R24-R31. To distinguish between an access to one of registers R24-R31 and an access to one of the registers in QMEM 701, reference is made to a 2-bit configuration field "PQEn" (P-Q memories enable) in CPU 150's configuration register. In this embodiment, registers R0-R23 are implemented by 3-port SRAMs. Each of registers R0-R23 is clocked at the system clock rate of 60 MHz, and provides two read ports, for data output onto busses 751a and 751b, and one write port, for receiving data from bus 752. Registers R24-R31 are accessed for read and write operations only when the "PQEn" field is set to `00`. The access time for each of registers R0-R23 is 8 nanoseconds. The write ports of registers R0-R31 are latched in the second half period of the 60 MHz clock, to allow data propagation in the limiting and clamping circuits of arithmetic unit 750.
SMEM 159, which is organized as a 256.times.144-bit memory, serves as a high speed cache between external memory 103 and the register file 154. SMEM 159 is implemented by single-port SRAM with an access time under two periods of the 60 MHz system clock (i.e. 33 nanoseconds).
To provide higher performance, special register files QMEM 701 and PMEM 702 are provided as high speed paths between arithmetic unit 750 and SMEM 159. Output data of SMEM 159 are transferred to QMEM 701 over the 144-bit wide processor bus 180b. Input data to be written into SMEM 159 are written into PMEM 702 individually as four 36-bit words. When all four 36-bit words of PMEM 702 contain data to be written into SMEM 159, a single write into SMEM 159 of a 144-bit word is performed. SMEM 159 can also be directly written from a 36-bit data bus in "W bus" 180a, bypassing PMEM 702. W bus 180a comprises a 36-bit data bus and a 6-bit address bus. Busses 180a and 180b form the processor bus 180 shown in FIG. 1.
In this embodiment, QMEM 701 is implemented by 3-port 8.times.36 SRAMs, allowing (i) write access on bus 180b as two quad-word (i.e. 144-bit) registers, and (ii) read access on either bus 751a or 751b as eight 36-bit registers. The access time for QMEM 701 is 16 nanoseconds. PMEM 702 allows write access from both W bus 180a and QGMEM 810 (see below). QGMEM 810 is an interface between global bus 120 and processor bus 180a. PMEM 702 is read by SMEM 159 on a 144-bit bus 708 (not shown).
FIG. 7b illustrates in further detail the interrelationships between QMEM 701, PMEM 702, SMEM 159 and registers R0-R31. As shown in FIG. 7b, PMEM 702 receives either 32-bit data on global bus 120, or 36-bit data on W bus 180a. Write decoder 731 maps the write requests on W bus 180a or global bus 120a into one of the eight 36-bit registers P0-P7. Physically, PMEM 702 is implemented by only four actual 36-bit registers. Each of the registers P0-P3 is mapped into one of the four actual registers. The halfwords of each of registers P4-P7 map into two of the four actual registers. FIG. 7c shows the correspondence between registers P4-P7 and registers P0-P3, which are each mapped into the four actual registers. As shown in FIG. 7c, the higher and lower order halfwords (i.e. bits [31:16] and bits [15:0], respectively) of register P4 are mapped respectively into the lower order halfwords (i.e. bits [15:0]) of registers P1 and P0. The higher and lower order halfwords of register P5 are mapped respectively into the higher order halfwords of registers P1 and P0. The higher and lower order halfwords of register P6 are mapped respectively into the lower order halfwords of registers P3 and P2. The higher and lower order halfwords of register P7 are mapped respectively into the higher order halfwords of registers P3 and P2. In this manner, an instruction storing a quad pel (4 by 16-bits) into registers P4 and P5, or registers P6 and P7, would also have transposed the quad pel prior to storing the quad pel into SMEM 159. In conjunction with the "quarter turn" memory (described below), registers P4-P7 provide a means for writing a macroblock of pixels in column or row order and reading the macroblock back in the corresponding row or column order.
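The halfword mapping of registers P4-P7 onto the four actual registers can be modeled as follows. This is a simplified sketch: the class and method names are hypothetical, and only the low 32 bits of each 36-bit register are modeled (the guard bits are ignored).

```python
class PMem:
    """Four actual registers; P4-P7 are halfword views onto them."""

    # For each of P4-P7: (destination of high halfword, destination of low
    # halfword), each destination given as (actual register, which half).
    VIEW = {
        4: ((1, "lo"), (0, "lo")),
        5: ((1, "hi"), (0, "hi")),
        6: ((3, "lo"), (2, "lo")),
        7: ((3, "hi"), (2, "hi")),
    }

    def __init__(self):
        self.actual = [0, 0, 0, 0]  # backing storage for P0-P3

    def _put_half(self, reg, half, value16):
        if half == "hi":
            self.actual[reg] = (self.actual[reg] & 0x0000FFFF) | (value16 << 16)
        else:
            self.actual[reg] = (self.actual[reg] & 0xFFFF0000) | value16

    def write(self, p, word32):
        """Write a 32-bit word to register P0-P7."""
        if p < 4:  # P0-P3 map one-to-one onto the actual registers
            self.actual[p] = word32 & 0xFFFFFFFF
            return
        (hi_reg, hi_half), (lo_reg, lo_half) = self.VIEW[p]
        self._put_half(hi_reg, hi_half, (word32 >> 16) & 0xFFFF)
        self._put_half(lo_reg, lo_half, word32 & 0xFFFF)
```

Writing 0xAAAABBBB to P4 and 0xCCCCDDDD to P5 leaves 0xDDDDBBBB in P0 and 0xCCCCAAAA in P1, so the halfwords written as two rows come out regrouped by column.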
PMEM 702 is read only by the StoreP instruction, and stores over bus 708 the four actual registers as a 144-bit word into SMEM 159. The 144-bit word stored into SMEM 159 is formed by concatenating the contents of the four actual registers, in the order of corresponding registers P0-P3.
Thirty-two 36-bit locations in SMEM 159 are each provided two addresses. These addresses occupy the greatest 64 (36-bit word) addresses of SMEM 159's address space. The first set of addresses ("direct addresses"), at hexadecimal 3c0-3df, are mapped in the same manner as the remaining lower 36-bit locations of SMEM 159. The second set of addresses ("alias addresses"), at hexadecimal 3e0-3ff, are aliased to the direct addresses. The mappings between the direct and the alias addresses are shown in FIG. 7d. The aliases are assigned in such a way that, if a macroblock is written in row order into these addresses, using the second set of addresses and using registers P4-P7 of PMEM 702, and read back in sequential order using the first (direct) addresses, the macroblock is read back in column and row transposed order. Since the present embodiment performs a 2-dimensional DCT or IDCT operation on a macroblock in two passes, one pass being performed in row order and the other pass being performed in column order, these transpose operations provide a highly efficient, low overhead mechanism for performing the 2-dimensional DCT or IDCT operation.
As shown in FIG. 7b, SMEM 159 can also be written directly from W bus 180a, thereby bypassing PMEM 702. Multiplexers 737a-737d select as input data to SMEM 159 between the data on bus 708 and W bus 180a. Drivers 738 are provided for writing data into SMEM 159. Decoder 733 decodes read and write requests for access to SMEM 159.
An address memory ("AMEM") 706, which is implemented as an 8.times.10 bit SRAM, stores up to eight memory pointers for indirect or indexed access of SMEM 159 at 36-bit locations. An incrementer 707 is provided to facilitate indexed mode access of SMEM 159.
Zigzag memory 704 and window memory 705 are described below in conjunction with VLC 109 and motion estimator 111.
4. Memory Controller 104
Chip 100 accesses external memory 103, which is implemented by dynamic random access memory (DRAM). Controller 104 supports one, two or four banks of memory, and up to a total of eight megabytes of DRAM.
Memory controller 104 manages the accesses to both external memory 103 and the internal registers. In addition, memory controller 104 (a) arbitrates requests for the use of global bus 120 and W bus 180a, (b) controls all transfers between external memory 103 and the functional units of chip 100, and (c) controls transfers between QG registers ("QGMEM") 810 and SMEM 159. FIG. 8a is a block diagram of memory controller 104. QGMEM 810 is a 128-bit register which is used for block transfers between 144-bit SMEM 159 and 32-bit global bus 120. Thus, for each transfer between QGMEM 810 and SMEM 159, four transfers between global bus 120 and QGMEM 810 take place. A guard-bit mechanism, discussed below, is applied when transferring data between QGMEM 810 and SMEM 159.
As shown in FIG. 8a, an arbitration circuit 801 receives requests from functional units of chip 100 for data transfer between external memory 103 and the requesting functional units. Data from external memory 103 are received into input buffer 811, which drives the received data onto global bus 120. The requesting functional units receive the requested data either over global bus 120, or over processor bus (i.e. W bus) 180a in the manner described below. Data to be written into external memory 103 are transferred from the functional units over either W bus 180a or global bus 120. Such data are received into a data buffer 812 and driven onto memory data bus 105a.
W bus 180a comprises a 36-bit data bus 180a-1 and a 6-bit address bus 180a-2. Address bus 180a-2 and data bus 180a-1 are pipelined so that the address on address bus 180a-2 is associated with the data on data bus 180a-1 in the next cycle. The most significant bit of address bus 180a-2 indicates whether the operation reads from a register of a functional unit or writes to a register of a functional unit. The remaining bits on address bus 180a-2 identify the source or destination register. Additional control signals on W bus 180a are: (a) isW.sub.-- bsy (a signal indicating valid data in the isWrite Register 804), (b) Wr.sub.-- isW (a signal enabling a transfer of the content of data bus 180a-1 into isWrite Register 804), (c) req.sub.-- W5.sub.-- stall (a signal requesting W bus 180a 5 cycles ahead), and (d) Ch1.sub.-- busy (a signal to indicate that channel 1, which is RMEM 154, is busy).
In memory controller 104, a channel memory 802 and an address generation unit 805 control DMA transfers between functional units of chip 100 and external memory 103. In the present embodiment, channel memory has eight 32-bit registers or entries, corresponding to 8 assigned channels for DMA operations. To initiate a DMA access to external memory 103 or an internal control register, the requesting device generates an interrupt to have CPU 150 write, over W bus 180a, a request into the channel memory entry assigned to the requesting device. The portion of external memory 103 accessed by DMA can be either local (i.e. in the address space of the present chip) or remote (i.e. in the address space of another chip).
In the present embodiment, channel 0 is reserved for performing refresh operations of external memory 103. Channel 1 allows single-datum transfer between external memory 103 and RMEM 154. Channel 2 is reserved for transfers between host interface 102 and either external memory 103 or internal control registers. FIGS. 8b and 8d provide the bit assignment diagrams for channel memory entries of channels 1 and 2 respectively. Channels 3-7 are respectively assigned to data transfers between either external memory 103, or internal control registers, and (a) video bus 107, (b) video bus 108, (c) VLC FIFO 703 of VLC 109 and VLD 110, (d) SMEM 159, and (e) instruction memory 152. FIG. 8c provides the bit assignment diagrams of the channel memory entries of channels 0 and 3-7.
For all channel entries, bit 0 indicates whether the requested DMA access is a read access or a write access. In the channel memory entry of channel 1 (FIG. 8b), bits 31:24 are used to specify the ID of a "remote" chip, when the address space of the remote chip is accessed. If access to the address space of a remote chip is requested, bit 1 is also set. In the channel memory entry of channel 1, bit 23 indicates whether the DMA access is to external memory 103 or to a control register of either global bus 120 or W bus 180a. When the access is to a control register of W bus 180a, bit 21 is also set. For channels 0 and 3-7, bits 31:23 provide a count indicating the number of 32-bit words to transfer. For channels 3 and 4 (video buses 107 and 108), the count is a multiple of 16. For channel 6 (SMEM 159), the count is a multiple of 4.
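The bit assignments listed above can be collected into a small decoder. This is a sketch only: the function and field names are hypothetical, the polarity of each single-bit field is not stated in the text and is therefore returned raw, and channel 2 is left undecoded beyond bit 0 because its layout (FIG. 8d) is not described here.

```python
def decode_channel_entry(channel: int, entry: int) -> dict:
    """Extract the fields of a 32-bit channel memory entry named in the text."""
    fields = {"rw_bit": entry & 1}  # bit 0: read or write (polarity not stated)
    if channel == 1:
        fields["remote"] = (entry >> 1) & 1               # bit 1: remote chip access
        fields["w_bus_register"] = (entry >> 21) & 1      # bit 21: W bus control register
        fields["memory_or_register"] = (entry >> 23) & 1  # bit 23: memory vs. register
        fields["remote_chip_id"] = (entry >> 24) & 0xFF   # bits 31:24
    elif channel == 0 or 3 <= channel <= 7:
        fields["word_count"] = (entry >> 23) & 0x1FF      # bits 31:23, in 32-bit words
    return fields
```

For example, an entry for channel 6 with bits 31:23 equal to 4 decodes to a transfer of four 32-bit words, consistent with the multiple-of-4 rule for SMEM 159.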
Referring back to FIG. 8a, external DRAM controller 813 maps the addresses generated by address generation unit 805 into addresses in external memory 103. DRAM controller 813 provides conventional DRAM control signals to external memory 103. The output signals of DRAM controller 813 are provided on memory address bus 105b.
In this embodiment, a word in external memory 103 or on host bus 101 is 32 bits long. However, in most internal registers, and on W bus 180a, a data word is 36 bits long. To save the four bits not transferred to external memory 103, or host bus 101, a guard-bit register stores the data bits 35:32 that are driven onto global bus 120. For data received from a 32-bit data source, the "Inbit" field of the guard bit register supplies the missing four bits.
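The guard-bit mechanism amounts to splitting and rejoining words at bit 32. The sketch below illustrates this with hypothetical function names; how and when the guard-bit register is actually loaded is not modeled.

```python
MASK32 = 0xFFFFFFFF


def to_32bit_bus(word36: int):
    """Split a 36-bit word into (32-bit bus word, guard bits 35:32)."""
    return word36 & MASK32, (word36 >> 32) & 0xF


def from_32bit_bus(word32: int, inbit: int) -> int:
    """Rebuild a 36-bit word, taking bits 35:32 from the Inbit field."""
    return ((inbit & 0xF) << 32) | (word32 & MASK32)
```

A round trip through both functions returns the original 36-bit value, provided the Inbit field holds the four bits that were saved.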
A priority interrupt encoding module 807 receives interrupt requests from functional units and generates interrupt vectors according to a priority scheme for CPU 150 to service. An interrupt is generated whenever a channel in channel memory 802 is empty and the channel's interrupt enable bit (stored in an interrupt control register) is set. In this embodiment, the interrupt vector is 4 bits wide to allow encoding of 16 levels of interrupt.
Transactions on global bus 120 are controlled by a state machine 804. Global bus 120, which is 32-bit wide, is multiplexed for address and data. Two single-bit signals GDATA and GVALID indicate respectively whether data or address is placed on global bus 120, and whether valid data or address is currently on global bus 120. Additional single-bit control signals on global bus 120 are IBreq (video input port requests access to external memory), OBreq (video output requests access to external memory), VCreq (VLC requests access to external memory), VDreq (VLD requests access to external memory), IBdmd (Video input is demanding access to external memory), and OBdmd (video output is demanding access to external memory).
During a valid address cycle, memory controller 104 drives an address onto global bus 120. In such an address, bit 6 (i.e. the seventh bit from the least significant end) of the 32-bit word is a "read or write" bit, and indicates whether the bus access reads from or writes to global bus 120. The six bits to the right of the "read or write" bit constitute an address. By driving an address of a functional unit onto global bus 120, memory controller 104 selects the functional unit for the access. Once a functional unit is selected, the selection remains until a new address is driven by memory controller 104 onto the global bus. While selected, the functional unit drives output data or reads input data, according to the nature of the access, until either the GVALID signal is deasserted, or the GDATA signal is negated. The negated GDATA signal signifies a new address cycle in the next system clock period.
An arbitration scheme allows arbitration circuit 801 to provide fairness between non-real time channels, such as SMEM 159, and real-time channels, such as video ports 107 and 108, or VLC 109. In general, a channel memory request from a functional unit is pending when (a) a valid entry of the functional unit is written in channel memory 802, (b) the mask bit (see below) of the functional unit in an enable register for the request is clear, and (c) the functional unit's request signal is asserted. For channels 6 and 7 (i.e. SMEM 159 and instruction memory 152), a request signal is not provided, and a valid entry in channel memory 802 suffices.
In this embodiment, the real-time channels have priority over non-real time channels. Arbitration is necessary when more than one request is pending, and occurs after memory controller 104 is idle or has just finished servicing the last request. In this embodiment, each non-real time channel, other than RMEM, is provided with a mask bit which is set upon completion of a request, if another non-real time request is pending. All of the non-real time mask bits are cleared when no non-real time request is outstanding. Real time channels are not provided with mask bits. Thus, a real time channel request can always proceed, unless preempted by a higher priority request. DRAM refresh is the highest priority real time channel.
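The mask-bit scheme can be sketched as two small functions operating on sets of channel numbers. Several details here are assumptions made for illustration: the function names, breaking ties by lowest channel number (which places refresh channel 0 first), and treating a masked request as not pending. RMEM preemption is not modeled.

```python
def arbitrate(requests, real_time, masked):
    """Pick the next channel to service, or None if nothing is pending.

    requests  -- channels with a valid entry and an asserted request
    real_time -- channels treated as real-time (no mask bits)
    masked    -- non-real-time channels whose mask bit is currently set
    """
    rt = sorted(requests & real_time)
    if rt:
        return rt[0]  # assumed tie-break: lowest channel number wins
    nrt = sorted(requests - real_time - masked)
    return nrt[0] if nrt else None


def complete(channel, requests, real_time, masked):
    """Update the request and mask state after `channel` has been serviced."""
    requests.discard(channel)
    if channel not in real_time and (requests - real_time - masked):
        masked.add(channel)  # another non-real-time request is pending
    if not (requests - real_time - masked):
        masked.clear()  # no non-real-time request is pending any longer
```

With two non-real-time channels requesting continuously, the mask bit makes them alternate instead of letting the lower-numbered one monopolize the memory.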
An exception to the rule that a real time channel has priority over a non-real time channel occurs when the mask bit for RMEM operation is clear and an RMEM operation (i.e. load or store operation) becomes pending. Under this exception, memory controller 104 allows an ongoing request to be interrupted in favor of the RMEM operation. If a second RMEM operation becomes pending prior to the completion of the first RMEM operation, the second RMEM operation is also allowed to proceed ahead of the interrupted request. Up to three such preemptive RMEM operations are allowed to proceed ahead of an interrupted request. Thereafter, memory controller 104 sets the mask bit for an RMEM operation, and the interrupted request is allowed to resume and proceed to completion.
IsWrite register 804 and isRead register 805 are registers provided to support store and load operations of internal registers (i.e. registers in RMEM 154) to and from external memory 103. During a load operation, CPU 150 writes over W bus 180a a request into channel 1 of channel memory 802. When memory controller 104 begins to service the requested load operation, memory controller 104 asserts the "req.sub.-- W5.sub.-- stall" signal to reserve, five cycles ahead, a slot for the use of W bus 180a. When the requested data is received from DRAM, the data is driven onto global bus 120. At the same time, channel memory 802 asserts the Rd.sub.-- isR signal, which latches into isRead register 805 the data on global bus 120. In the following cycle, the content of isRead register 805 is driven onto W bus 180a and latched into the specified destination in RMEM 154 to complete the load operation.
In a store operation, data from RMEM 154 is driven onto W bus 180a, where it is latched by isWrite register 804. In the following cycle, CPU 150 writes a channel request into channel 1 in channel memory 802 over W bus 180a. Memory controller 104 asserts signal isW.sub.-- Bsy to indicate valid data in isWrite register 804 and to prevent CPU 150 from overwriting isWrite register 804. When memory controller 104 is ready to service the store request, the isW.sub.-- Bsy signal is deasserted and the content of isWrite register 804 is driven onto global bus 120 in the following cycle. The data is latched into output buffer 812 for storing into external memory 103 over memory data bus 105a.
The present embodiment supports up to a total of 8 megabytes of external DRAM. FIG. 9a shows a configuration 900 in which external memory 103 is a 4-bank memory interfaced to chip 100. To support this configuration, chip 100 provides two row address strobe (RAS) signals 908 and 909, and two column address strobe (CAS) signals 906 and 907. RAS signals 908 and 909 and CAS signals 906 and 907 are also known, respectively, as the RAS.sub.-- 1 and RAS.sub.-- 0, and CAS.sub.-- 1 and CAS.sub.-- 0 signals.
Memory bus 105 comprises a 32-bit data bus 105a and an 11-bit address bus 105b. To support scan-line mode accesses, discussed below, two output terminals are provided in chip 100 for word address bit 1 (i.e. byte address 3, or A3). Thus, address bus 105b is effectively 10 bits wide. As shown in FIG. 9a, four banks 901-904 of DRAM are configured such that bank 901 receives address strobe signals RAS.sub.-- 0 and CAS.sub.-- 0, bank 902 receives address strobe signals RAS.sub.-- 0 and CAS.sub.-- 1, bank 903 receives address strobe signals RAS.sub.-- 1 and CAS.sub.-- 1, and bank 904 receives address strobe signals RAS.sub.-- 1 and CAS.sub.-- 0.
External memory 103 supports both interleaved and non-interleaved modes. In non-interleaved mode, only two banks of memory are accessed, using both RAS signals and one (CAS.sub.-- 0) CAS signal. Thus, in non-interleaved mode, banks 902 and 903 are not accessed. Under one mode of interleaved DRAM access, banks 0 and 2, both receiving the signal CAS.sub.-- 0, form an "even" memory bank, while banks 1 and 3, both receiving the signal CAS.sub.-- 1, form the "odd" memory bank. In the present embodiment, address bit 2, which is used to generate the signals CAS.sub.-- 0 and CAS.sub.-- 1, distinguishes between the odd and even banks.
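The odd and even bank selection can be expressed as a one-line test on the address. This sketch rests on two assumptions: that "address bit 2" refers to bit 2 of the byte address (i.e. the least significant bit of the 32-bit word address), and that a zero in that bit selects the even bank driven by CAS.sub.-- 0. The function name is hypothetical.

```python
def cas_bank(byte_address: int) -> str:
    """Return which interleaved bank a 32-bit word falls in."""
    # Assumed polarity: bit 2 clear selects the even (CAS_0) bank.
    return "odd" if (byte_address >> 2) & 1 else "even"
```

Under these assumptions, consecutive 32-bit words alternate between the even and odd banks, which is what allows their memory cycles to overlap.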
Interleaved access to external memory 103 is desirable because of the efficiency inherent in overlapping memory cycles of the interleaved memory banks. However, the manner in which data is accessed determines whether such efficiency can be achieved. Generally speaking, with respect to the location of pixels on a video image, chip 100 fetches video data in two different orders: "scan-line" mode, or "reference mode". Under scan-line mode, the access pattern follows a line by line access of the pixels of a display. Under reference mode, pixels are accessed column by column. To support scan-line mode, each bank of memory is divided into two half-banks, each half-bank receiving independently the signal on one of chip 100's two terminals for word address bit 1. In scan-line mode, under certain conditions described below, these two terminals may carry different logic levels to result in a different word address being access in each half-bank.
FIG. 9b is a timing diagram showing interleaved accesses to data in the odd and even banks of FIG. 9a. In FIG. 9b, two page mode read operations and two page mode write operations are performed in each of the odd and even banks. The protocol shown in FIG. 9b is for reference mode access, and is not suitable for use under scan-line mode. This is because, under interleaved reference mode, the same column address is used to access both the even and odd banks. Consequently, as shown in FIG. 9a, chip 100 generates a single address, which is latched by address latch 905, for both the odd and even banks. However, under interleaved scan-line mode, separate column addresses are generated for the even and odd banks.
In configuration 900, signal CAS_1 turns off address latch 905 to keep the column address stable for the odd memory bank. In FIG. 9b, the bus name "Address" represents the signals on memory address bus 105b. The designations "RAr", "CAr12" and "CAr34" represent, respectively, (a) a row address, (b) a column address for data R1 and R2 and (c) a column address for data R3 and R4. The arrivals of the data signals at the even and odd banks are illustrated by the signals "DATA0" and "DATA1" respectively.
In the example illustrated by FIG. 9b, the same column address is used to access data words R1 and R2 and a different column address is used to access data words R3 and R4. Column address CAr12 is latched two cycles apart into the even and odd banks at times t1 and t2, respectively. Likewise, column address CAr34 is latched into the even and odd memory banks at times t3 and t4, respectively. The address of the destination and data words R1, R2, R3 and R4 are driven onto global bus 120 (the signals represented by "GDATA") at consecutive cycles in FIG. 9b.
FIG. 9b also shows an interleaved write access, using the same column address "CAw23" (i.e. the column address for data W2 and W3), which is latched at times t6 and t7 (i.e. separated by two clock cycles), into the even and odd banks of configuration 900. Again, the protocol in FIG. 9b is used under reference mode, but is not suitable for scan-line mode access.
FIG. 9c is a timing diagram showing interleaved access of the memory system in configuration 900 under scan-line mode, where the column addresses for consecutive data words are different. In FIG. 9c, the column addresses for data words R1-R4, represented by "CAr1", "CAr2", "CAr3" and "CAr4", are separately provided at least 4 clock cycles apart. Data words R1 and R3 are stored in the odd memory bank, and data words R2 and R4 are stored in the even memory bank. Both column address strobe signals CAS_0 and CAS_1 are asserted once every six clock cycles. The time period between assertions of the signals CAS_0 and CAS_1 is four clock cycles.
Memory controller 104 generates addresses for accesses to external memory 103. To efficiently support both the fetching of reference frames during motion estimation and the scan-line mode operation during video data input and output, two pixel arrangements are used to store video data in external memory 103. The first arrangement, which supports scan-line mode operation, is shown in FIG. 10a. The second arrangement, which supports reference frame fetching during motion estimation, is shown in FIG. 10b.
FIG. 10a shows an arrangement 1000a which supports scan-line mode operation. In the present embodiment, each access to external memory 103 fetches a 32-bit word comprising four pixels. In external memory 103, a 32-bit data word is used to store four pixels arranged in a "quad pel", i.e. the four pixels are arranged in a 2×2 pixel configuration on the screen. Under scan-line mode, however, the pixels desired are four adjacent pixels on the same scan line. Thus, under scan-line mode, the four pixels fetched are taken from two data words in external memory 103.
In FIG. 10a, the pixels, each represented by a symbol Pxy, are labelled according to the positions at which they appear on a display screen, i.e. `Pxy` is the label given to the pixel at row x and column y. Under the label Pxy of each pixel is a hexadecimal number which represents the byte address (offset from a base address) of the pixel as it is stored in external memory 103. For example, the quad pel comprising pixels P00, P01, P10, and P11 is stored at word address 0 (hexadecimal), which includes the byte addresses 0-3. As a matter of convention, in the following detailed description, the term "quad pel Pxy" is understood to mean the quad pel in which the upper left pixel is labelled Pxy.
FIG. 10a also illustrates a collective term for a number of pixels called a "tile". A "tile" comprises four quad pels arranged in a 2×2 configuration. For example, the square area defined by quad pels P00, P02, P20 and P22 is a tile. As a matter of convention, in the following detailed description, the term "tile Pxy" is understood to mean the tile in which the quad pel at its upper left hand corner is quad pel Pxy. As mentioned above, under scan-line mode access, four horizontally adjacent pixels are accessed at a time. Again, as a matter of convention, in the following discussion, the term "scan line Pxy" is understood to mean the group of four horizontally adjacent pixels whose leftmost pixel is Pxy.
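The naming conventions for quad pels and tiles can be illustrated with a short Python sketch. It assumes that quad pels are aligned to even pixel coordinates and tiles to multiples of four, which matches the examples given (quad pels P00, P02, P20, P22 forming tile P00) but is not stated as a general rule. The function names are hypothetical.

```python
def label(row, col):
    """Format a pixel position as Pxy, with one hexadecimal digit per coordinate."""
    return "P%X%X" % (row, col)


def quad_pel_of(row, col):
    """Upper-left pixel of the 2x2 quad pel containing (row, col).

    Assumes quad pels are aligned to even coordinates.
    """
    return (row & ~1, col & ~1)


def tile_of(row, col):
    """Upper-left pixel of the 4x4 tile containing (row, col).

    Assumes tiles are aligned to multiples of four.
    """
    return (row & ~3, col & ~3)


def quad_pels_in_tile(row, col):
    """The four quad pels of the tile containing (row, col), as upper-left pixels."""
    r, c = tile_of(row, col)
    return [(r, c), (r, c + 2), (r + 2, c), (r + 2, c + 2)]


print([label(r, c) for r, c in quad_pels_in_tile(1, 1)])
```

With these assumptions, pixel P13 belongs to quad pel P02, and every pixel in rows 0-3 and columns 0-3 belongs to tile P00.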
In arrangement 1000a, each tile is stored in four consecutive words of external memory 103. For example, tile P00 is stored in consecutive memory words at addresses 0, 4, 8 and C (big Endian format). In addition, within each word is stored a quad pel. In the present embodiment, the odd memory bank has addresses in which bit 2 has bit value `1` and the even memory bank has addresses in which bit 2 has bit value `0`. Thus, for example, both quad pels P00 and P02 are stored in the even bank, and quad pels P20 and P22 are stored in the odd bank.
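As a quick check of the bank assignment, the following Python sketch derives the bank of each word of tile P00 from bit 2 of its byte address. The helper name bank_of is hypothetical. The sketch does not assume which of the two odd-bank words holds quad pel P20 and which holds P22.

```python
# Byte addresses of the four consecutive words that hold tile P00.
TILE_P00_WORDS = [0x0, 0x4, 0x8, 0xC]


def bank_of(byte_address):
    """Return 'odd' when bit 2 of the byte address is 1, otherwise 'even'."""
    return "odd" if (byte_address >> 2) & 1 else "even"


for address in TILE_P00_WORDS:
    print(hex(address), bank_of(address))
```

Words 0 and 8 fall in the even bank and words 4 and C in the odd bank, consistent with quad pels P00 and P02 being in the even bank and P20 and P22 in the odd bank.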
In arrangement 1000a, the order in which the upper and the lower halves of a quad pel are stored is determined by bit 3 of the memory address. By convention, the upper half of a quad pel refers to the two pixels of the quad pel occupying the "higher" screen positions. For example, since bit 3 of the word address (=0) of quad pel P00 has bit value `0`, the upper halfword stores the lower half of quad pel P00 (i.e. pixels P10 and P11), and the lower halfword stores the upper half of quad pel P00 (i.e. pixels P00 and P01). As used here, the upper halfword refers to the half of the data word having the greater byte addresses. However, since bit 3 of the byte address (=8) of quad pel P02 has the bit value `1`, the upper halfword (i.e. addresses A and B) stores the upper half of the quad pel P02 (i.e. pixels P02 and P03), while the lower half of quad pel P02 (i.e. P12 and P13) is stored in the lower halfword (addresses 8 and 9). As explained below, this alternating pattern of swapping the upper and lower halves of the quad pel every other memory word supports the scan-line access mode.
In addition, to support scan-line mode, the upper and lower halves of the memory word are independently addressed. Specifically, under scan-line mode, bit 3 in the column address provided to access each half of the memory word is different. This is accomplished by providing a different value on two word address bit 1 output terminals (i.e. A3) of chip 100. For example, when fetching the scan line P00, the upper halfword retrieves from address 8 (i.e. bit 3 of byte address 0 toggled) pixels P02 and P03, and the lower halfword retrieves from word address 0 pixels P00 and P01. In arrangement 1000a, both halfwords in each 4-pixel scan line fetch are retrieved from the same even or odd memory bank.
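The effect of addressing the two halfwords independently can be sketched in Python using the two even-bank words described above. This is a simplified model, not the circuit: the dictionary layout and function names are hypothetical, and the scan line P10 case is inferred from the stored layout rather than taken from an example in the text.

```python
# Two words of the even bank under arrangement 1000a, keyed by byte address.
# "upper" is the halfword with the greater byte addresses.
EVEN_BANK = {
    0x0: {"lower": ("P00", "P01"), "upper": ("P10", "P11")},
    0x8: {"lower": ("P12", "P13"), "upper": ("P02", "P03")},
}


def fetch_word(memory, lower_addr, upper_addr):
    """Fetch one 32-bit word whose halfwords are addressed independently."""
    return memory[lower_addr]["lower"], memory[upper_addr]["upper"]


def scan_line_p00(memory):
    """Lower halfword from address 0, upper halfword from address 8 (bit 3 toggled)."""
    lower, upper = fetch_word(memory, lower_addr=0x0, upper_addr=0x8)
    return lower + upper


def scan_line_p10(memory):
    """Inferred case: upper halfword from address 0, lower halfword from address 8."""
    lower, upper = fetch_word(memory, lower_addr=0x8, upper_addr=0x0)
    return upper + lower


print(scan_line_p00(EVEN_BANK))
print(scan_line_p10(EVEN_BANK))
```

In both cases a single access returns four horizontally adjacent pixels, because the two A3 terminals carry different values for the two halfwords.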
Memory controller 104 provides the address translation necessary to translate the address from CPU 150 ("logical address" or "LA") to the address actually provided to each halfword in each memory bank ("physical address" or "PA"). Since byte address bits PA[1:0] are not involved in addressing external memory 103, which receives only word addresses, the mapping between logical addresses and physical addresses in these bits is provided by byte swapping in memory controller 104.
Specifically, under arrangement 1000a, when a quad pel is fetched for a non-scan line access, only one address bit is translated to ensure the upper and lower halves of the quad pel are swapped when the logical byte address bit LA[3] is `1`. Memory controller 104 maps the logical address to the physical address according to the following equation:
PA[1] = LA[3] ⊕ LA[1]
where PA[1] is bit 1 of the physical byte address, and LA[3] and LA[1] are bits 3 and 1 of the logical byte address. The operator ⊕ is the "exclusive-OR" operator. In this instance, the physical address provided to both halfwords of the memory bank addressed is the same.
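Reading the description above as PA[1] = LA[3] XOR LA[1], the non-scan-line translation can be sketched in Python. The sketch assumes that all other address bits pass through unchanged, which follows from the statement that only one address bit is translated. The function name is hypothetical.

```python
def logical_to_physical(la):
    """Translate a logical byte address for a quad pel (non-scan-line) access.

    Assumes PA[1] = LA[3] XOR LA[1] and that every other bit is unchanged.
    """
    la3 = (la >> 3) & 1
    return la ^ (la3 << 1)


for la in (0x0, 0x2, 0x8, 0xA):
    print(hex(la), "->", hex(logical_to_physical(la)))
```

Logical addresses 0 and 2 are unchanged because LA[3] is 0, while logical addresses 8 and A exchange places, which swaps the two halfwords of the word at address 8.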
The logical addresses of the pixels under scan-line mode are shown in FIG. 10c. The logic circuit in memory controller 104 generates the physical address according to the following equations:
Thus, under scan-line mode, memory controller 104 (a) accesses (i) in an even scan line (i.e. scan line Pny, where n is even), the left half of the scan line in the lower halfword, and the right half of the scan line in the upper halfword; (ii) in an odd scan line (i.e. scan line Pny, where n is odd), the left half of the scan line in the upper halfword and the right half of the scan line in the lower halfword; (b) switches, every two scan lines, between accessing the odd memory bank and accessing the even memory bank; (c) accesses, for the right half of a scan line, a halfword whose physical byte address is offset by 8 from the physical byte address of the halfword containing the left half of the scan line (i.e. different values for the two address bits A3 of chip 100).
Arrangement 1000b shown in FIG. 10b supports reference fetch accesses. The logical addresses for a reference frame are shown in FIG. 10d. Under this arrangement, a tile is fetched by fetching the four quad pels in the order of top-left, top-right, bottom-left and bottom-right. In fetching a reference macroblock, tiles are fetched column by column and, within a column, from top to bottom. For example, in FIG. 10b, tile P00 is fetched in the order of quad pels P00, P02, P20 and P22. The reference frame is fetched by fetching tiles P00, P40, P80, PC0, P04, P44, P84, PC4 . . . etc. To take advantage of the efficiencies of memory interleaving and page mode accesses, arrangement 1000b is arranged such that the top-left quad pel and the bottom-left quad pel are located in the even memory bank, and the top-right and bottom-right quad pels are located in the odd memory bank.
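The fetch order described above can be sketched in Python. The sketch assumes a column height of 16 pixel rows, taken from the example sequence P00, P40, P80, PC0 followed by P04. The function names are hypothetical, and the bank labels simply restate that left quad pels are in the even bank and right quad pels in the odd bank.

```python
def label(row, col):
    """Format a pixel position as Pxy, with one hexadecimal digit per coordinate."""
    return "P%X%X" % (row, col)


def tile_order(rows, cols):
    """Tiles column by column, and from top to bottom within a column."""
    return [(r, c) for c in range(0, cols, 4) for r in range(0, rows, 4)]


def quad_pel_order(row, col):
    """Quad pels of tile (row, col) in fetch order, with their memory bank."""
    return [
        (row, col, "even"),         # top-left
        (row, col + 2, "odd"),      # top-right
        (row + 2, col, "even"),     # bottom-left
        (row + 2, col + 2, "odd"),  # bottom-right
    ]


print([label(r, c) for r, c in tile_order(16, 8)])
print([(label(r, c), bank) for r, c, bank in quad_pel_order(0, 0)])
```

Because the fetch order within a tile alternates between the even and odd banks, consecutive accesses can overlap as interleaving requires.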
To minimize delay due to page crossings during a reference frame fetch, memory controller 104 fetches all the tiles of the reference frame in the upper DRAM page before fetching the tiles in the lower DRAM page. FIG. 10e illustrates a reference frame fetch which crosses a memory page boundary.
FIG. 10e shows four tiles 1050a-1050d of a reference frame. In each quad pel of each tile, the hexadecimal numbers at the four corners of the quad pel are physical byte addresses at which the four pixels of the quad pel are stored. For example, the four pixels of quad pel 1 of tile 1050d are stored at physical byte addresses 7E, 7F, 7C and 7D. In FIG. 10e, the DRAM page boundary is between the upper half-tile and the lower half-tile in each of the tiles 1050c and 1050d. If a reference fetch starts at address 28, the page boundary is encountered after fetching quad pel 1 of tile 1050c, which is located at physical byte address 3C. At that point, detecting the page boundary, memory controller 104 generates address 68 rather than x0 to fetch the remaining quad pels of the tiles in the upper DRAM page, rather than crossing over to the lower DRAM page. According to arrangement 1000b of FIG. 10b, in a reference frame access, address 68 is in the same memory bank as address 38 and in the opposite memory bank from address 3C. Consequently, in making the jump from address 3C to address 68, interleaved access is not interrupted.
As mentioned above, data transfers between SMEM 159 and external memory 103 take place through QGMEM 810 and global bus 120. FIGS. 11a and 11b are timing diagrams showing respectively the data transfers from external memory 103 to SMEM 159, and from SMEM 159 to external memory 103. As mentioned above, the data bus portion of global bus 120 is 32-bit, and the interface between QGMEM 810 and SMEM 159 is 128-bit. A 2-bit signal bus Qptr is provided to indicate which of the four 32-bit words ("QG registers") in QGMEM 810 is the source or destination of the 32-bit data on global bus 120. A 1-bit signal "req_smem_stall" indicates two cycles ahead an impending access by QGMEM 810 to SMEM 159, to prevent CPU 150 from accessing SMEM 159 while the QGMEM access is performed.
As shown in FIG. 11a, at cycles 1 and 2, a request for DMA data transfer is written into channel memory entry 6 to signal a data transfer from external memory 103 to SMEM 159. As each 32-bit word is received on memory data bus 105a, memory controller 104 drives the data word onto global bus 120. For example, datum D0 is driven onto global bus 120 during cycles 5 and 6. In this example, the first 32-bit datum is scheduled to be written to the first of four QG registers of QGMEM 810. The destination in QGMEM 810 for datum D0 is indicated in cycles 3 and 4 on the 2-bit Qptr signal bus. The asserted "qgreq" signal enables data on global bus 120 to be written into QGMEM 810. Thus, datum D0 is written into QGMEM 810 during cycles 5 and 6. Datum D1 is likewise written into QGMEM 810 during cycles 7 and 8. A transfer between QGMEM 810 and SMEM 159 is signalled two cycles ahead by asserting "q_smem_stall", which is usually asserted in an external memory to SMEM 159 transfer when QGMEM 810 holds three valid data not already written into SMEM 159, and the fourth datum is currently on global bus 120, e.g. in cycle 14. During cycle 15, all four QG registers of QGMEM 810 are written into SMEM 159.
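The gathering of four 32-bit words into one 128-bit write can be modelled with a short Python sketch. It is a behavioural approximation only: cycle timing is not modelled, the class and method names are hypothetical, and the stall condition is simplified to "three registers hold valid data".

```python
class QGMem:
    """Simplified model of the four 32-bit QG registers of QGMEM 810."""

    def __init__(self):
        self.regs = [None] * 4  # None marks a register with no valid datum

    def stall_request(self):
        """True when three registers are valid, so the next word completes the set."""
        return sum(r is not None for r in self.regs) == 3

    def write(self, qptr, word):
        """Store a 32-bit word from the global bus in the register selected by qptr.

        Returns the four words as one block when all registers are valid,
        modelling the single 128-bit write into SMEM, otherwise None.
        """
        self.regs[qptr] = word & 0xFFFFFFFF
        if all(r is not None for r in self.regs):
            block = tuple(self.regs)
            self.regs = [None] * 4
            return block
        return None


qg = QGMem()
results = [qg.write(i, 0xD0 + i) for i in range(3)]
stall_before_last = qg.stall_request()
block = qg.write(3, 0xD3)
print(results, stall_before_last, block)
```

The first three writes produce no SMEM access. Once three registers are valid the stall request becomes true, and the fourth write releases all four words together.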
FIG. 11b shows a transfer from SMEM 159 to external memory 103. During cycles 1 and 2, a transfer request is written into channel memory entry 6 to signal a block memory transfer from SMEM 159 to external memory 103. In this example, the four QG registers of QGMEM 810 have been previously loaded from SMEM 159. The 2-bit QGptr signal selects which of the four QG registers of QGMEM 810 is active. While qgreq is asserted, the data in the 32-bit register of QGMEM 810 corresponding to the value of QGptr are driven onto global bus 120. In this example, data D0 and D1 are driven onto global bus 120 during cycles 5, 6, 7 and 8. A data transfer between QGMEM 810 and SMEM 159 is signalled three cycles ahead by asserting the signal "q_smem_stall", which is usually asserted in an SMEM 159 to external memory transfer when QGMEM 810 holds only on