United States Patent 6,587,590
Pan
July 1, 2003

Title

Method and system for computing 8×8 DCT/IDCT and a VLSI implementation

Abstract

A method and system for computing 2-D DCT/IDCT which is easy to implement with VLSI technology to achieve high throughput to meet the requirements of high definition video processing in real time is described. A direct 2-D matrix factorization approach is utilized to compute the 2-D DCT/IDCT. The 8×8 DCT/IDCT is computed through four 4×4 matrix multiplication sub-blocks. Each sub-block is half the size of the original 8×8 size and therefore requires a much lower number of multiplications. Additionally, each sub-block can be implemented independently with localized interconnection so that parallelism can be exploited and a much higher DCT/IDCT throughput can be achieved.


Inventors: Pan; Feng (Cupertino, CA)
Assignee: The Trustees of the University of Pennsylvania (Philadelphia, PA)
Appl. No.: 402367
Filed: February 22, 2000
PCT File Date: February 2, 1999
PCT No: PCT/US99/02186
PCT Pub Date: August 5, 1999
PCT Pub No: WO99/39303

Current U.S. Class: 382/250; 708/402
Current International Class: G06F 17/14 (20060101)
Field of Search: 382/250,233,276,277,280; 708/402,405,404

U.S. Patent Documents
5309527    May 1994          Ohki
5539836    July 1996         Babkin
5668748    September 1997    Huang
Primary Examiner: Tran; Phuoc
Attorney, Agent or Firm: Sterne, Kessler, Goldstein & Fox P.L.L.C.

Parent Case Text



This application claims the benefit of provisional application Ser. No. 60/073,367 filed Feb. 2, 1998.

Claims


What is claimed is:
1. A system for computing a two-dimensional discrete cosine transform (2D-DCT) of input data, the input data including at least first through sixty-fourth input elements of an 8 row×8 column matrix X, comprising: a shuffler having four parallel inputs and four parallel outputs, each input receiving a respective group of sixteen rows of the first through 64th elements, wherein said shuffler processes each row of four parallel input elements (a,b,c,d) received on said four inputs in parallel and outputs four output elements (x++,x-+,x+-,x--) in parallel for each processed input row, each output element representing a different linear combination of four corresponding input elements; and first through fourth sub-block operators (EE, EO, OE, OO) each coupled in parallel to a respective one of said four outputs of said shuffler; wherein said first through fourth sub-block operators each process a set of sixteen output elements from said shuffler independently, and generate respective first through fourth sets of sixteen matrix products, each set of matrix products representing a product of three independent 4×4 matrix multiplications of a respective set of said sixteen output elements output from said shuffler.

2. The system of claim 1, wherein said shuffler calculates first through fourth output elements for each row of first through fourth input elements such that: said first output element (x++) equals a sum of said first through fourth input elements (a+b+c+d); said second output element (x-+) equals a sum of said first input element minus said second input element plus said third input element and minus said fourth input element (a-b+c-d); said third output element (x+-) equals a sum of said first input element plus said second input element minus said third input element and minus said fourth input element (a+b-c-d); and said fourth output element (x--) equals a sum of said first input element minus said second input element minus said third input element and plus said fourth input element (a-b-c+d).

3. The system of claim 2, wherein said shuffler comprises first through fourth adders and first through fourth subtractors interconnected in two layers, each adder and subtractor has two inputs and an output; wherein: said first adder inputs receive said first and second input elements and said first adder output is coupled to said third adder and said third subtractor; said first subtractor inputs receive said first and second input elements and said first subtractor output is coupled to said fourth adder and said fourth subtractor; said second adder inputs receive said third and fourth input elements and said second adder output is coupled to said third adder and said third subtractor; and said second subtractor inputs receive said third and fourth input elements and said second subtractor output is coupled to said fourth adder and said fourth subtractor.

4. The system of claim 1, wherein said shuffler includes at least four adders and four subtractors.

5. The system of claim 1, wherein said first sub-block operator (EE subblock) outputs said first set of sixteen matrix products equal to a 4×4 matrix Z1; said second sub-block operator (EO subblock) outputs said second set of sixteen matrix products equal to a 4×4 matrix Z2; said third sub-block operator (OE subblock) outputs said third set of sixteen matrix products equal to a 4×4 matrix Z3; and said fourth sub-block operator (OO subblock) outputs said fourth set of sixteen matrix products equal to a 4×4 matrix Z4; where ##EQU90## 4×4 matrix E has only even coefficients of a coefficient vector W and 4×4 matrix O has only odd coefficients of said coefficient vector W as follows: ##EQU91## where said coefficient vector W consists of coefficients w_k, each w_k being proportional to cos(kπ/16) for k=1, 2, ..., 7.

6. The system of claim 5, wherein: said first sub-block operator (EE subblock) comprises six multipliers; sixteen accumulators; and a 6 to 16 multiplexer controlled by a first mux-selector signal, said 6 to 16 multiplexer being coupled between each of said six multipliers and each of said sixteen accumulators.

7. The system of claim 6, wherein: said second sub-block operator (EO subblock) comprises twelve multipliers; sixteen accumulators; and a 12 to 16 multiplexer controlled by a second mux-selector signal, said 12 to 16 multiplexer being coupled between each of said twelve multipliers and each of said sixteen accumulators; said third sub-block operator (OE subblock) comprises twelve multipliers; sixteen accumulators; and a 12 to 16 multiplexer controlled by a third mux-selector signal, said 12 to 16 multiplexer being coupled between each of said twelve multipliers and each of said sixteen accumulators; and said fourth sub-block operator (OO subblock) comprises ten multipliers; sixteen accumulators; and a 10 to 16 multiplexer controlled by a fourth mux-selector signal, said 10 to 16 multiplexer being coupled between each of said ten multipliers and each of said sixteen accumulators.

8. The system of claim 7, wherein each multiplier comprises a pseudo-multiplier.

9. The system of claim 1, further comprising: first through fourth output stages coupled to receive outputs from said first through fourth sub-block operators, respectively; each of said first through fourth output stages comprising: a plurality of latches; a 16 to 1 multiplexer; and a clipper; said 16 to 1 multiplexer being coupled between each latch and said clipper.

10. The system of claim 9, wherein said clipper comprises a truncation and saturation control unit.

11. The system of claim 1, wherein said first through fourth sub-block operators each have a 13-bit coefficient quantization and a 15-bit finite internal wordlength.

12. The system of claim 1, wherein the input data comprises video data compressed according to at least one of a MPEG and JPEG standard.

13. A system for computing a two-dimensional inverse discrete cosine transform (2D-IDCT) of input data, the input data including at least first through sixty-fourth input elements of an 8 row×8 column matrix X_ij, where 0≤i,j≤3, comprising: a multiplexer that divides the input data into first to fourth 4×4 sub-matrices Xee, Xoe, Xeo, and Xoo based on whether each element has an even or odd row and column coefficient such that 16 elements having an even row and an even column are included in sub-matrix Xee, 16 elements having an odd row and even column are included in sub-matrix Xoe, 16 elements having an even row and odd column are included in sub-matrix Xeo, and 16 elements having an odd row and odd column are included in sub-matrix Xoo; and first through fourth sub-block operators (EE, EO, OE, OO) receiving said first to fourth 4×4 sub-matrices Xee, Xoe, Xeo, and Xoo, respectively, said sub-block operators processing said first to fourth 4×4 sub-matrices Xee, Xoe, Xeo, and Xoo independently and generating respective first through fourth sets of sixteen matrix products, each set of matrix products representing a product of three independent 4×4 matrix multiplications of a respective set of said sixteen input elements in said input data.

14. The system of claim 13, further comprising: a shuffler having four parallel inputs and four parallel outputs, each input coupled in parallel to receive an output from a respective one of said first through fourth sub-block operators.

15. The system of claim 14, wherein said shuffler includes at least four adders and four subtractors.

16. The system of claim 14, wherein said shuffler outputs 64 elements of an output matrix Z representing a 2D IDCT transform of the input data of matrix X.

17. The system of claim 16, wherein output matrix Z includes four 4×4 submatrices Z5 to Z8, said shuffler outputting first to fourth sets of 16 elements in the submatrices Z5 to Z8 on said four outputs of said shuffler in parallel, where said submatrices Z5 to Z8 are defined by: ##EQU92## 4×4 matrix E has only even coefficients of a coefficient vector W and 4×4 matrix O has only odd coefficients of said coefficient vector W as follows: ##EQU93## where said coefficient vector W consists of coefficients w_k, each w_k being proportional to cos(kπ/16) for k=1, 2, ..., 7.

18. The system of claim 13, wherein: said first sub-block operator (EE subblock) comprises six multipliers; sixteen accumulators; and a 6 to 16 multiplexer controlled by a first mux-selector signal, said 6 to 16 multiplexer being coupled between each of said six multipliers and each of said sixteen accumulators; said second sub-block operator (EO subblock) comprises twelve multipliers; sixteen accumulators; and a 12 to 16 multiplexer controlled by a second mux-selector signal, said 12 to 16 multiplexer being coupled between each of said twelve multipliers and each of said sixteen accumulators; said third sub-block operator (OE subblock) comprises twelve multipliers; sixteen accumulators; and a 12 to 16 multiplexer controlled by a third mux-selector signal, said 12 to 16 multiplexer being coupled between each of said twelve multipliers and each of said sixteen accumulators; and said fourth sub-block operator (OO subblock) comprises ten multipliers; sixteen accumulators; and a 10 to 16 multiplexer controlled by a fourth mux-selector signal, said 10 to 16 multiplexer being coupled between each of said ten multipliers and each of said sixteen accumulators.

19. The system of claim 18, wherein each multiplier comprises a pseudo-multiplier.

20. The system of claim 13, further comprising: first through fourth output stages coupled between said first through fourth sub-block operators, respectively, and inputs of said shuffler, each of said first through fourth output stages comprising: a plurality of latches; and a 16 to 1 multiplexer; and first through fourth clippers coupled to said first through fourth outputs of said shuffler.

21. The system of claim 20, wherein each clipper comprises a truncation and saturation control unit.

22. The system of claim 13, wherein said first through fourth sub-block operators each have a 13-bit coefficient quantization and a 16-bit finite internal wordlength.

23. The system of claim 13, wherein the input data comprises video pictures compressed according to at least one of a MPEG and JPEG standard.

24. A hybrid 2D-DCT and IDCT system that receives DCT input data and IDCT input data comprising: a first input multiplexer having a first input that receives the DCT input data and four outputs; a shuffler having four inputs coupled to said four outputs of said first input multiplexer and an output; a second input multiplexer having a first input coupled to said output of said shuffler and four outputs; and four sub-block operators, each having an input coupled in parallel to a respective output of said second input multiplexer and each sub-block operator having sixteen outputs.

25. The hybrid system of claim 24, further comprising: four DCT clippers; four IDCT clippers; latches coupled to each sub-block operator output; and an output multiplexer having inputs coupled to each of said latches and outputs coupled to said first input multiplexer and to said four DCT clippers; and wherein outputs of said shuffler are also coupled to said four IDCT clippers.

26. The system of claim 24, wherein the hybrid 2D-DCT and IDCT system is implemented in hardware on a single VLSI chip.

27. A method for switching between transforming 2D-DCT data and 2D-IDCT data, comprising the steps of: switching first and second multiplexers to pass the 2D-DCT input data through a shuffler then through four sub-block operators to obtain a matrix Z output data representing a 2D-DCT of the input 2D-DCT input data; and switching the first and the second multiplexers to pass the 2D-IDCT input data through the four sub-block operators and then the shuffler to obtain a matrix Z output data representing a 2D-IDCT of the input 2D-IDCT input data, wherein the input data comprises either video data or decoded video data, and wherein said first switching step is performed prior to encoding the video data and said second switching step is performed on the decoded video data after inverse scanning and quantization.
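The shuffler arithmetic recited in claims 2 and 3 can be illustrated in software. The following is a minimal sketch and not the claimed hardware: it assumes plain Python integers in place of fixed-width signals, and the function names shuffle_direct and shuffle_two_layer are hypothetical. It follows the wording of claim 2 (x+- is the first plus the second, minus the third, minus the fourth input element) and shows that the two-layer adder/subtractor network of claim 3 yields the same four linear combinations.

```python
def shuffle_direct(a, b, c, d):
    """Four linear combinations of one input row, following the wording of claim 2."""
    x_pp = a + b + c + d   # x++
    x_mp = a - b + c - d   # x-+
    x_pm = a + b - c - d   # x+-
    x_mm = a - b - c + d   # x--
    return x_pp, x_mp, x_pm, x_mm


def shuffle_two_layer(a, b, c, d):
    """Same outputs computed with the two-layer network described in claim 3."""
    # First layer: two adders and two subtractors work on the input pairs.
    add1 = a + b
    sub1 = a - b
    add2 = c + d
    sub2 = c - d
    # Second layer: the third and fourth adders/subtractors combine layer-one outputs.
    add3 = add1 + add2   # x++
    sub3 = add1 - add2   # x+-
    add4 = sub1 + sub2   # x-+
    sub4 = sub1 - sub2   # x--
    return add3, add4, sub3, sub4
```

In this sketch the two-layer form uses eight two-input additions or subtractions per row, whereas evaluating each combination separately would use twelve.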

Description

BACKGROUND OF THE INVENTION

1. Field of the Invention

This invention relates generally to calculating the 2-Dimensional 8×8 Discrete Cosine Transform (2-D DCT) and the Inverse Discrete Cosine Transform (2-D IDCT), and its very large scale integrated (VLSI) implementation. Specifically, the present invention is well suited to meet the real-time digital processing requirements of digital High-Definition Television (HDTV).

2. Related Art

OUTLINE OF RELATED ART SECTION
1.0 Overview of Video Coding and MPEG Implementations
  1.1 Video Compression
  1.2 MPEG Video Compression: A Quick Look
    1.2.1 MPEG Video Sequences, Groups and Pictures
    1.2.2 MPEG Video Slice, Macroblock and Block
    1.2.3 The Motion Estimation/Compensation in MPEG
    1.2.4 The Discrete Cosine Transform in MPEG
    1.2.5 The Quantization in MPEG
    1.2.6 The Zigzag Scan and Variable Length Coding in MPEG
    1.2.7 MPEG Video Encoding Process
    1.2.8 MPEG Video Decoding Process
  1.3 MPEG-1 Video Standard
  1.4 MPEG-2 Video Standard
    1.4.1 Fields, Frames and Pictures
    1.4.2 Chrominance Sampling
    1.4.3 Scalability
    1.4.4 Profiles and Levels
  1.5 Hybrid Implementation Scheme for MPEG-2 Video System
2.0 DCT/IDCT Algorithms and Hardware Implementations
  2.1 Introduction
  2.2 1-D DCT/IDCT Algorithms and Implementations
    2.2.1 Indirect 1-D DCT via Other Discrete Transforms
    2.2.2 1-D DCT via Direct Factorizations
    2.2.3 1-D DCT Based on Recursive Algorithms
    2.2.4 1-D DCT/IDCT Hardware Implementations
  2.3 2-D DCT/IDCT Algorithms and Implementations
    2.3.1 2-D DCT via Other Discrete Transforms
    2.3.2 2-D DCT by Row-Column Method (RCM)
    2.3.3 2-D DCT Based on Direct Matrix Factorization/Decomposition
    2.3.4 2-D DCT/IDCT Hardware Implementations
  2.4 Summary

1.0 OVERVIEW OF VIDEO CODING AND MPEG IMPLEMENTATIONS

In this section, a brief overview of video compression, Moving Pictures Experts Group (MPEG) video protocols and different implementation approaches is presented. A list of references cited in this application is included in an Appendix. Each of the references listed is incorporated herein by reference in its entirety.

1.1 Video Compression

The reduction of transmission and storage requirements for digitized video signals has been a research and development topic all over the world for more than 30 years.

Many efforts have been made to deliver or store digital television signals, which have a bit-rate of more than 200 Mbit/s in uncompressed format and must be brought down to a level that can be handled economically by current video processing technology. For example, suppose the pictures in a sequence are digitized as discrete grids or arrays with 360 pels (picture elements) per raster line and 288 lines/picture (a typical resolution for MPEG-1 video compression), with three-color separation and 8-bit sampling precision for each color. The uncompressed video sequence at 24 pictures/second is then roughly 60 Mbit/s, and a one-minute video clip requires about 448 Mbytes of storage space.
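The figures in the example above can be checked with a few lines of arithmetic. This is a small sketch only; the function name uncompressed_bit_rate is hypothetical, and it assumes that one Mbit is 10^6 bits and one Mbyte is 10^6 bytes.

```python
def uncompressed_bit_rate(pels_per_line, lines, components, bits_per_sample, pictures_per_second):
    """Bits per second of an uncompressed sequence with the given sampling parameters."""
    bits_per_picture = pels_per_line * lines * components * bits_per_sample
    return bits_per_picture * pictures_per_second


# Parameters taken from the example in the text.
rate = uncompressed_bit_rate(360, 288, 3, 8, 24)
megabits_per_second = rate / 1e6           # assumes 1 Mbit = 10**6 bits
megabytes_per_minute = rate * 60 / 8 / 1e6  # assumes 1 Mbyte = 10**6 bytes
print(f"{megabits_per_second:.1f} Mbit/s, {megabytes_per_minute:.0f} Mbytes per minute")
```

Under these assumptions the result is about 59.7 Mbit/s and 448 Mbytes per minute, in line with the rounded values quoted in the text.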

The International Organization for Standardization (ISO) started its moving picture standardization process in 1988 with a strong emphasis on real-time decoding of compressed data stored on digital storage devices. A Moving Pictures Experts Group (MPEG) was formed in May 1988 and a consensus was reached to target the digital storage and real-time decoding of video with bit-rates around 1.5 Mbit/s (MPEG-1 protocol) [MPEG1]. At the MPEG meeting held in Berlin, Germany in December 1990, an MPEG-2 proposal was presented that primarily targeted higher bit-rates, larger picture sizes, and interlaced frames. The MPEG-2 proposal attempted to address a much broader set of applications than MPEG-1 (such as television broadcasting, digital storage media, digital high-definition TV (HDTV) and video communication) while maintaining all of the MPEG-1 video syntax. Moreover, extensions were adopted to add flexibility and functionality to the standard. Most importantly, a spatial scalable extension was added to allow video data streams with multiple resolutions to provide support for both normal TV and HDTV. Other scalable extensions allow the data stream to be partitioned into different layers in order to optimize transmission and reception over existing and future networks [MPEG2].

An overview of MPEG video compression techniques, MPEG-1's video layers and MPEG-2's video layers is presented in sections 1.2, 1.3 and 1.4, respectively. A proposed hybrid implementation scheme for an MPEG-2 video codec is shown in section 1.5. An outline of the rest of the thesis is presented in section 1.6.

1.2 MPEG Video Compression: A Quick Look

An MPEG video codec is specifically designed for compression of video sequences. Because a video sequence is simply a series of pictures taken at closely spaced time intervals, these pictures tend to be quite similar to each other except when a scene change takes place. The MPEG-1 and MPEG-2 codecs are designed to take advantage of this similarity using both past and future temporal information (inter-frame coding). They also utilize commonality within each frame, such as a uniform background, to lower the bit-rate (intra-frame coding) [MPEG1, MPEG2].

1.2.1 MPEG Video Sequences, Groups and Pictures

An MPEG video sequence is made up of individual pictures occurring at fixed time increments. Except for certain critical timing information in the MPEG systems layers, an MPEG video sequence bitstream is completely self-contained and is independent of other video bitstreams.

Each video sequence is divided into one or more groups of pictures, and each group of pictures is composed of one or more pictures of three different types: I-, P- and B-type. I-pictures (intra-coded pictures) are coded independently, entirely without reference to other pictures. P- and B-pictures are compressed by coding the differences between a reference picture and the current one, thereby exploiting the similarities between the current and reference pictures to achieve a high compression ratio. One example of a typical arrangement of MPEG I-, P- and B-pictures in display order is illustrated in FIG. 1.

The first coded picture in each video sequence must be an I-picture. I-pictures may occasionally be inserted at different positions in a video sequence to prevent the propagation of coding errors. For I-pictures, the coding method used by MPEG is similar to that defined by JPEG [JPEG].

P-pictures (predictive-coded pictures) obtain predictions from temporally preceding I- or P-pictures in the sequence and B-pictures (bidirectionally predictive-coded pictures) obtain predictions from the nearest preceding and/or upcoming I- or P-pictures in the sequence. B-pictures may predict from preceding pictures, upcoming pictures, both, or neither. Similarly, P-pictures may predict from a preceding picture or use intra-coding.

A given sequence of pictures is encoded in an order different from that in which the pictures are displayed when viewing the sequence. An example of the encoding sequence of MPEG I-, P- and B-pictures is illustrated in FIG. 2.
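The reordering can be illustrated with a short sketch. It is not taken from the figures: it assumes the simple rule that each B-picture is coded after the next I- or P-picture in display order, since that later picture is one of its references, and the function name display_to_coding_order is hypothetical.

```python
def display_to_coding_order(display_types):
    """Return (display_index, picture_type) pairs in an assumed coding order.

    Assumption: a B-picture is held back until the next I- or P-picture
    (its upcoming reference) has been coded.
    """
    coded = []
    pending_b = []
    for index, picture_type in enumerate(display_types):
        if picture_type == "B":
            pending_b.append((index, picture_type))
        else:
            # Code the reference picture first, then the B-pictures waiting on it.
            coded.append((index, picture_type))
            coded.extend(pending_b)
            pending_b = []
    # Trailing B-pictures with no later reference are coded last in this sketch.
    coded.extend(pending_b)
    return coded


coding_order = display_to_coding_order("IBBPBBP")
print("".join(picture_type for _, picture_type in coding_order))
```

For the display order IBBPBBP this sketch yields the coding order IPBBPBB, with each pair of B-pictures following the P-picture they depend on.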

Each component of a picture is made up of a two-dimensional (2-D) array of samples. Each horizontal line of samples in this 2-D grid is called a raster line, and each sample in a raster line is a digital representation of the intensity of the component at that point on the raster line. For color sequences, each picture has three components: a luminance component and two chrominance components. The luminance provides the intensity of the sample point, whereas the two chrominance components express the equivalent of color hue and saturation at the sample point. This representation is mathematically equivalent to an RGB primaries representation but is better suited for efficient compression. RGB can be used if less efficient compression is acceptable.
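The equivalence between a luminance/chrominance representation and RGB can be sketched as an invertible linear mapping. The text does not give conversion weights, so the sketch below assumes the ITU-R BT.601 luma weights (0.299, 0.587, 0.114), works in floating point, and ignores the offsets and value ranges a real codec would apply; the function names are hypothetical.

```python
KR, KB = 0.299, 0.114   # assumed BT.601 luma weights for red and blue
KG = 1.0 - KR - KB      # remaining weight for green


def rgb_to_ycbcr(r, g, b):
    """Map RGB to one luminance (Y) and two chrominance (Cb, Cr) values."""
    y = KR * r + KG * g + KB * b
    cb = (b - y) / (2.0 * (1.0 - KB))
    cr = (r - y) / (2.0 * (1.0 - KR))
    return y, cb, cr


def ycbcr_to_rgb(y, cb, cr):
    """Invert rgb_to_ycbcr, recovering the original RGB values."""
    r = y + 2.0 * (1.0 - KR) * cr
    b = y + 2.0 * (1.0 - KB) * cb
    g = (y - KR * r - KB * b) / KG
    return r, g, b
```

Because the mapping is linear and invertible, converting to luminance/chrominance and back returns the original RGB values up to floating-point rounding; a grey sample with equal R, G and B has chrominance values of zero in this sketch.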

The equivalent counterpart of a picture in broadcast video (for example analog NTSC) is a frame, which is further divided into two fields. Each field has half the raster lines of the full frame and the fields are interleaved such that alternate raster lines in the frame belong to alternate fields.

1.2.2 MPEG Video Slice, Macroblock and Block

The basic building block of an MPEG picture is the macroblock. The macroblock consists of one 16×16 array of luminance samples plus one, two or four 8×8 blocks of samples for each of the two chrominance components. The 16×16 luminance array is actually composed of four 8×8 blocks of samples. The 8×8 block is the unit structure of the MPEG video codec and is the quantity that is processed as an entity in the codec.
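The relationship between the 16-by-16 luminance array and its four 8-by-8 blocks can be sketched as follows. The sketch assumes the array is held as a list of 16 rows of 16 samples, and the block ordering (top-left, top-right, bottom-left, bottom-right) and the name split_luminance are assumptions made for illustration.

```python
def split_luminance(macroblock):
    """Split a 16x16 luminance array into four 8x8 blocks.

    Assumed order of the returned blocks: top-left, top-right,
    bottom-left, bottom-right.
    """
    if len(macroblock) != 16 or any(len(row) != 16 for row in macroblock):
        raise ValueError("expected a 16x16 array of samples")
    blocks = []
    for top in (0, 8):
        for left in (0, 8):
            # Take 8 rows starting at 'top' and 8 columns starting at 'left'.
            blocks.append([row[left:left + 8] for row in macroblock[top:top + 8]])
    return blocks
```

Each returned block is then a candidate input to an 8-by-8 transform stage.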

Each MPEG picture is composed of slices, where each slice is a contiguous sequence of macroblocks in raster scan order. The slice starts at a specific address or position in the picture specified in the slice header. Slices can continue from one macroblock row to the next in MPEG-1, but not in MPEG-2.

1.2.3 The Motion Estimation/Compensation in MPEG

If there is motion in the sequence, a better prediction is often obtained by coding differences relative to reference areas that are shifted with respect to the area being coded; a process known as motion compensation. The process of determining the motion vectors in the encoder is called motion estimation, and the unit area being predicted is a macroblock.

The motion vectors describing the direction and amount of motion of the macroblocks are transmitted to the decoder as part of the bitstream. The decoder then knows which area of the reference picture was used for each prediction, and sums the decoded difference with this motion compensated prediction to get the final result. The encoder must follow the same procedure when the reconstructed picture will be used for predicting other pictures. The vector is the same for every pel in the same macroblock, and the vector precision is either full-pel or half-pel.
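The decoder-side step described above, fetching a shifted area of the reference picture and adding the decoded difference, can be sketched as follows. This is an illustration under stated assumptions: full-pel vectors only, pictures held as lists of rows, a vector given as (rows, columns), no handling of areas that fall outside the picture, and hypothetical function names.

```python
def motion_compensate(reference, top, left, vector, size):
    """Fetch the size x size area of the reference picture shifted by a full-pel vector.

    Assumption: the shifted area lies entirely inside the reference picture.
    """
    dy, dx = vector
    rows = reference[top + dy:top + dy + size]
    return [row[left + dx:left + dx + size] for row in rows]


def reconstruct(prediction, difference):
    """Sum the decoded difference with the motion compensated prediction."""
    return [[p + d for p, d in zip(p_row, d_row)]
            for p_row, d_row in zip(prediction, difference)]


# Tiny example: a 4x4 reference picture and a 2x2 area for readability.
reference = [[10 * r + c for c in range(4)] for r in range(4)]
prediction = motion_compensate(reference, top=1, left=1, vector=(1, -1), size=2)
decoded = reconstruct(prediction, [[1, -1], [0, 2]])
```

A real macroblock would use a 16-by-16 area, and half-pel vectors would additionally require interpolation between neighboring pels, which this sketch omits.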

1.2.4 The Discrete Cosine Transform in MPEG

The discrete cosine transform (DCT) is the critical part of both intra and inter coding for MPEG video compression. The DCT has certain properties that simplify coding models and make the coding more efficient in terms of perceptual quality measures.

Basically, the DCT is a method of decomposing a block of data into its spatial frequency components. The amplitude of each coefficient in the spatial frequency (coefficient) domain represents the contribution of that spatial frequency pattern in the block of data being analyzed. If only the low-frequency DCT coefficients are nonzero, the data in the block vary slowly with position. If high frequencies are present, the block intensity changes rapidly from pel to pel.
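The behavior described above can be observed with a direct evaluation of the 8-by-8 2-D DCT. The sketch below assumes the commonly used orthonormal DCT-II definition (scale factor 1/4 with a 1/sqrt(2) weight on the zero-frequency terms), which this passage does not spell out, and it evaluates the double sum directly rather than using any fast factorization; the name dct_2d is hypothetical.

```python
import math

N = 8  # block size


def dct_2d(block):
    """Directly evaluate the 8x8 2-D DCT-II (assumed orthonormal scaling)."""
    def weight(k):
        return math.sqrt(0.5) if k == 0 else 1.0

    coefficients = [[0.0] * N for _ in range(N)]
    for u in range(N):
        for v in range(N):
            total = 0.0
            for i in range(N):
                for j in range(N):
                    total += (block[i][j]
                              * math.cos((2 * i + 1) * u * math.pi / (2 * N))
                              * math.cos((2 * j + 1) * v * math.pi / (2 * N)))
            coefficients[u][v] = 0.25 * weight(u) * weight(v) * total
    return coefficients


# A perfectly flat block: all of its energy lands in the zero-frequency coefficient.
flat_block = [[100] * N for _ in range(N)]
flat_coefficients = dct_2d(flat_block)
```

For the flat block only the coefficient at position (0, 0) is nonzero (800 under the assumed scaling), which matches the statement that slowly varying data produce only low-frequency coefficients. This direct form needs 64 multiply-accumulate steps per coefficient and is shown for clarity only.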

1.2.5 The Quantization in MPEG

When the DCT is computed for a block of pels, it is desirable to represent the high spatial frequency coefficients with less precision and the low spatial frequency ones with more precision. This is done by a process called quantization. A DCT coefficient is quantized by dividing it by a nonzero positive integer called a quantization value and rounding it to the nearest integer. The bigger the quantization value is, the lower the precision is of the quantized DCT coefficient. Lower-precision coefficients can be transmitted or stored with fewer bits. Generally speaking, the human eye is more sensitive to lower spatial frequency effects than higher ones, which is why the lower frequencies are quantized with higher precision.
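The quantization step described above can be sketched in a few lines. The text does not say how ties are rounded, so the sketch assumes that halves round away from zero; the function names quantize and dequantize are hypothetical.

```python
import math


def quantize(coefficient, quantization_value):
    """Divide by the quantization value and round to the nearest integer.

    Assumption: exact halves are rounded away from zero.
    """
    if quantization_value <= 0:
        raise ValueError("quantization value must be a positive integer")
    ratio = coefficient / quantization_value
    return int(math.copysign(math.floor(abs(ratio) + 0.5), ratio))


def dequantize(level, quantization_value):
    """Approximate the original coefficient from its quantized level."""
    return level * quantization_value


# The same coefficient quantized finely and coarsely.
fine = dequantize(quantize(37, 4), 4)
coarse = dequantize(quantize(37, 32), 32)
```

With a quantization value of 4 the coefficient 37 is recovered as 36, while a value of 32 recovers it as 32, illustrating that a larger quantization value gives lower precision.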

As noted above, a macroblock may be composed of four 8×8 blocks of luminance samples and two 8×8 blocks of chrominance samples. A lower resolution is used here for the chrominance blocks because the human eye can resolve higher spatial frequencies in luminance than in chrominance.

In intra coding, the DCT coefficients are almost completely decorrelated, that is, they are nearly independent of one another and can therefore be coded independently. Decorrelation is of great theoretical and practical interest in terms of construction of the coding model. The coding performance is also profoundly influenced by the visually-weighted quantization.

In non-intra coding, the DCT does not greatly improve the decorrelation, since the difference signal obtained by subtracting the prediction from the reference pictures is already fairly well decorrelated. However, quantization is still a powerful compression technique for controlling the bit-rate, even if decorrelation is not improved very much by the DCT.

Since the DCT coefficient properties are actually quite different for intra and inter pictures, different quantization tables are used for intra and inter coding.

1.2.6 The Zigzag Scan and Variable Length Coding in MPEG

The quantized 2-D DCT coefficients are arranged according to a 1-D sequence known as the zigzag scanning order. In most cases, the scan orders the coefficients in ascending spatial frequencies, as illustrated in FIG. 3. By using a quantization table which strongly deemphasizes higher spatial frequencies, only a few low-frequency coefficients are nonzero in a typical block, which results in very high compression.
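A zigzag ordering can be generated programmatically. The figure is not reproduced here, so the sketch below assumes the conventional zigzag pattern in which anti-diagonals are visited in order of increasing row-plus-column index and the traversal direction alternates from one diagonal to the next; the function names are hypothetical.

```python
def zigzag_order(n=8):
    """Return (row, column) positions of an n x n block in an assumed zigzag order."""
    positions = [(i, j) for i in range(n) for j in range(n)]

    def key(position):
        i, j = position
        diagonal = i + j
        # Odd diagonals run downward (row increasing), even ones upward (column increasing).
        return (diagonal, i if diagonal % 2 else j)

    return sorted(positions, key=key)


def zigzag_scan(block):
    """Flatten a square 2-D block into a 1-D list following zigzag_order."""
    return [block[i][j] for i, j in zigzag_order(len(block))]
```

The scan starts at the zero-frequency position (0, 0) and ends at (7, 7), so coefficients that are typically zero after quantization cluster at the end of the 1-D sequence.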

After the quantization, the 1-D sequence is coded losslessly so that the decoder can reconstruct exactly the same results. For MPEG, an approximately optimal coding technique, based on Huffman coding, is used to generate the tables of variable length codes needed for this task. Variable length codes are needed to achieve good coding efficiency, as very short codes must be used for highly probable events. Run-length coding and some specially defined symbols (such as end-of-block, EOB) permit efficient coding of DCTs with mostly zero coefficients.
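The role of run-length coding and the end-of-block symbol can be sketched as follows. This is an illustration only: it represents each nonzero coefficient as a (run of preceding zeros, level) pair and appends a single EOB marker, and it omits the variable length code tables that would map these symbols to bits. The names run_length_encode, run_length_decode and EOB are hypothetical.

```python
EOB = "EOB"  # hypothetical end-of-block marker


def run_length_encode(scanned):
    """Encode a zigzag-scanned sequence as (zero_run, level) pairs followed by EOB."""
    last_nonzero = -1
    for index, value in enumerate(scanned):
        if value != 0:
            last_nonzero = index
    symbols = []
    run = 0
    for value in scanned[:last_nonzero + 1]:
        if value == 0:
            run += 1
        else:
            symbols.append((run, value))
            run = 0
    symbols.append(EOB)  # everything after the last nonzero value is implied zero
    return symbols


def run_length_decode(symbols, length=64):
    """Rebuild the 1-D sequence, padding with zeros after EOB."""
    values = []
    for symbol in symbols:
        if symbol == EOB:
            break
        run, level = symbol
        values.extend([0] * run)
        values.append(level)
    values.extend([0] * (length - len(values)))
    return values


scanned = [12, 5, 0, 0, -3, 0, 1] + [0] * 57
symbols = run_length_encode(scanned)
```

In this example 64 coefficients are represented by four pairs and one EOB marker, and decoding restores the original sequence exactly.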

1.2.7 MPEG Video Encoding Process

MPEG video encoding is a process that reads a stream of input picture samples and produces a valid coded bitstream as defined in the specification. The high-level coding system diagram shown in FIG. 4 illustrates the structure of a typical encoder system 400. MPEG video divides the pictures in a sequence into three basic categories: I-, P- and B-pictures as described previously.

Since I-pictures are coded without reference to neighboring pictures in the sequence, the encoder only exploits the correlation within the picture. The incoming picture 405 will go directly through switch 410 into 2-D DCT module 420 to get the data in each block decomposed into underlying spatial frequencies. Since the response of the human visual system is much more sensitive to low spatial frequencies than high ones, the frequencies are quantized with a quantization table with 64 entries in quantization module 430, in which each entry is a function of spatial frequency for each DCT coefficient. In zigzag scan module 470, the quantized coefficients are then arranged qualitatively from low to high spatial frequency following the same or a similar zigzag scan order to that shown in FIG. 3. The rearranged 1-D sequence data is further processed with an entropy coding (Huffman coding) scheme to achieve further compression. Simultaneously, the quantized coefficients are also used to reconstruct the decoded blocks using inverse quantization (module 440) and an inverse 2-D DCT (reconstruction module 450). The reconstructed blocks stored in frame store memory 455 are used as references for future differential coding of P- and B-pictures.

In contrast, P- and B-pictures are coded as the differences between the current macroblocks and the ones in preceding and/or upcoming reference pictures. If the image does not change much from one picture to the next, the difference will be insignificant and can be coded very effectively. If there is motion in the sequence, a better prediction can be obtained from pels in the reference picture that are shifted relative to the current picture pels (see motion estimation module 460). The differential results will be further compressed by 2-D DCT, quantization, zigzag and variable length coding modules (420, 430, 470, 480) similar to the I-picture case. Although the decorrelation is not improved much by the DCT for the motion compensated case, the quantization is still an effective way to improve the compression rate. Thus, MPEG's compression gain arises from three fundamental principles: prediction, decorrelation, and quantization.

1.2.8 MPEG Video Decoding Process

The MPEG video decoding process, which is the exact inverse of the encoding process, is shown in FIG. 5. The decoder 500 accepts the compressed video bitstream 485 generated from MPEG video encoder 400 and produces output pictures 565 according to MPEG video syntax.

The variable length decoding and inverse zigzag scan modules (510, 520) reverse the results of the zigzag and variable length coding to reconstruct the quantized DCT coefficients. The inverse quantization and inverse 2-D DCT modules (530, 540) are exactly the same modules as those in the encoder. The motion compensation in motion compensation module 550 will only be carried out for non-intra macroblocks in P- and B-pictures.

1.3 MPEG-1 Video Standard

The MPEG-1 video standard is primarily intended for digital storage applications, such as compact disk (CD), DAT, and magnetic hard disks. It supports a continuous transfer rate up to 1.5 Mbit/s, and is targeted for non-interlaced video formats having approximately 288 lines of 352 pels and picture rates around 24 Hz to 30 Hz. The coded representation of MPEG-1 video supports normal speed forward playback, as well as special functions such as random access, fast play, fast reverse play, normal speed reverse playback, pause, and still pictures. The standard is compatible with standard 525 and 625-line television formats, and it provides flexibility for use with personal computer and workstation displays [MPEG1].

Each picture of MPEG-1 consists of three rectangular matrices of eight-bit numbers: a luminance matrix (Y) and two chrominance matrices (Cb and Cr). The Y-matrix must have an even number of rows and columns and the Cb and Cr matrices are one half the size of the Y-matrix in both horizontal and vertical dimensions.

The MPEG-1 video standard uses all the MPEG video compression concepts and techniques listed in section 1.2. The MPEG-1 video standard only defines the video bitstream, syntax and decoding specifications for the coded video bitstream, and leaves a number of issues undefined in the encoding process.

1.4 MPEG-2 Video Standard

The MPEG-2 video standard evolved from the MPEG-1 video standard and is aimed at more diverse applications such as television broadcasting, digital storage media, digital high-definition television (HDTV), and communication [MPEG2].

Additional requirements are added into the MPEG-2 video standard. It has to work across asynchronous transfer mode (ATM) networks and therefore needs improved error resilience and delay tolerance. It has to handle more programs simultaneously without requiring a common time base. It also has to be backwards compatible with the MPEG-1. Furthermore it is also targeted to code interlaced video signals, such as those used by the television industry. Much higher data transfer rates can be achieved by the MPEG-2 system.

As a continuation of the original MPEG-1 standard, MPEG-2 borrows a significant portion of its technology and terminology from MPEG-1. Both MPEG-2 and MPEG-1 use the same layer structure concepts (i.e. sequence, group, picture, slice, macroblock, block, etc.). Both of them only specify the coded bitstream syntax and decoding operation. Both of them invoke motion compensation to remove the temporal redundancies and use the DCT coding to compress the spatial information. Also, the basic definitions of I-, P- and B-pictures remain the same in both standards. However, the fixed eight bits of precision for the quantized DC coefficients defined in the MPEG-1 is extended to three choices in the MPEG-2: eight, nine and ten bits.

1.4.1 Fields, Frames and Pictures

At the higher bit-rates and picture rates that the MPEG-2 video targets, fields and interlaced video become important. The MPEG-2 video types are expanded from MPEG-1's I-, P- and B-pictures to I-field picture, I-frame picture, P-field picture, P-frame picture, B-field picture, and B-frame picture.

In an interlaced analog frame composed of two fields, the top field occurs earlier in time than the bottom field. In MPEG-2, coded frames may be composed of any adjacent pairs of fields. A coded I-frame may consist of an I-frame picture, a pair of I-field pictures, or an I-field picture followed by a P-field picture. A coded P-frame may consist of a P-frame picture or a pair of P-field pictures. A coded B-frame may consist of a B-frame picture or a pair of B-field pictures. In contrast to MPEG-1, which allows only progressive pictures, MPEG-2 allows both interlaced and progressive pictures.

1.4.2 Chrominance Sampling

Compared with MPEG-1's single chrominance sampling format, MPEG-2 defines three chrominance sampling formats. These are labeled 4:2:0, 4:2:2 and 4:4:4.

For 4:2:0 format, the chrominance is sampled 2:1 horizontally and vertically as in MPEG-1. For 4:2:2 format, the chrominance is subsampled 2:1 horizontally but not vertically. For 4:4:4 format, the chrominance has the same sampling for all three components and the decomposition into interlaced fields is the same for all three components.
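The subsampling ratios above determine the size of each chrominance matrix relative to the luminance matrix. The following is a minimal sketch of that arithmetic; the function name `chroma_plane_size` and the 720.times.576 example picture are illustrative assumptions and are not taken from the MPEG-2 specification.

```python
def chroma_plane_size(luma_width, luma_height, chroma_format):
    """Return (width, height) of one chrominance plane (Cb or Cr).

    Hypothetical helper: the divisors encode the 2:1 subsampling
    described in the text for each sampling format.
    """
    # (horizontal divisor, vertical divisor) per sampling format
    divisors = {
        "4:2:0": (2, 2),  # subsampled 2:1 horizontally and vertically
        "4:2:2": (2, 1),  # subsampled 2:1 horizontally only
        "4:4:4": (1, 1),  # same sampling as luminance
    }
    h_div, v_div = divisors[chroma_format]
    return luma_width // h_div, luma_height // v_div


if __name__ == "__main__":
    for fmt in ("4:2:0", "4:2:2", "4:4:4"):
        print(fmt, chroma_plane_size(720, 576, fmt))
```

For a 720.times.576 luminance matrix this gives chrominance matrices of 360.times.288, 360.times.576 and 720.times.576 respectively.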

1.4.3 Scalability

In order to cope with services like asynchronous transfer mode (ATM) networks and HDTV with conventional TV backward compatibility, more than one level of resolution and display quality are needed in the MPEG-2 video standard. MPEG-2 has several types of scalability enhancements that allow low-resolution or smaller images to be decoded from only part of the bitstream. MPEG-2 coded images can be assembled into several layers. The standalone base layer may use the nonscalable MPEG-1 syntax. One or two enhancement layers are then used to get to the higher resolution or quality. This generally requires fewer bits than independently compressed images at each resolution and quality, and at the same time achieves higher error resilience for network transmission.

There are four different scalability schemes in the MPEG-2 standard:

SNR scalability uses the same luminance resolution in the lower layer and a single enhancement layer. The enhancement layer contains mainly coded DCT coefficients and a small overhead. In high-error transmission environments, the base layer can be protected with good error correcting techniques, while the enhancement layer is allowed to be less resilient to errors.

Spatial scalability defines a base layer with a lower resolution and adds an enhancement layer to provide the additional resolution. In the enhancement layer, the difference between an interpolated version of the base layer and the source image is coded in order to accommodate two applications with different resolution requirements, such as conventional TV and HDTV.

Temporal scalability provides an extension to higher temporal picture rates while maintaining backward compatibility with lower-rate services. The lower temporal rate is coded by itself as the basic temporal rate. Then, additional pictures are coded using temporal prediction relative to the base layer. Some systems may decode both layers and multiplex the output to achieve the higher temporal rate.

Data partitioning splits the video bitstream into two channels. The first one contains all of the key headers, motion vectors, and low-frequency DCT coefficients. The second one carries less critical information such as high frequency DCT coefficients, possibly with less error protection.

1.4.4 Profiles and Levels

Profiles and levels provide a means of defining subsets of the syntax and semantics of MPEG-2 video specification and thereby give the decoder the information required to decode a particular bitstream. A profile is a defined sub-set of the entire MPEG-2 bitstream. A level is a defined set of constraints imposed on parameters in the bitstream.

MPEG-2 defines five distinct profiles: simple profile (SP), main profile (MP), SNR scalable profile (SNR), spatial scalable profile (SPT) and high profile (HP). Four levels are also defined in MPEG-2: low (LL), main (ML), high-1440 (H-14) and high (HL). The levels put constraints on some of the parameters in each profile, because the parameter ranges are too large to insist on compliance over the full ranges even with the four profile subsets defined in the MPEG-2 video syntax. Only some of the combinations of profiles and levels are valid. The permissible level combinations with the main profile and their parameter values are listed in Table 1.1.

TABLE 1.1 Level definitions for main profile

Level                 Parameters        Bound
High (MP@HL)          samples/line      1920
                      lines/frame       1152
                      frames/sec        60
                      luminance rate    62,668,800
                      bit-rate          80 Mbits/s
High-1440 (MP@H-14)   samples/line      1440
                      lines/frame       1152
                      frames/sec        60
                      luminance rate    47,001,600
                      bit-rate          60 Mbits/s
Main (MP@ML)          samples/line      720
                      lines/frame       576
                      frames/sec        30
                      luminance rate    10,368,000
                      bit-rate          15 Mbits/s
Low (MP@LL)           samples/line      352
                      lines/frame       288
                      frames/sec        30
                      luminance rate    3,041,280
                      bit-rate          4 Mbits/s

The permissible level/layer combinations with high profile and their parameter values are listed in Table 1.2.

TABLE 1.2 Level definitions for high profile

Level                 Parameters        Enh. layer bound   Base layer bound
High (HP@HL)          samples/line      1920               960
                      lines/frame       1152               576
                      frames/sec        60                 30
                      luminance rate    83,558,400         19,660,800
                      bit-rate          80 Mbits/s         25 Mbits/s
High-1440 (HP@H-14)   samples/line      1440               720
                      lines/frame       1152               576
                      frames/sec        60                 30
                      luminance rate    62,668,800         14,745,600
                      bit-rate          60 Mbits/s         20 Mbits/s
Main (HP@ML)          samples/line      720                352
                      lines/frame       576                288
                      frames/sec        30                 30
                      luminance rate    14,745,600         3,041,280
                      bit-rate          15 Mbits/s         4 Mbits/s

1.5 Hybrid Implementation Scheme for MPEG-2 Video System

From FIGS. 4 and 5 in section 1.2 one can see that the encoding and decoding systems for MPEG video consist of several function modules. The modules can be classified by their computational requirements:

(1) The vast majority of the operations are parallel in nature and are best suited for implementation on a parallel structured hardware component. These modules include the 2-D DCT, 2-D IDCT in both encoding/decoding processes, motion estimation, and motion compensation modules.

(2) The computations carried out are serial in nature and can only be carried out with a serial structure. These modules include zigzag scan, inverse scan, variable length coding and variable length decoding modules.

(3) The computations carried out are parallel in nature, but they can be easily carried out with serial structure without suffering much performance penalty. These modules include quantization and inverse quantization modules.

So far, many different approaches have been taken to implementing MPEG video encoding/decoding systems. Table 1.3 gives a brief summary of some MPEG-1 and MPEG-2 video codec system implementations from some major video vendors [Joa96].

TABLE 1.3 MPEG vendors and products

Vendor       Profile           Encoder included   Product
Array        MPEG-1            y                  H, S, B
C-Cube       MPEG-1, 2         y                  H, C, B, S
CompCore     MPEG-1, ML +      n                  D, S
Digital      MPEG-1            y                  S
Future Tel   MPEG-1            y                  B
GI           SP, MP, LL, ML    y                  C, B, E
HMS          MPEG-1            n                  S
Hughes       MPEG-1            n                  B
IBM          MPEG-1, MP        y                  C, B
Imedia       MP@ML             y                  H, S
LSI          MPEG-1, MP@ML     n                  C, B
Siemens      MP, SNR           y                  E
Sun          MPEG-1            y                  S, B
TI           MPEG-1            n                  S, B

Product codes are: H = hardware, S = software, B = boards, C = chips, E = products

One can see from Table 1.3 that all the MPEG-2 video codec implementations so far have been limited to main level MP@ML. For the MPEG-2 encoding process, the biggest obstacles for real-time encoding are motion estimation and 2-D DCT/IDCT. For the decoding process, the 2-D IDCT is the most computation-intensive task that every real-time decoding scheme needs to overcome. The huge amount of computation required by motion estimation and 2-D DCT/IDCT prevents the current hardware and software implementations of MPEG-2 video from moving from MP@ML to higher levels. Table 1.4 shows just how computationally intensive the 2-D IDCT is for the MPEG-2 video decoding process.

TABLE 1.4 Upper bounds for total sample rate and 8 .times. 8 IDCT rate

Profile\Level                  High          High-1440     Main         Low
Simple                                                     SP@ML
        sample rate                                        31,104,000
        8 .times. 8 IDCT/s                                 486,000
Main                           MP@HL         MP@H-14       MP@ML        MP@LL
        sample rate            188,006,400   141,004,800   31,104,000   12,165,120
        8 .times. 8 IDCT/s     2,937,600     2,223,200     486,000      190,080
SNR                                                        SNR@ML       SNR@LL
        sample rate                                        31,104,000   12,165,120
        8 .times. 8 IDCT/s                                 486,000      190,080
Spatial                                      Spt@H-14
        sample rate                          86,054,400
        8 .times. 8 IDCT/s                   1,344,600
High                           HP@HL         HP@H-14       HP@ML
        sample rate            154,828,800   116,121,600   26,680,320
        8 .times. 8 IDCT/s     2,419,200     1,814,400     416,880

From Table 1.4, it is clear that the number of 2-D IDCTs in the decoding process will increase from only 486,000 8.times.8 blocks per second for MP@ML to 2,937,600 blocks per second for MP@HL. Considering that most 8.times.8 2-D IDCT chips developed so far can carry out about 1,500,000 block transforms per second, and that the most powerful video digital signal processor (DSP) chip (TMS320C80 by TI) can only carry out 800,000 8.times.8 2-D IDCTs per second [May95], providing real-time hardware for the MPEG-2 High level remains a challenge.
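The block rates quoted above follow from the sample rates, since each 8.times.8 block holds 64 samples. The following sketch reproduces the MP@ML and MP@HL figures and compares the MP@HL requirement with the two throughput figures cited in the text. The constant and function names are illustrative assumptions.

```python
SAMPLES_PER_BLOCK = 8 * 8  # one 8x8 block carries 64 samples


def idct_blocks_per_second(sample_rate):
    """Upper bound on 8x8 IDCTs per second for a given total sample rate."""
    return sample_rate // SAMPLES_PER_BLOCK


# Sample rates taken from the main-profile row of Table 1.4.
mp_ml_rate = idct_blocks_per_second(31_104_000)
mp_hl_rate = idct_blocks_per_second(188_006_400)

# Throughput figures cited in the text (block transforms per second).
TYPICAL_IDCT_CHIP = 1_500_000
TMS320C80_DSP = 800_000

if __name__ == "__main__":
    print("MP@ML:", mp_ml_rate, "blocks/s")
    print("MP@HL:", mp_hl_rate, "blocks/s")
    print("MP@HL / typical chip:", round(mp_hl_rate / TYPICAL_IDCT_CHIP, 2))
    print("MP@HL / DSP:", round(mp_hl_rate / TMS320C80_DSP, 2))
```

Under these figures, MP@HL needs roughly twice the throughput of a typical dedicated chip and well over three times that of the cited DSP.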

In Section 2, existing 1-D DCT/IDCT and 2-D DCT/IDCT algorithms, as well as the hardware implementations of these algorithms, are reviewed. It is shown that all the existing 2-D DCT/IDCT chip implementations have made use of the separability property of the 2-D DCT/IDCT, since very simple communication interconnection can be achieved by this approach. The algorithms that require fewer multiplications through direct matrix factorization/decomposition are not necessarily suitable for hardware implementation. Instead, the regularity of design and feasibility of layout implied by the row-column method seem to be the main concern for chip implementation.

2.0 DCT/IDCT ALGORITHMS AND HARDWARE IMPLEMENTATIONS

In this section, some of the most commonly used one-dimensional and two-dimensional Discrete Cosine Transform (DCT) and Inverse Discrete Cosine Transform (IDCT) algorithms are evaluated. Detailed implementation schemes of some algorithms are also presented.

2.1 Introduction

The development of fast algorithms for the Discrete Fourier Transform (DFT) by Cooley and Tukey [CT65] in 1965 has led to phenomenal growth in its applications in digital signal processing. Similarly, the discovery of the Discrete Cosine Transform (DCT) in 1974 [ANR74] and its potential applications have caused a significant impact in audio and video signal processing. Since 1974, the DCT/IDCT have been widely used in the image and speech data analysis, recognition and compression. They have become an integral part of several standards such as JPEG, MPEG, CCITT Recommendation H.261 and other video conference protocols.

A lot of fast algorithms and hardware architectures have been introduced for one-dimensional (1-D) and two-dimensional (2-D) DCT/IDCT computation. In section 2.2, an overview of major one-dimensional DCT/IDCT algorithms is presented. In section 2.3, the focus is on two-dimensional DCT/IDCT methods and their implementations. A summary is presented in section 2.4. Some typical methods to demonstrate how the 1-D DCT or 2-D DCT computation can be simplified are discussed. These methods can also apply to the 1-D IDCT or 2-D IDCT computation in general.

2.2 1-D DCT/IDCT Algorithms and Implementations

Given N data points x(0),x(1), . . . , x(N-1), the 1-D N-point DCT and IDCT (or DCT-II and IDCT-II as defined by Wang) are defined as [Wan84]: ##EQU1##

where ##EQU2## for k=0, 1, . . . , N-1.
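As an illustration of the definition, the following sketch evaluates an N-point DCT and IDCT by direct summation. It assumes the commonly used orthonormal DCT-II form, X(k) = sqrt(2/N) C(k) SUM x(n) cos((2n+1)k.pi./2N) with C(0) = 1/sqrt(2) and C(k) = 1 otherwise; that scaling, and the names `c`, `dct_1d` and `idct_1d`, are assumptions made here rather than a transcription of Eq. (2.1) and (2.2).

```python
import math


def c(k):
    """Assumed scale factor: 1/sqrt(2) for k = 0, otherwise 1."""
    return 1.0 / math.sqrt(2.0) if k == 0 else 1.0


def dct_1d(x):
    """Direct N-point DCT-II (orthonormal scaling assumed), O(N^2) products."""
    N = len(x)
    return [
        math.sqrt(2.0 / N) * c(k)
        * sum(x[n] * math.cos((2 * n + 1) * k * math.pi / (2 * N)) for n in range(N))
        for k in range(N)
    ]


def idct_1d(X):
    """Direct N-point inverse of dct_1d under the same scaling assumption."""
    N = len(X)
    return [
        math.sqrt(2.0 / N)
        * sum(c(k) * X[k] * math.cos((2 * n + 1) * k * math.pi / (2 * N)) for k in range(N))
        for n in range(N)
    ]


if __name__ == "__main__":
    x = [1.0, 2.0, 3.0, 5.0, 8.0, 13.0, 21.0, 34.0]
    X = dct_1d(x)
    print([round(v, 4) for v in X])
    print([round(v, 4) for v in idct_1d(X)])  # recovers x up to rounding
```

Each of the N outputs is a sum of N products, which is the N.sup.2 multiplication count that the fast algorithms discussed below aim to reduce.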

Intrinsically, for N-point data sequences, both 1-D DCT and 1-D IDCT require N.sup.2 real multiplications and N(N-1) real additions/subtractions. In order to reduce the number of multiplications and additions/subtractions required, various fast algorithms have been developed for computing the 1-D DCT and 1-D IDCT. The development of efficient algorithms for the computation of DCT/IDCT began immediately after Ahmed et al. reported their work on the DCT [ANR74].

One initial approach for the computation of the DCT/IDCT is via the Fourier Cosine Transform, and its relations to the Discrete Fourier Transform (DFT) were exploited in the initial developments of its computational algorithms. The approach of computing the DCT/IDCT indirectly using the FFT has also been borrowed by other researchers to obtain fast DCT/IDCT algorithms via other kinds of discrete transforms (such as the Walsh-Hadamard Transform, the Discrete Hartley Transform, etc.).

In addition, fast DCT/IDCT algorithms can also be obtained by direct factorization of the DCT/IDCT coefficient matrices. When the components of this factorization are sparse, the decomposition represents a fast algorithm. Since the factorization is not unique, there exist a lot of different forms of fast algorithms. The factorization schemes often fall into the decimation-in-time (DIT) or the decimation-in-frequency (DIF) category [RY90].

Furthermore, there also exist other approaches to develop fast DCT/IDCT algorithms. The fast computation can be obtained through recursive computation [WC95, AZK95], planar rotations [LLM89], prime factor decomposition [YN85], filter-bank approach [Chi94] and Z-transform [SL96], etc.

2.2.1 Indirect 1-D DCT via Other Discrete Transforms

The Fourier Cosine Transform can be calculated using the Fourier Transform of an even function. Since there exist a lot of Fast Fourier Transform (FFT) algorithms, it is natural to first look at the existing FFT algorithms to compute DCT.

Let x(0),x(1), . . . , x(N-1) be a given sequence. Then an extended sequence {y(n)}, which is symmetric about the (2N-1)/2 point, can be constructed as [RY90]: ##EQU3##

Since the N-point Discrete Fourier Transform is defined as: ##EQU4##

The 2N-point sequence {y(n)} defined above can be used to calculate the 2N-point DFT as: ##EQU5##

where W.sub.2N denotes exp(-j2.pi./2N). The above formula can be easily decomposed to ##EQU6##

Multiplying both sides of Eq. (2.6) by a factor of ##EQU7##

where C(k) is defined in Eq. (2.1) and (2.2), we directly obtain the N-point DCT results as ##EQU8## for k=0, 1, . . . , N-1. Thus, the N-point DCT X(k) can easily be calculated from 2N-point DFT Y(k) by multiplying by the scale factor ##EQU9##

When the sequence {x(n)} is real, {y(n)} is real and symmetric. In this case, {Y(k)} can be obtained via two N-point FFTs rather than by a single 2N-point FFT [Sor87, RY90]. Since an N-point FFT requires N log.sub.2 N complex operations in general, the N-point DCT X(k) can be computed with 2N log.sub.2 N complex operations plus the scaling with ##EQU10##
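The relationship between the DCT and the 2N-point DFT of the symmetric extension can be checked numerically. The sketch below assumes the extension y(n) = x(n) and y(2N-1-n) = x(n) for n = 0, . . . , N-1, and the orthonormal DCT scaling sqrt(2/N) C(k); under those assumptions X(k) = sqrt(2/N) C(k) (1/2) Re[W.sub.2N.sup.k/2 Y(k)]. A plain O(M.sup.2) DFT stands in for the FFT to keep the example self-contained, and all function names are illustrative.

```python
import cmath
import math


def dft(y):
    """Plain M-point DFT; a stand-in for an FFT routine."""
    M = len(y)
    return [
        sum(y[n] * cmath.exp(-2j * math.pi * n * k / M) for n in range(M))
        for k in range(M)
    ]


def dct_via_dft(x):
    """N-point DCT obtained from the 2N-point DFT of the symmetric extension."""
    N = len(x)
    y = list(x) + list(reversed(x))  # y(n) = x(n), y(2N-1-n) = x(n)
    Y = dft(y)
    out = []
    for k in range(N):
        ck = 1.0 / math.sqrt(2.0) if k == 0 else 1.0
        phase = cmath.exp(-1j * math.pi * k / (2 * N))  # W_2N^(k/2)
        out.append(math.sqrt(2.0 / N) * ck * 0.5 * (phase * Y[k]).real)
    return out


def dct_direct(x):
    """Reference: direct summation with the same assumed scaling."""
    N = len(x)
    return [
        math.sqrt(2.0 / N) * (1.0 / math.sqrt(2.0) if k == 0 else 1.0)
        * sum(x[n] * math.cos((2 * n + 1) * k * math.pi / (2 * N)) for n in range(N))
        for k in range(N)
    ]


if __name__ == "__main__":
    x = [1.0, -2.0, 3.5, 0.25, 7.0, -1.0, 0.0, 2.0]
    print([round(v, 6) for v in dct_via_dft(x)])
    print([round(v, 6) for v in dct_direct(x)])
```

Only the first N of the 2N DFT outputs are used, and the phase factor removes the half-sample shift introduced by the symmetric extension.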

In the same spirit, the N-point DCT computation may also be calculated via other transforms such as Walsh-Hadamard Transform (WHT) [Ven88] for N.ltoreq.16 and Discrete Hartley Transform (DHT) [Mal87]. The WHT is known to be fast since the computation involves no multiplications. Thus an algorithm for DCT via WHT may well utilize this advantage. The DHT, on the other hand, is very similar to DFT. The detailed implementation of these two transforms can be found in [RY90].

2.2.2 1-D DCT via Direct Factorizations

Consider the computation of the DCT of an input sequence {x(n)} and let this sequence be represented by an (N.times.1) column vector x. Then the transformed sequence (in vector form) of the DCT computation can be expressed in vector notation as follows [RY90]: ##EQU11##

where A.sub.N is an N.times.N coefficient matrix and each element of A.sub.N is defined as: ##EQU12##

When the matrix A.sub.N is factored into sparse matrices, the number of computations is reduced.

One way to achieve a fast 1-D DCT computation by sparse matrix factorizations is as follows: Assume N is a power of 2, A.sub.N can then be decomposed in the form ##EQU13##

where A.sub.N/2 is the coefficient matrix for an N/2-point DCT; P.sub.N is a permutation matrix which permutes the even rows in increasing order in the top half and the odd rows in decreasing order in the bottom half; B.sub.N is a butterfly matrix which can be expressed in terms of the identity matrix I.sub.N/2 and the opposite identity matrix I.sub.N/2 (i.e. the elements on the opposite diagonal are equal to 1, and all others are 0) as follows: ##EQU14##

R.sub.N/2 is the remaining (N/2.times.N/2) block in the factor matrix which can be obtained by reversing the orders of both the rows and columns of an intermediate matrix R.sub.N/2, where the definition of each element of R.sub.N/2 is: ##EQU15##

The factorization of Eq. (2.10) is only partly recursive because the matrix R.sub.N/2 cannot be recursively factored. However, there is regularity in its factorization: it can be decomposed into five types of matrix factors, all of which have no more than two non-zero elements in each row [Wan83]. Only ##EQU16##

real multiplications and ##EQU17##

real additions are required by this approach.

The key to this approach is that A.sub.N is reduced in terms of A.sub.N/2. Taking a 4-point sequence as an example, the matrix A.sub.4 can be decomposed as: ##EQU18##

where ##EQU19##
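The effect of the butterfly stage can be illustrated without reproducing the exact matrices of the factorization. In the sketch below, the sums x(n)+x(N-1-n) feed an N/2-point DCT that yields the even-indexed coefficients, while the differences x(n)-x(N-1-n) feed a remaining N/2.times.N/2 block that yields the odd-indexed coefficients. Scale factors are omitted, the row ordering of the permutation matrix is not modeled, and the function names are assumptions made for this example.

```python
import math


def dct_unscaled(x):
    """N-point DCT sum without scale factors."""
    N = len(x)
    return [
        sum(x[n] * math.cos((2 * n + 1) * k * math.pi / (2 * N)) for n in range(N))
        for k in range(N)
    ]


def butterfly(x):
    """First stage: sums and differences of mirrored input samples."""
    N = len(x)
    half = N // 2
    sums = [x[n] + x[N - 1 - n] for n in range(half)]
    diffs = [x[n] - x[N - 1 - n] for n in range(half)]
    return sums, diffs


def dct_split(x):
    """N-point unscaled DCT via one butterfly stage (N even)."""
    N = len(x)
    half = N // 2
    sums, diffs = butterfly(x)
    even = dct_unscaled(sums)  # an N/2-point DCT gives X(0), X(2), ...
    odd = [  # remaining block applied to the differences gives X(1), X(3), ...
        sum(diffs[n] * math.cos((2 * n + 1) * (2 * k + 1) * math.pi / (2 * N))
            for n in range(half))
        for k in range(half)
    ]
    out = [0.0] * N
    out[0::2] = even
    out[1::2] = odd
    return out


if __name__ == "__main__":
    x = [1.0, 2.0, 3.0, 5.0]
    print([round(v, 6) for v in dct_split(x)])
    print([round(v, 6) for v in dct_unscaled(x)])
```

Only the even half recurses into a smaller DCT; the odd half does not, which mirrors the remark that the factorization is only partly recursive.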

Alternatively, some factorization schemes have adopted decimation-in-time (DIT) or decimation-in-frequency (DIF) approach, which achieve fast computation through rearranging the input sequence {x(n)} or output sequence {X(k)}, respectively.

Consider the DIT approach as an example. If the scale factors in Eq. (2.1) are left out for convenience, the transformed sequence X(k) can be expressed as: ##EQU20##

There are two steps in the DIT approach, and their objective is to reduce an N-point DCT to an N/2-point DCT by permutation of the input sample points in the time domain. The first step in the DIT algorithm consists of a rearrangement of the input sample points. The second step reduces the N-point transform to two N/2-point transforms to establish the recursive aspect of the algorithm [RY90].

Defining ##EQU21##

with x(-1)=x(N)=0 as the initial conditions for x(n).

Using the properties of the cosine functions, it is easy to see that Eq. (2.16) can be substituted into Eq. (2.15), resulting in: ##EQU22##

In Eq. (2.16), the sequence {H(k)} is obtained as a DCT with N/2 sample points and {G(k)} is obtained as a DCT with (N/2+1) sample points. Each of these smaller transforms can be further reduced, which leads to the desired recursive structure. Excluding scaling and normalization, it is found that for an N-point (N being a power of 2) sequence {x(n)}, the DIT algorithm for DCT requires ((N/2)log.sub.2 N+N/4) real multiplications and ((3N/2-1)log.sub.2 N+N/4+1) real additions [RY90].

When the rearrangement of the sample points results in the transformed sequence being grouped into even- and odd-frequency indexed portions, the decomposition is said to constitute a DIF algorithm. For an N-point (radix-2) sequence {x(n)}, the DIF algorithm for DCT requires (N/2) log.sub.2 N real multiplications and ((3N/2)log.sub.2 N-N+1) real additions [RY90].
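To make the operation counts concrete, the following sketch evaluates the DIT and DIF formulas quoted above alongside the N.sup.2 multiplications and N(N-1) additions of the direct computation. The function names are illustrative, and N is assumed to be a power of 2 with N at least 4 so that N/4 is an integer.

```python
def log2_int(N):
    """Integer log2 for N a power of 2."""
    return N.bit_length() - 1


def direct_counts(N):
    """(multiplications, additions) for direct evaluation."""
    return N * N, N * (N - 1)


def dit_counts(N):
    """(multiplications, additions) for the DIT algorithm, per the text."""
    lg = log2_int(N)
    return (N // 2) * lg + N // 4, (3 * N // 2 - 1) * lg + N // 4 + 1


def dif_counts(N):
    """(multiplications, additions) for the DIF algorithm, per the text."""
    lg = log2_int(N)
    return (N // 2) * lg, (3 * N // 2) * lg - N + 1


if __name__ == "__main__":
    for N in (4, 8, 16):
        print(N, direct_counts(N), dit_counts(N), dif_counts(N))
```

For N = 8 the formulas give 14 multiplications and 36 additions for DIT, and 12 multiplications and 29 additions for DIF, against 64 and 56 for direct evaluation.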

2.2.3 1-D DCT Based on Recursive Algorithms

In addition to the algorithms described in the previous sections, many other approaches exist. One of them, which can also achieve fast computation, is to compute the DCT/IDCT using well-known recursive algorithms. Two typical ones are shown here: the Chebyshev polynomial recurrence and Clenshaw's recurrence formula.

One fast recursive algorithm for computing the DCT based on the Chebyshev Polynomial factorization is proposed by Wang and Chen [WC95]. Recall the following trigonometric identity:

which can be proved using the Chebyshev polynomial. If one leaves out the scale factors in Eq. (2.1) for convenience (i.e. uses Eq. (2.15) as the definition of the 1-D DCT) and defines the recursive variables as ##EQU23##

the 1-D DCT can be computed using the following Chebyshev polynomial recurrence:

where X(k)=A(k,N-1), k=0, 1, . . . , N-1. Thus the X(k) can be calculated in N recursive steps from the input sequence x(n) using Eq. (2.20) and (2.21). For an N-point sequence {x(n)}, this recursive algorithm requires 2N(N-1) real multiplications and real additions.

In addition, Aburdene et al. proposed another fast recursive algorithm for computing the DCT based on the Clenshaw's (or Goertzel's as called in other papers) recurrence formula [AZK95]. The Clenshaw's recurrence formula states that considering a linear combination of the form ##EQU24##

in which F(x,n) obeys a recurrence relation

for some functions .alpha.(x,n) and .beta.(x,n), then the sum f(x) can be computed as

where {.psi.(n)} can be obtained from the following recurrence relations:

Defining

the 1-D DCT can be expressed as ##EQU25##

The calculation of F(.lambda..sub.k,n) can be made recursively using the identity

to generate the recurrence expression for F(.lambda..sub.k, n+1) as

Comparing Eq. (2.23) and (2.29), one can see that the terms .alpha.(x,n) and .beta.(x,n) in Eq. (2.23) should be chosen as 2 cos (.lambda..sub.k) and -1.

Substituting Eq. (2.24) into Eq. (2.27), we find ##EQU26##

where .psi.(n) is obtained from Eq. (2.25) as

Thus, .psi.(n) can be recursively generated from the input sequence x(n) according to Eq. (2.31). At the Nth step, X(k) can be evaluated by Eq. (2.30) for k=0, 1, . . . , N-1. For an N-point sequence {x(n)}, this recursive algorithm requires about N.sup.2 real multiplications and real additions.
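The recurrence can be sketched in code using the textbook (backward) form of Clenshaw's formula, which is not necessarily the exact arrangement of Eq. (2.30) and (2.31). The sketch assumes .lambda..sub.k = k.pi./N and F(.lambda..sub.k,n) = cos((n+1/2).lambda..sub.k), so that .alpha. = 2 cos(.lambda..sub.k) and .beta. = -1 as stated above; scale factors are omitted and the function names are illustrative.

```python
import math


def dct_coefficient_clenshaw(x, k):
    """Unscaled DCT coefficient X(k) via Clenshaw's recurrence.

    Assumes F(n) = cos((n + 1/2) * lam) with lam = k*pi/N, which obeys
    F(n+1) = 2*cos(lam)*F(n) - F(n-1).
    """
    N = len(x)
    lam = k * math.pi / N
    alpha = 2.0 * math.cos(lam)  # one multiplication by alpha per sample
    psi1 = 0.0  # psi(n+1)
    psi2 = 0.0  # psi(n+2)
    for n in range(N - 1, 0, -1):
        psi1, psi2 = alpha * psi1 - psi2 + x[n], psi1
    f0 = math.cos(0.5 * lam)
    f1 = math.cos(1.5 * lam)
    # Final combination step of the textbook formula with beta = -1.
    return -f0 * psi2 + f1 * psi1 + f0 * x[0]


def dct_coefficient_direct(x, k):
    """Reference: direct summation of the unscaled DCT."""
    N = len(x)
    return sum(x[n] * math.cos((2 * n + 1) * k * math.pi / (2 * N)) for n in range(N))


if __name__ == "__main__":
    x = [3.0, -1.0, 4.0, 1.0, -5.0, 9.0, 2.0, 6.0]
    for k in range(len(x)):
        print(k, round(dct_coefficient_clenshaw(x, k), 6),
              round(dct_coefficient_direct(x, k), 6))
```

Each coefficient costs roughly N multiplications by the constant .alpha., so computing all N coefficients takes on the order of N.sup.2 multiplications, in line with the count given above.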

2.2.4 1-D DCT/IDCT Hardware Implementations

The algorithms that compute the DCT/IDCT indirectly via other discrete transforms are normally not good candidates for hardware implementation. The conversion between the input and output data of two different transforms is generally complicated. Many transforms, like FFT and WHT, use complex architectures, which make the hardware implementations of the 1-D DCT even less efficient. The advantage of computing the 1-D DCT via DFT is that standard FFT routines and implementations are available and can be used directly for the DCT/IDCT.

TABLE 2.1 Summary of some 1-D DCT algorithms

Algorithm   1-D DCT via      Number of Multiplications   Arithmetic Types   Interconnection Complexity
[Har76]     DFT/FFT          2Nlog.sub.2 N               Complex            High
[Wan83]     Direct Factor.   Nlog.sub.2 N - 3N/2 + 4     Real               Very High
[RY90]      Factor./DIT      (N/2)log.sub.2 N + N/4      Real               High
[WC95]      Recursive        2N(N - 1)                   Real               Low
[AZK95]     Recursive        N.sup.2                     Real               Low

The algorithms that compute the DCT/IDCT via direct factorizations have the advantages that they are reasonably fast and recursive to some degree. These algorithms make full use of the sparseness of the DCT/IDCT coefficient matrix and require far fewer multiplications and additions/subtractions. But the complicated index mapping and global interconnection from the input to the output data make the hardware implementations rather difficult.

Alternatively, although the DCT/IDCT algorithms based on recursive approaches do not necessarily use fewer operations than other discrete transforms, their recursive nature makes them easy to implement with relatively simple processing elements (PEs) and simple interconnections among the PEs. Identical or similarly structured PEs in a hardware implementation can greatly reduce the cost of the design and layout process. It has been shown that time recursive algorithms and the resulting DCT/IDCT architectures are well suited for VLSI implementation.

One of the recursive schemes that can be easily adopted for the 1-D DCT hardware implementation is the Chebyshev polynomial method (described in section 2.2.3). The basic function cell to compute the 1-D DCT based on this method is shown in FIG. 6 [WC95]. For an N-point input sequence, a total of N cells are required for k=0, 1, . . . , N-1. Since these N cells have identical structure, functional design and layout cost can be reduced correspondingly.

Another example of a 1-D DCT hardware implementation using a recursive scheme is based on Clenshaw's recurrence formula (described in section 2.2.3). The hardware structure of the implementation is shown in FIG. 7 [AZK95].

2.3 2-D DCT/IDCT Algorithms and Implementations

Similar to the definitions of the 1-D DCT/IDCT, the forward and inverse 2-D Discrete Cosine Transform (2-D DCT/IDCT) of an input sequence x(m,n), 0.ltoreq.m,n<N, are defined as: ##EQU27##

where ##EQU28##

For an N.times.N point input sequence, both the 2-D DCT and 2-D IDCT require O(N.sup.4) real multiplications and corresponding additions/subtractions, assuming the computations are carried out by brute force. In order to improve the efficiency of 2-D DCT and 2-D IDCT computations, various fast computational algorithms and corresponding architectures have been proposed. In general, all of these algorithms can be broadly classified into 3 basic categories: 1) compute the 2-D DCT/IDCT indirectly via other discrete fast transforms, 2) decompose the 2-D DCT/IDCT into two 1-D DCT/IDCTs, and 3) compute the 2-D DCT/IDCT based on direct matrix factorization or decomposition.
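The O(N.sup.4) cost of the brute-force computation can be seen by counting the terms in a direct evaluation. The sketch below assumes the commonly used orthonormal 2-D form, X(k,l) = (2/N) C(k) C(l) SUM SUM x(m,n) cos((2m+1)k.pi./2N) cos((2n+1)l.pi./2N), which is an assumption rather than a transcription of Eq. (2.32); the function name and the term counter are illustrative.

```python
import math


def dct_2d_direct(block):
    """Brute-force N x N DCT; returns (coefficients, number of summed terms)."""
    N = len(block)

    def c(k):
        return 1.0 / math.sqrt(2.0) if k == 0 else 1.0

    terms = 0
    out = [[0.0] * N for _ in range(N)]
    for k in range(N):
        for l in range(N):
            acc = 0.0
            for m in range(N):
                for n in range(N):
                    acc += (block[m][n]
                            * math.cos((2 * m + 1) * k * math.pi / (2 * N))
                            * math.cos((2 * n + 1) * l * math.pi / (2 * N)))
                    terms += 1  # one term of the double sum per output
            out[k][l] = (2.0 / N) * c(k) * c(l) * acc
    return out, terms


if __name__ == "__main__":
    flat = [[1.0] * 8 for _ in range(8)]
    X, terms = dct_2d_direct(flat)
    print("terms:", terms)            # N^4 for N = 8
    print("DC coefficient:", round(X[0][0], 6))
```

For N = 8 the double sum is evaluated 64 times over 64 inputs, giving 4096 terms per block, which is the workload the three categories of fast algorithms listed above try to reduce.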

Computation of the 2-D DCT/IDCT via other discrete fast transforms takes advantage of the existence of other kinds of 2-D discrete transform algorithms and architectures. The best candidates that can be employed to perform the 2-D DCT/IDCT, for example, are the 2-D FFT and 2-D WHT [NK83, Vet85].

By contrast, the decomposition of a 2-D DCT/IDCT into two 1-D DCT/IDCTs, which conventionally is also called the Row-Column Method (RCM), evaluates the 1-D DCT/IDCT in row-column-wise or column-row-wise form. That is, it starts by processing the row (or column) elements of the input data block as a 1-D DCT/IDCT and stores the results in an intermediate memory; it then processes the transposed column (or row) elements of the intermediate results to yield the 2-D DCT/IDCT results [CW95, SL96, MW95, Jan94]. Since the RCM reduces the 2-D DCT into two separate 1-D DCTs, the existing 1-D algorithms listed in section 2.2 can be used directly so that the computational complexity can be reduced.

The direct 2-D factorization methods work directly on the 2-D data set and coefficient matrices. This kind of approach mainly concentrates on reducing the redundancy within the 2-D DCT/IDCT computations so that much fewer multiplications would be required [DG90, CL91, Lee97].

2.3.1 2-D DCT via Other Discrete Transforms

The close relationship between the DCT and the DFT can also be exploited in the two-dimensional case.

As shown by Nasrabadi and King [NK83], a rearrangement of the input matrix elements easily leads to expressions involving evaluation of two-dimensional DFTs. Leaving the scale factors out of Eq. (2.32) and Eq. (2.33), and treating x(m,n) and X(k,l) as scaled and normalized two-dimensional input and output data as ##EQU29##

Define an intermediate N.times.N transform sequence

where the 2-D DFT of y(m,n) can be calculated as: ##EQU30##

and the W.sub.N.sup.k denotes exp(-j2k.pi./N).

Furthermore, using a simple compound angle formula for the cosine functions, it is possible to derive the following similarly to Eq. (2.7) as

Eq. (2.38) above is sometimes referred to as representing "phasor-modified" DFT components. It can be further simplified as

where

Since Y(k,l) is the 2-D DFT, its implementation can be realized by using any of the available 2-D algorithms. One of the most efficient methods, proposed by Nussbaumer, is to compute the 2-D real DFT by means of polynomial transforms [Nus81]. The reduction in computational complexity is obtained by mapping the DFT on the index m to a polynomial transform. Overall, an N.times.N point DCT is mapped onto N DFTs of length N. For a real N.times.N input sequence {x(m,n)}, the 2-D DCT requires ((N.sup.2 /2-1)log.sub.2 N+N.sup.2 /3-2N-8/3) complex multiplications and ((5N.sup.2 /2)log.sub.2 N+N.sup.2 /3-6N-62/3) complex additions.

Besides, the 2-D DCT can also be carried out via the 2-D Walsh-Hadamard Transform (WHT) [Vet85].

2.3.2 2-D DCT by Row-Column Method (RCM)

Like some other discrete transforms, such as DFT, WHT, ST, HT, etc., the 2-D DCT is a separable transform. And Eq. (2.32) can also be expressed as ##EQU31##

The inner summation ##EQU32##

is an N-point 1-D DCT of the rows of x(m,n), whereas the outer summation represents the N-point 1-D DCT of the columns of the "semi-transformed" matrix, whose elements are ##EQU33##

where m,l=0, 1, . . . , N-1.

This implies that a 2-D N.times.N DCT can be implemented by N N-point DCTs along the rows of x(m,n), followed by N N-point DCTs along the columns of the results of the row transformations. The order in which the row transform and the column transform are done is immaterial.

All 1-D DCT fast algorithms discussed in section 2.2 can be used here to simplify the 2-D DCT computation, which requires a total of 2N 1-D DCTs. For example, if the 1-D DCT is carried out via the 1-D FFT, approximately 2N.times.(2N log.sub.2 N) complex operations plus the scaling are required.
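The row-column method can be sketched as two passes of a 1-D DCT separated by a transposition. The example below uses a direct 1-D DCT with the assumed orthonormal scaling sqrt(2/N) C(k) as the building block (any fast 1-D algorithm could be substituted) and checks the result against a brute-force 2-D evaluation. All function names are illustrative assumptions.

```python
import math


def c(k):
    return 1.0 / math.sqrt(2.0) if k == 0 else 1.0


def dct_1d(x):
    """Direct N-point DCT (orthonormal scaling assumed)."""
    N = len(x)
    return [
        math.sqrt(2.0 / N) * c(k)
        * sum(x[n] * math.cos((2 * n + 1) * k * math.pi / (2 * N)) for n in range(N))
        for k in range(N)
    ]


def transpose(m):
    """Models the intermediate transposition memory."""
    return [list(row) for row in zip(*m)]


def dct_2d_rcm(block):
    """2-D DCT as N row transforms, a transposition, then N column transforms."""
    semi = [dct_1d(row) for row in block]               # N 1-D DCTs on rows
    cols = [dct_1d(col) for col in transpose(semi)]     # N 1-D DCTs on columns
    return transpose(cols)                              # back to X(k,l) order


def dct_2d_direct(block):
    """Brute-force reference with the matching (2/N) C(k) C(l) scaling."""
    N = len(block)
    return [
        [
            (2.0 / N) * c(k) * c(l)
            * sum(block[m][n]
                  * math.cos((2 * m + 1) * k * math.pi / (2 * N))
                  * math.cos((2 * n + 1) * l * math.pi / (2 * N))
                  for m in range(N) for n in range(N))
            for l in range(N)
        ]
        for k in range(N)
    ]


if __name__ == "__main__":
    block = [[float((3 * m + 5 * n) % 7) for n in range(4)] for m in range(4)]
    for row in dct_2d_rcm(block):
        print([round(v, 4) for v in row])
```

With a direct 1-D DCT the two passes use 2N.times.N.sup.2 products instead of N.sup.4, and swapping the pass order (columns first, then rows) yields the same coefficients.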

2.3.3 2-D DCT Based on Direct Matrix Factorization/Decomposition

In the RCM, the computation reduction applies only to one 1-D array at a time. That makes these algorithms less efficient and not quite modular in structure. Haque reported a 2-D fast DCT algorithm based on a rearrangement of the elements of the two-dimensional input matrix into a block matrix form [Haq85]. Each block of the matrix is then calculated via a "half-size" 2-D DCT.

The N.times.N DCT block decomposition of Eq. (2.34) is based upon the following procedures: (1) Decompose the N.times.N input data x(m,n) into four (N/2).times.(N/2) sub-blocks:

The computation of the 2-D DCT based on the Haque's algorithm requires ((3/4)N.sup.2 log.sub.2 N) multiplications and (3N.sup.2 log.sub.2 N-2N.sup.2 +2N) additions [Haq85, RY90].

As an alternative, Cho and Lee proposed another approach for decomposing a 2-D DCT [CL91]. Using the following trigonometric relation ##EQU36##

the 2-D DCT in Eq. (2.34) can be rewritten as

where ##EQU37##

After some complicated data reordering and manipulations, Cho and Lee have shown that A(k,l) and B(k,l) can be expressed in terms of N 1-D DCTs, so that an N.times.N DCT can be obtained from N separate 1-D DCTs [CL91].

2.3.4 2-D DCT/IDCT Hardware Implementations

A lot of papers have been written lately on the development of VLSI and chip implementation of the 2-D DCT/IDCT. Most of the works have concentrated on the basic block size of 8.times.8, because the 8.times.8 block has been found to provide sufficient detail and localized activity of the image, such that it has been adopted as the standard 2-D DCT/IDCT size in almost all existing image and video processing and compression protocols.

TABLE 2.2 Summary of some 2-D DCT algorithms

2-D DCT Algorithm | via | Number of Multiplications | Arithmetic Types | Interconnection Complexity
[Nau81] | DFT/FFT | (N.sup.2 /2 - 1) log.sub.2 N + N.sup.2 /3 - 2N - 8/3 | Complex | High
[WC95] | RCM/Recursive | 4N.sup.2 (N - 1) | Real | Low
[CSF77] | RCM/Factorization | N3 | Real | Medium
[CL91] | Direct 2-D Factorization | N3 | Real | Very High
[Haq85] | Direct 2-D Factorization | (3/4) N.sup.2 log.sub.2 N | Real | Very High

Because of the limitation of areas and interconnections in VLSI implementation, not much of the chip development work has included the mapping of fast, two-dimensional algorithms onto silicon directly. Instead, regularity of design and feasibility of layout seem to be the primary concern, together with a realistic throughput rate for real-time applications. However, there have been attempts to map Lee's algorithm [Lee84] onto silicon [RY90]. As well, chips based on a single processor rotation [LLM89] are also being reported [RY90]. But all of them are limited to 1-D DCT/IDCT applications.

In practice, the 2-D DCT algorithms based on other discrete transforms suffer the same setbacks as their 1-D counterparts: complex arithmetic operations, complicated conversion between the two different transforms, and complex index mapping, which make hardware implementations via other discrete transforms rather difficult.

Generally speaking, the 2-D DCT algorithms based on direct matrix factorization or decomposition are much more suitable for software implementation, because they usually require fewer multiplications than other approaches and the complex index mapping involved is not a problem for software. However, the high communication complexity and global interconnection involved in these algorithms make them difficult to implement using VLSI technology.

The 2-D DCT algorithms based on the RCM approach, however, can be realized using a very simple and regular structure, since the RCM reduces the 2-D DCT into two stages of 1-D DCTs and the existing 1-D DCT algorithms listed in section 2.2 can be employed directly. The relatively simple localized interconnections of the RCM are another key feature making it suitable for VLSI implementation. The block diagram of the "row-column" transform approach to realize an N.times.N 2-D DCT is illustrated in FIG. 8.
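The two-stage structure of the row-column method can be sketched in software as follows. This is only an illustrative sketch: the orthonormal scaling of the coefficient matrix, the function names and the test input are assumptions, not taken from the specification, and the plain matrix-vector product stands in for whichever fast 1-D DCT would be used in practice.

```python
import math

def dct_matrix(n):
    """Orthonormal DCT-II matrix; the scaling is an assumption of this sketch."""
    rows = []
    for k in range(n):
        c_k = math.sqrt(0.5) if k == 0 else 1.0
        rows.append([math.sqrt(2.0 / n) * c_k *
                     math.cos((2 * m + 1) * k * math.pi / (2 * n))
                     for m in range(n)])
    return rows

def dct_1d(vec, a):
    """One 1-D DCT, written as a plain matrix-vector product."""
    return [sum(a_k[m] * vec[m] for m in range(len(vec))) for a_k in a]

def dct_2d_row_column(x):
    n = len(x)
    a = dct_matrix(n)
    # Stage 1: N column transforms; y_cols[j] is column j of Y = A X.
    y_cols = [dct_1d([x[m][j] for m in range(n)], a) for j in range(n)]
    # Intermediate transposition (the role played by the transposition memory).
    y = [list(row) for row in zip(*y_cols)]
    # Stage 2: N row transforms, giving Z = Y A^T.
    return [dct_1d(row, a) for row in y]

def dct_2d_direct(x):
    """Reference: evaluate the 2-D double sum directly."""
    n = len(x)
    a = dct_matrix(n)
    return [[sum(a[k][m] * x[m][j] * a[l][j]
                 for m in range(n) for j in range(n))
             for l in range(n)] for k in range(n)]

# Hypothetical 8x8 test block.
x = [[(3 * m + 5 * j) % 11 for j in range(8)] for m in range(8)]
z_rc = dct_2d_row_column(x)
z_ref = dct_2d_direct(x)
max_err = max(abs(z_rc[k][l] - z_ref[k][l]) for k in range(8) for l in range(8))
```

The sketch makes the 2N one-dimensional transforms and the transposition between the two stages explicit, which are the parts that map to the 1-D DCT units and the intermediate memory in a hardware realization.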

Needless to say, variations on this basic block structure are many. Some use special devices for the intermediate memory transposition operation. Some use a single 1-D DCT processor to perform both row and column transformations one-by-one in order to reduce the die size [MW95]. Others use time-recursive algorithms and architectures to achieve regular and modular structure [SL96, WC95]. Some proposed systolic array architectures for the RCM can even avoid using intermediate matrix transposition circuitry, at the extra expense of data synchronization and input sequence reordering [CW95].

It is worth noting that all the chip developments share common ground. Almost all the 2-D DCT or IDCT processors developed so far have made use of the separability property of the 2-D DCT or IDCT by decomposing it into two separate 1-D transforms. None have attempted to directly map a specific 2-D DCT or IDCT algorithm to silicon.

2.4 Summary

In this section, various conventional approaches for computing the 1-D DCT have been examined, and some of the algorithms designed to implement the 2-D DCT have also been investigated. The 1-D algorithms can be loosely classified as the DCT via other transforms, via sparse matrix factorization and via time-recursive approaches. Similarly, the 2-D algorithms can be classified as DCT via other transforms, via direct matrix factorization/decomposition and via Row-Column methods. Both the 1-D IDCT and 2-D IDCT can be computed and implemented with approaches similar to the 1-D DCT and 2-D DCT.

The most prominent property is the separability property of the 2-D DCT or IDCT, which has been exploited both in the algorithms and in the chip designs. Almost all existing 2-D DCT or IDCT processors are based on the reduction of the 2-D DCT or IDCT to lexicographically ordered 1-D transforms (i.e. RCM).

Compared with the Row-Column methods, the direct 2-D DCT or IDCT matrix factorization/decomposition is more computationally efficient and generally requires fewer multiplications. But the complex global communication interconnection of existing direct 2-D DCT or IDCT algorithms has prevented them from being implemented in VLSI chips due to design and layout concerns.

SUMMARY OF THE INVENTION

The present invention provides a method and system for computing 2-D DCT/IDCT which is easy to implement with VLSI technology to achieve high throughput to meet the requirements of high definition video processing in real time.

The present invention is based on a direct 2-D matrix factorization approach. The present invention computes the 8.times.8 DCT/IDCT through four 4.times.4 matrix multiplication sub-blocks. Each sub-block is half the size of the original 8.times.8 size and therefore requires a much lower number of multiplications. Additionally, each sub-block can be implemented independently with localized interconnection so that parallelism can be exploited and a much higher DCT/IDCT throughput can be achieved.

Further embodiments, features, and advantages of the present invention, as well as the structure and operation of the various embodiments of the present invention, are described in detail below with reference to the accompanying drawings.

BRIEF DESCRIPTION OF THE FIGURES

The accompanying drawings, which are incorporated herein and form a part of the specification, illustrate the present invention and, together with the description, further serve to explain the principles of the invention and to enable a person skilled in the pertinent art to make and use the invention.

FIG. 1 is a diagram of a typical MPEG video pictures display order.

FIG. 2 is a diagram of a typical MPEG video pictures coding order.

FIG. 3 is a diagram of a zig-zag scanning order of DCT scanning coefficients.

FIG. 4 is a diagram of an example simplified MPEG video encoding process.

FIG. 5 is a diagram of an example simplified MPEG video decoding process.

FIG. 6 is a block diagram of an example processing element cell k for a Chebyshev polynomial recurrence.

FIG. 7 is a block diagram of a recursive implementation of 1-D DCT based on Clenshaw's formula.

FIG. 8 is a block diagram of a row-column approach for performing a 2-D DCT.

FIG. 9 shows five graphs of an example 2-D DCT simulation with finite coefficient word lengths according to the present invention.

FIG. 10 shows five graphs of an example 2-D IDCT simulation with finite coefficient word lengths according to the present invention.

FIG. 11 shows five graphs of an example 2-D DCT simulation with finite truncation lengths according to the present invention.

FIG. 12 shows five graphs of an example 2-D IDCT simulation with finite truncation lengths according to the present invention.

FIG. 13 is a block diagram of DCT data flow according to an embodiment of the present invention.

FIG. 14 is a block diagram of IDCT data flow according to another embodiment of the present invention.

FIG. 15 is a diagram of an example shuffler data structure according to the present invention.

FIG. 16 is a diagram of an example EE-sub-block according to the present invention.

FIG. 17 is a diagram of example latching, multiplexing, and clipping stages for respective 2-D DCT of EE, EO, OE, and OO sub-blocks according to an embodiment of the present invention.

FIG. 18 is a diagram of an example architecture and data for a 2-D DCT according to an embodiment of the present invention.

FIG. 19 is a diagram of an example architecture and data for a 2-D IDCT according to an embodiment of the present invention.

FIG. 20 is a diagram of an example combined architecture and data for 2-D DCT and 2-D IDCT according to an embodiment of the present invention.

FIG. 21 is a flowchart of a synthesis approach according to an embodiment of the present invention.

FIG. 22 is a timing diagram illustrating waveforms of 2-D DCT input/output and handshaking signals for an example VLSI implementation according to the present invention.

FIG. 23 is a timing diagram illustrating waveforms of 2-D IDCT input/output and handshaking signals for an example VLSI implementation according to the present invention.

The present invention will now be described with reference to the accompanying drawings. In the drawings, like reference numbers indicate identical or functionally similar elements. Additionally, the left-most digit(s) of a reference number identifies the drawing in which the reference number first appears.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

Overview

As recognized by the inventor, given that the MPEG encoding/decoding process can be decomposed into parallel and serial operations, it seems natural to use some kind of hybrid scheme to implement all the functions in the MPEG-2 encoding/decoding process. One hybrid scheme approach that combines a specially designed hardware component with an ordinary DSP chip is: (1) Use a specially designed ASIC with a parallel structure to implement the 2-D DCT/IDCT and the motion estimation; and (2) Use an inexpensive DSP to implement the serial operations and provide the control structure to the ASIC.

With this hybrid approach, the combined system not only can take advantage of the powerful parallel processing abilities of the hardware components, but also possesses the flexibility of software programming to cope with different encoding/decoding parameters required. The hybrid scheme, which includes dedicated 2-D DCT/IDCT and motion estimation (for encoder only) ASIC components plus a serial structured DSP chip, might be the best feasible architecture to meet the requirements of MPEG-2 real-time encoding/decoding.

Since the 2-D DCT/IDCT computation is the fundamental element of MPEG video encoding and decoding, the development of a 2-D DCT/IDCT hardware module is a high priority. Existing systems have used a pure software approach to implement an MPEG-2 encoding/decoding system. The inventor has developed a new 2-D 8.times.8 DCT/IDCT algorithm and designed an ASIC to implement this algorithm.

A 2-D DCT/IDCT algorithm according to one embodiment of the present invention is described in Section 3. Starting with a simple matrix notation of the 2-D DCT and 2-D IDCT, it presents a detailed step-by-step description of the new algorithm. The algorithm is based on a direct 2-D matrix factorization. Compared with traditional approaches, it has better finite wordlength precision and requires fewer multiplication operations, and it possesses a regular structure and localized interconnection. Furthermore, it is shown in Section 3 that the algorithm can easily be implemented with only adders, subtractors, and adder/subtractor combinations.

Finite wordlength simulation of an embodiment of the algorithm is described in Section 4. The impacts of both coefficient quantization and truncation errors are fully investigated. It is also shown in this section that an optimal implementation scheme is achieved by combining different finite wordlengths for coefficient quantization and data truncation. In order to meet the accuracy requirements of H.261 and JPEG for both the 2-D DCT and IDCT, only a 16-bit finite internal wordlength is required by the proposed algorithm.

Section 5 presents the detailed hardware architectural structure for the new 2-D DCT/IDCT algorithm according to one example implementation of the present invention. It is shown that the new algorithm leads to a highly modular, regular and concurrent architecture using standard components such as a shuffler, adders, subtractors, accumulators, latches, and some multiplexers. The combined 2-D DCT/IDCT architecture demonstrates that all execution components are 100% sharable between the 2-D DCT and 2-D IDCT operations.

The HDL design and logic synthesis processes for an embodiment of the algorithm are demonstrated in Section 6. Using a modern synthesis-oriented ASIC design approach, the chip implementation is simulated in several stages: HDL functionality coding, RTL code simulation, logic synthesis from the verified RTL code, and gate-level pre-layout simulation. The highly automated Computer Aided Design (CAD) tools used in the simulation process are Cadence's Verilog-XL.RTM. simulation package and the Synopsys Design Compiler.RTM. synthesis package, respectively. The chip simulation shows that a throughput rate of 800 million samples per second can be achieved for both 2-D DCT and IDCT computations according to the present invention.

Finally in Section 7, contributions of the present invention and applications of this invention are discussed.

OUTLINE OF DETAILED DESCRIPTION SECTION

3.0 2-D DCT/IDCT Algorithm
3.1 Introduction
3.2 2-D 8.times.8 DCT Algorithm
3.3 2-D 8.times.8 IDCT Algorithm
3.4 Further Simplification of the 4.times.4 Matrix Multiplications
3.5 Summary
4.0 Finite Wordlength Simulations
4.1 Introduction
4.2 Coefficient Quantization Error Effects
4.2.1 2-D DCT Simulation Results
4.2.2 2-D IDCT Simulation Results
4.3 Truncation Error Effects
4.3.1 2-D DCT Simulation Results
4.3.2 2-D IDCT Simulation Results
4.4 Combined Quantization and Truncation Error Effects
4.5 Comparison with Row-Column Method
4.6 Summary
5.0 Hardware Architecture Design
5.1 Introduction
5.2 Shuffler--Addition/Subtraction Shuffling Device
5.3 Sub-block Operator--4.times.4 Matrix Multiplications Unit
5.4 Auxiliary Components for 2-D DCT or IDCT Implementations
5.5 Architectures for DCT, IDCT and Combined DCT/IDCT
5.6 Summary
6.0 HDL Design and Synthesis for an Example 2-D DCT/IDCT Algorithm
6.1 Introduction
6.2 HDL Design for Shuffler
6.2.1 HDL Design for Sub-block Operators
6.2.2 HDL Design for Auxiliary Components
6.2.3 HDL Simulation for Combined 2-D DCT/IDCT
6.3 Logic Synthesis for Example 2-D DCT/IDCT Algorithm
6.4 Summary
7.0 Conclusions
7.1 Contributions of this Invention
7.2 Other Applications

3.0 2-D DCT/IDCT ALGORITHM

In this section, a 2-D 8.times.8 DCT/IDCT algorithm according to the present invention is described. Based on a direct 2-D approach, this algorithm is not only more computationally efficient, requiring fewer multiplications than traditional approaches, but also results in a simple, regular architecture with localized communication interconnections.

3.1 Introduction

In 2-D DCT or IDCT chip development, almost all the 2-D DCT or IDCT processors developed so far have made use of the separability property of the 2-D DCT or IDCT. Although the 2-D DCT or IDCT based on the RCM approach can be realized using a very simple, regular structure with relatively low design and layout cost, there are several major drawbacks in almost all 2-D DCT/IDCT implementations based on the RCM approach [Ura96, SL92, Jan94, MW95, SL96, etc.]:

(1) Memory components are required to store the intermediate results between the 1-D row and 1-D column transform. And memory cells take a lot of silicon area to implement.

(2) Because it is relatively difficult to design a memory block with multiple read/write accesses, a serial data in and serial data out mode is adopted by most of the RCM approaches. Serial data I/O results in relatively low system throughput for 2-D DCT or IDCT operation. Generally, RCM approaches can only achieve half of the system clock rate as the system sample processing rate, since the second 1-D transform will not start until the first 1-D transform finishes and the transposed intermediate data is ready. But exceptions have been made to achieve throughputs as high as the system clock rate by using two intermediate memory buffers and transpose circuitry, such that the intermediate data are stored in the two memory buffers alternately and latency constraints on the intermediate data can be avoided [UraM]. Some have adopted different I/O clock and system clock rates to balance the I/O and data processing speed [SL96].

(3) Complex transposition hardware is required to transpose the output of the 1-D row (column) transforms into the input format of the 1-D column (row) transforms. The faster matrix transposition the system requires, the higher communication complexity it will involve.

(4) The latency of the RCM is relatively high because the 1-D row and column transform must be calculated sequentially.

(5) The separability property of the 2-D DCT or IDCT used by the RCM limits it to exploiting the optimal 1-D solution; it cannot take full advantage of the sparseness and factorization available in 2-D.

Although the direct 2-D DCT or IDCT matrix factorization is more computation efficient and generally requires a smaller number of multiplications, the major obstacle preventing this approach from being implemented in VLSI hardware is the complexity of its global communication interconnection. The present invention provides an algorithm which makes full use of the computational efficiency of a direct 2-D approach and has localized communication interconnection(s) so as to be suitable for VLSI implementation and meet the speed requirement of video applications, including real time applications.

This section describes an algorithm that achieves this goal. Direct 2-D DCT and IDCT algorithms are presented step-by-step in sections 3.2 and 3.3, respectively. In addition, the core component of these direct algorithms according to one embodiment of the present invention is characterized in detail in section 3.4. A summary is provided in section 3.5.

3.2 2-D 8.times.8 DCT Algorithm

Let X(m,n) and Z(k,l) be the N.times.N input and output sequences for 0.ltoreq.m,n<N; then the forward 2-D Discrete Cosine Transform in Eq. (2.32) can be rewritten as: ##EQU38##

where ##EQU39##

For N=8, a coefficient vector W, which is the cosine function of angles (k.pi./2N) for k=1, 2, . . . , N-1, can be defined as ##EQU40##

Meanwhile, the 2-D 8.times.8 DCT in Eq. (3.1) can be expressed in matrix notation, similar to the 1-D case in Eq. (2.8), as:

Z=AXA.sup.T (3.3)

where A.sup.T is the transpose of matrix A and the elements of A are ##EQU41##

If each a.sub.ij is replaced with the elements of coefficient vector W, the matrix A can be expressed as the function of coefficient w.sub.k for k=1, 2, . . . , N-1 as ##EQU42##

From Eq. (3.3) one can see that the 2-D DCT can be decomposed into two stages of 1-D DCT as [Ura92, Jan94, MW95]:

Y=AX, Z=YA.sup.T (3.5)

and a total of 2N.sup.3 multiplications are required to compute matrix Z by a brute force approach. Since the even rows of matrix A (i.e., A(0), A(2), A(4) and A(6)) are even-symmetric and the odd rows (i.e., A(1), A(3), A(5) and A(7)) are odd-symmetric, it is possible to facilitate the computation of the 1-D column transform Y=AX by simply switching all the even rows of matrices A and Y to the top half and all the odd rows to the bottom half, which can be carried out by multiplying by a permutation matrix P1. Two new matrices Y' and A' can be defined as

Y'=P1.multidot.Y, A'=P1.multidot.A (3.6)

where P1 is defined as ##EQU43##

By expressing matrices Y' and A' as the functions of the row vectors of the matrices Y and A, Eq. (3.6) can be further extended as ##EQU44##

where Y(k) and A(k) are the row vectors of matrices Y and A, respectively. Now, the rows at the top half of matrix A'=P1.multidot.A are even-symmetric and the rows at the bottom half are odd-symmetric. Thus, the matrix multiplication of Y'=A'X can be calculated through two 4.times.8 matrices as ##EQU45##

where X(k) is the row vector of matrix X.

It can be seen that one 4.times.4 coefficient matrix in Eq. (3.9) only includes the even coefficients of vector W (i.e. w.sub.2, w.sub.4, w.sub.6) and the other only includes the odd coefficients (i.e. w.sub.1, w.sub.3, w.sub.5, w.sub.7). In fact, they can be defined as two new 4.times.4 coefficient matrices E and O [Jan94, MW95]. By means of matrix notation, E and O can also be computed directly from the coefficient matrix A by the following matrix operations as: ##EQU46##

where P2 and P3 are defined in Eq. (3.7) as the top and bottom blocks of matrix P1, and P4 is a new permutation matrix that takes the first four columns of a 4.times.8 matrix to form a 4.times.4 one. Mathematically, the matrix P4 can be defined as ##EQU47##

where the matrix I.sub.4 is a 4.times.4 identity matrix, and the matrix N.sub.4 is a 4.times.4 null (zero) matrix.
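The even/odd folding of the column transform described above can be checked numerically with a short sketch. Several things here are assumptions made for illustration: the orthonormal scaling of A, the pairing of rows X(i) with X(7-i), and all function and variable names. E and O are taken as the first four columns of the even and odd rows of A, following the description of P2, P3 and P4.

```python
import math

def dct_matrix(n=8):
    """Orthonormal DCT-II matrix; the scaling is an assumption of this sketch."""
    rows = []
    for k in range(n):
        c_k = math.sqrt(0.5) if k == 0 else 1.0
        rows.append([math.sqrt(2.0 / n) * c_k *
                     math.cos((2 * m + 1) * k * math.pi / (2 * n))
                     for m in range(n)])
    return rows

def matmul(p, q):
    return [[sum(p[i][k] * q[k][j] for k in range(len(q)))
             for j in range(len(q[0]))] for i in range(len(p))]

A = dct_matrix()
# E = P2.A.P4 and O = P3.A.P4: even / odd rows of A, first four columns only.
E = [A[k][:4] for k in (0, 2, 4, 6)]
O = [A[k][:4] for k in (1, 3, 5, 7)]

def column_transform_folded(x):
    """Compute Y = A X using two 4x4 coefficient matrices instead of one 8x8."""
    cols = len(x[0])
    # Sums and differences of mirrored input rows (assumed pairing i <-> 7 - i).
    x_plus = [[x[i][j] + x[7 - i][j] for j in range(cols)] for i in range(4)]
    x_minus = [[x[i][j] - x[7 - i][j] for j in range(cols)] for i in range(4)]
    y_even = matmul(E, x_plus)    # rows Y(0), Y(2), Y(4), Y(6)
    y_odd = matmul(O, x_minus)    # rows Y(1), Y(3), Y(5), Y(7)
    y = [None] * 8
    for i in range(4):
        y[2 * i] = y_even[i]
        y[2 * i + 1] = y_odd[i]
    return y

# Hypothetical 8x8 test block.
x = [[(7 * m + 3 * n) % 13 for n in range(8)] for m in range(8)]
y_fold = column_transform_folded(x)
y_ref = matmul(A, x)
max_err = max(abs(y_fold[k][n] - y_ref[k][n]) for k in range(8) for n in range(8))
```

In this form the column stage needs two 4.times.4 by 4.times.8 products (256 multiplications) rather than one 8.times.8 by 8.times.8 product (512 multiplications), at the cost of the preceding additions and subtractions.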

In addition, the matrices [X(i)+X(j)] and [X(i)-X(j)] in Eq. (3.8) and Eq. (3.9) can be defined as two separate 4.times.8 matrices X.sub.+ and X.sub.-, each of which is further split into its left and right 4.times.4 blocks, as: ##EQU48##

After substituting the X.sub.+, X.sub.-, X.sub.+l, X.sub.+r, X.sub.-l and X.sub.-r in Eq. (3.12) and the E, O in Eq. (3.10) into Eq. (3.9), the 1-D column transform Y=AX can be calculated by first calculating its permutation Y'. The substitution can be carried out as follows: ##EQU49##

Furthermore, if the input matrix X is decomposed into four 4.times.4 sub-matrices as ##EQU50##

Using the Eq. (3.12), the matrices X.sub.+l, X.sub.+r, X.sub.-l and X.sub.-r can also be expressed as the functions of matrices X1, X2, X3 and X4 as

where I.sub.4 here denotes a 4.times.4 opposite identity matrix (ones on the anti-diagonal), defined as ##EQU51##
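The role of the opposite identity matrix can be illustrated with a small integer example. The sketch below assumes that X1, X2, X3 and X4 are the top-left, top-right, bottom-left and bottom-right 4.times.4 blocks of X, and that X.sub.+ and X.sub.- pair row i with row 7-i; both are assumptions consistent with the surrounding text rather than a transcription of the omitted equations, and the helper names are hypothetical.

```python
def exchange_matrix(n=4):
    """Opposite identity matrix: ones on the anti-diagonal, zeros elsewhere."""
    return [[1 if j == n - 1 - i else 0 for j in range(n)] for i in range(n)]

def matmul(p, q):
    return [[sum(p[i][k] * q[k][j] for k in range(len(q)))
             for j in range(len(q[0]))] for i in range(len(p))]

def block(x, r0, c0):
    """Extract the 4x4 block of x whose top-left corner is (r0, c0)."""
    return [row[c0:c0 + 4] for row in x[r0:r0 + 4]]

def add(p, q):
    return [[a + b for a, b in zip(rp, rq)] for rp, rq in zip(p, q)]

def sub(p, q):
    return [[a - b for a, b in zip(rp, rq)] for rp, rq in zip(p, q)]

# Hypothetical integer input block so that all comparisons are exact.
x = [[8 * m + n for n in range(8)] for m in range(8)]
x1, x2 = block(x, 0, 0), block(x, 0, 4)
x3, x4 = block(x, 4, 0), block(x, 4, 4)
j4 = exchange_matrix()

# 4x8 matrices of row sums and differences, then their left/right halves.
x_plus = [[x[i][n] + x[7 - i][n] for n in range(8)] for i in range(4)]
x_minus = [[x[i][n] - x[7 - i][n] for n in range(8)] for i in range(4)]
x_plus_left = [row[:4] for row in x_plus]
x_plus_right = [row[4:] for row in x_plus]
x_minus_left = [row[:4] for row in x_minus]
x_minus_right = [row[4:] for row in x_minus]

# Multiplying by the opposite identity on the left reverses the row order,
# which is what lines up row 7 - i of X with row i.
checks = [
    x_plus_left == add(x1, matmul(j4, x3)),
    x_plus_right == add(x2, matmul(j4, x4)),
    x_minus_left == sub(x1, matmul(j4, x3)),
    x_minus_right == sub(x2, matmul(j4, x4)),
]
```

Because the opposite identity only reorders rows, forming these four blocks costs additions and subtractions but no multiplications.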

Since the X.sub.+l, X.sub.+r, X.sub.-l and X.sub.-r have been expressed as the functions of matrices X1, X2, X3 and X4 in Eq. (3.15), by substituting them into Eq. (3.13), the first stage of 1-D column transform Y=AX can be expressed directly as the function of input matrix X as ##EQU52##

where matrices P1, E and O are defined in Eq. (3.7) and (3.10), respectively.

Similar mathematical manipulations can be applied to the second stage of 1-D row transform Z=YA.sup.T, too. By switching the row vectors of matrix Z, a new matrix Z' can be formed as the function of the row vectors of matrix Z as ##EQU53##

Taking the transpose of both sides of Eq. (3.18), the transposition of matrix Z' can be expressed as

(Z').sup.T =A(Y').sup.T (3.19)

The matrix multiplication (Z').sup.T =A(Y').sup.T can be calculated in the same fashion as calculating Y=AX since they are the same matrix multiplication in essence (i.e. matrix A multiplied by another matrix).

In order to be able to use Eq. (3.17) to compute (Z').sup.T =A(Y').sup.T, (Y').sup.T should be decomposed into four 4.times.4 sub-matrices as was done for the matrix X. By taking the transpose of both sides of the first equation in Eq. (3.17), one can decompose (Y').sup.T as ##EQU54##

by replacing the X1, X2, X3 and X4 with (X1+I.sub.4 X3).sup.T E.sup.T, (X1-I.sub.4 X3).sup.T O.sup.T, (X2+I.sub.4 X4).sup.T E.sup.T and (X2-I.sub.4 X4).sup.T O.sup.T in the second equation in Eq. (3.17), the matrix multiplication (Z').sup.T =A(Y').sup.T can be computed as ##EQU55##

Consequently, one can compute the 2-D DCT result Z by first solving the matrix (Z').sup.T through Eq. (3.21). Let's define four new 4.times.4 matrices X.sub.++, X.sub.-+, X.sub.+- and X.sub.-- directly from input matrix X as

such that the Eq. (3.21) can be rewritten as ##EQU56##

By decomposing (P1)(Z').sup.T into four 4.times.4 matrices as its top-left, top-right, bottom-left and bottom-right blocks as ##EQU57##

one can finally compute the elements of the 2-D DCT Z=AXA.sup.T through Eq. (3.23) as ##EQU58##

Without the present invention, one would need to compute the matrix product of three 8.times.8 matrices in Z=AXA.sup.T. By using Eq. (3.25), the result matrix Z is decomposed into four 4.times.4 matrices, each of which can be calculated as the matrix product of three 4.times.4 matrices, a "half-size" operation compared with the original 8.times.8 one. The total number of multiplications is reduced from 2N.sup.3 to N.sup.3, since each of the four 4.times.4 matrix products requires 2(N/2).sup.3 multiplications when computed by brute force.
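The four-sub-block structure can be checked numerically with the sketch below. It is not a transcription of Eq. (3.22) through (3.25): the four shuffled matrices are derived here directly from the symmetry of A, and their names and sign conventions, along with the orthonormal scaling of A, are assumptions that may differ from the patent's X.sub.++, X.sub.-+, X.sub.+- and X.sub.-- term for term.

```python
import math

def dct_matrix(n=8):
    """Orthonormal DCT-II matrix; the scaling is an assumption of this sketch."""
    rows = []
    for k in range(n):
        c_k = math.sqrt(0.5) if k == 0 else 1.0
        rows.append([math.sqrt(2.0 / n) * c_k *
                     math.cos((2 * m + 1) * k * math.pi / (2 * n))
                     for m in range(n)])
    return rows

def matmul(p, q):
    return [[sum(p[i][k] * q[k][j] for k in range(len(q)))
             for j in range(len(q[0]))] for i in range(len(p))]

def transpose(p):
    return [list(r) for r in zip(*p)]

A = dct_matrix()
E = [A[k][:4] for k in (0, 2, 4, 6)]   # even rows, first four columns
O = [A[k][:4] for k in (1, 3, 5, 7)]   # odd rows, first four columns

def shuffle(x):
    """Four 4x4 matrices of sums/differences of mirrored input elements."""
    def combo(sr, sc):
        # sr / sc: sign applied when mirroring the row / column index.
        return [[x[m][n] + sr * x[7 - m][n] + sc * x[m][7 - n]
                 + sr * sc * x[7 - m][7 - n] for n in range(4)] for m in range(4)]
    return combo(1, 1), combo(-1, 1), combo(1, -1), combo(-1, -1)

def dct_2d_subblocks(x):
    """2-D DCT through four independent products of three 4x4 matrices."""
    u_pp, u_mp, u_pm, u_mm = shuffle(x)
    z_ee = matmul(matmul(E, u_pp), transpose(E))   # Z[2i][2j]
    z_eo = matmul(matmul(E, u_pm), transpose(O))   # Z[2i][2j+1]
    z_oe = matmul(matmul(O, u_mp), transpose(E))   # Z[2i+1][2j]
    z_oo = matmul(matmul(O, u_mm), transpose(O))   # Z[2i+1][2j+1]
    z = [[0.0] * 8 for _ in range(8)]
    for i in range(4):
        for j in range(4):
            z[2 * i][2 * j] = z_ee[i][j]
            z[2 * i][2 * j + 1] = z_eo[i][j]
            z[2 * i + 1][2 * j] = z_oe[i][j]
            z[2 * i + 1][2 * j + 1] = z_oo[i][j]
    return z

# Hypothetical 8x8 test block, compared against the brute-force Z = A X A^T.
x = [[(5 * m * m + 3 * n + m * n) % 17 for n in range(8)] for m in range(8)]
z_sub = dct_2d_subblocks(x)
z_ref = matmul(matmul(A, x), transpose(A))
max_err = max(abs(z_sub[k][l] - z_ref[k][l]) for k in range(8) for l in range(8))

brute_force_mults = 2 * 8 ** 3        # three 8x8 matrices: 1024
subblock_mults = 4 * 2 * 4 ** 3       # four products of three 4x4 matrices: 512
```

Each of the four products reads only its own shuffled 4.times.4 matrix, so in this sketch the sub-blocks share no intermediate data once the shuffle is done.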

3.3 2-D 8.times.8 IDCT Algorithm

Similar to the 2-D DCT, one can also decompose the 2-D IDCT into a much simpler form, and thus reduce the total amount of computation of the 2-D IDCT.

Let's define Z(k,l) and X(m,n) as 2-D IDCT input and output matrices, respectively. The 2-D IDCT definition in Eq. (2.33) can then be rewritten as: ##EQU59##

For N=8, the definition of the coefficient vector W in Eq. (3.2) and the coefficient matrix A in Eq. (3.4) can also be used in the computation of the 2-D IDCT. And Eq. (3.26) can be expressed with the 2-D IDCT matrix notation as

In order to have input and output matrix notations consistent with the 2-D DCT, X represents a 2-D IDCT input matrix and Z represents a 2-D IDCT output matrix in the rest of this section. In this way, the 2-D IDCT matrix expression in Eq. (3.27) is rewritten as

Z=A.sup.T XA (3.28)

The 2-D IDCT in Eq. (3.28) can also be decomposed into two stages of 1-D IDCT as [Ura92, MW95]:

Y=A.sup.T X, Z=YA (3.29)

Note that the notation Y is reused to express the result of 1-D IDCT column transform, which is different from the Y in Eq. (3.15) as the result of 1-D DCT column transform.

Because of the symmetric characteristics of the coefficient matrix A, the first stage of 1-D IDCT Y=A.sup.T X can be computed through a permuted matrix ##EQU60##

to reverse the orders of the rows in the bottom half of matrices Y and A.sup.T, where the permutation matrix P5 is defined as ##EQU61##

Thus, the matrix multiplication of Y"=(P5.multidot.A.sup.T)X can be calculated through two 4.times.8 matrices as the functions of the row vectors of matrix X as [Ura92,MW95]: ##EQU62##

where the matrices E and O are the same ones defined in Eq. (3.10).

By defining two 4.times.8 matrices X.sub.e and X.sub.o as the even rows (i.e. X(0), X(2), X(4) and X(6)) and the odd rows (i.e. X(1), X(3), X(5) and X(7)) of matrix X as: ##EQU63##

where matrices P2 and P3 are defined in Eq. (3.7), the 1-D IDCT column transform Y=A.sup.T X can be computed through Eq. (3.30), (3.32) and (3.33) as ##EQU64##

Defining another permutation matrix P6 as ##EQU65##

then the matrices X.sub.e and X.sub.o can be separated into their left and right blocks as

such that the Eq. (3.33) can be rewritten as ##EQU66##

And the 8.times.8 matrix multiplication Y"=(P5.multidot.A.sup.T)X has been simplified as several "half-size" matrix multiplications.

The second stage of 1-D IDCT can be evaluated in a similar way. After multiplying both sides of Z=YA by the permutation matrix P5, a new permuted matrix Z" can be defined as

Thus, the transposition of matrix Z" can be expressed as

which happens to have the same format as the first stage of 1-D IDCT Y=A.sup.T X does. And the matrix (Y").sup.T can be computed directly by transposing Y" in Eq. (3.37) as ##EQU67##

Furthermore, by separating (Y").sup.T into four 4.times.4 sub-matrices (even-left, even-right, odd-left and odd-right) as done for matrix X in Eq. (3.36), the result of Eq. (3.37) can be directly used to express the matrix (Z").sup.T as ##EQU68##

In fact, the even-left, even-right, odd-left and odd-right four 4.times.4 sub-matrices of (Y").sup.T can be computed by replacing the matrix X with the (Y").sup.T in Eq. (3.36) as ##EQU69##

In order to resolve Eq. (3.42) as a series of 4.times.4 matrix operations, four new 4.times.4 matrices P7, P8, P9 and P10 can be defined as

By replacing P2 and P3 with P7, P8, P9 and P10, the Eq. (3.42) can be further expressed as ##EQU70##

By defining four 4.times.4 matrices in the following way as ##EQU71##

Eq. (3.44) can be rewritten as

Further substituting Eq. (3.46) into Eq. (3.41) yields ##EQU72##

By decomposing (P5)(Z").sup.T into four 4.times.4 matrices as its top-left, top-right, bottom-left and bottom-right blocks as ##EQU73##

one can finally compute each element of the 2-D IDCT Z=A.sup.T XA through Eq. (3.47) as ##EQU74##

Similar to the 2-D DCT, the 2-D IDCT Z=A.sup.T XA is also simplified from three 8.times.8 matrix multiplications to four 4.times.4 matrix products E.sup.T X.sub.ee E, E.sup.T X.sub.oe O, O.sup.T X.sub.eo E and O.sup.T X.sub.oo O, each of which can be calculated as the matrix product of three 4.times.4 matrices, a "half-size" operation compared with the original 8.times.8 one. Since the matrices P7, P8, P9 and P10 used in Eq. (3.45) are all pure permutation matrices, no extra multiplication is introduced for computing X.sub.ee, X.sub.oe, X.sub.eo and X.sub.oo.
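The corresponding IDCT structure can be sketched in the same spirit. In this sketch the input is split by row and column parity, four independent products of three 4.times.4 matrices are formed, and the outputs are recovered by additions and subtractions. The subscript order used in the code (row parity first, then column parity), the orthonormal scaling of A, and all names are assumptions and may not match the patent's X.sub.ee, X.sub.oe, X.sub.eo and X.sub.oo exactly.

```python
import math

def dct_matrix(n=8):
    """Orthonormal DCT-II matrix; the scaling is an assumption of this sketch."""
    rows = []
    for k in range(n):
        c_k = math.sqrt(0.5) if k == 0 else 1.0
        rows.append([math.sqrt(2.0 / n) * c_k *
                     math.cos((2 * m + 1) * k * math.pi / (2 * n))
                     for m in range(n)])
    return rows

def matmul(p, q):
    return [[sum(p[i][k] * q[k][j] for k in range(len(q)))
             for j in range(len(q[0]))] for i in range(len(p))]

def transpose(p):
    return [list(r) for r in zip(*p)]

A = dct_matrix()
E = [A[k][:4] for k in (0, 2, 4, 6)]
O = [A[k][:4] for k in (1, 3, 5, 7)]

def idct_2d_subblocks(x):
    """2-D IDCT Z = A^T X A through four products of three 4x4 matrices."""
    # Pure selections of input elements (no arithmetic): row parity, column parity.
    x_ee = [[x[2 * i][2 * j] for j in range(4)] for i in range(4)]
    x_eo = [[x[2 * i][2 * j + 1] for j in range(4)] for i in range(4)]
    x_oe = [[x[2 * i + 1][2 * j] for j in range(4)] for i in range(4)]
    x_oo = [[x[2 * i + 1][2 * j + 1] for j in range(4)] for i in range(4)]
    e_t, o_t = transpose(E), transpose(O)
    t_ee = matmul(matmul(e_t, x_ee), E)
    t_eo = matmul(matmul(e_t, x_eo), O)
    t_oe = matmul(matmul(o_t, x_oe), E)
    t_oo = matmul(matmul(o_t, x_oo), O)
    z = [[0.0] * 8 for _ in range(8)]
    for m in range(4):
        for n in range(4):
            ee, eo, oe, oo = t_ee[m][n], t_eo[m][n], t_oe[m][n], t_oo[m][n]
            # Mirroring a row index flips terms fed by odd input rows;
            # mirroring a column index flips terms fed by odd input columns.
            z[m][n] = ee + eo + oe + oo
            z[m][7 - n] = ee - eo + oe - oo
            z[7 - m][n] = ee + eo - oe - oo
            z[7 - m][7 - n] = ee - eo - oe + oo
    return z

# Hypothetical round trip: forward DCT by brute force, inverse by sub-blocks.
x0 = [[(9 * m + 4 * n + m * n) % 23 for n in range(8)] for m in range(8)]
coeffs = matmul(matmul(A, x0), transpose(A))
rec = idct_2d_subblocks(coeffs)
ref = matmul(matmul(transpose(A), coeffs), A)
err_roundtrip = max(abs(rec[m][n] - x0[m][n]) for m in range(8) for n in range(8))
err_direct = max(abs(rec[m][n] - ref[m][n]) for m in range(8) for n in range(8))
```

Note that in this arrangement the additions and subtractions come after the four matrix products, whereas in the forward transform they come before them.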

Generally speaking, each element of matrix Z can be calculated with the same general formula as ##EQU75##

where

3.4 Further Simplification of the 4.times.4 Matrix Multiplications

Eq. (3.25) and Eq. (3.49) show that the core components of the 2-D DCT and 2-D IDCT algorithms are four matrix products, where each of them consists of the matrix multiplications of three 4.times.4 matrices.

Generalizing the product of three 4.times.4 matrices, such as EX.sub.++ E.sup.T, EX.sub.-+ O.sup.T, OX.sub.+- E.sup.T, OX.sub.-- O.sup.T, E.sup.T X.sub.ee E, E.sup.T X.sub.oe O, O.sup.T X.sub.eo E and O.sup.T X.sub.oo O, as:

V.sub.4.times.4 =B.sub.4.times.4 U.sub.4.times.4 C.sub.4.times.4.sup.T (3.51)

and the function unit to implement B.sub.4.times.4 U.sub.4.times.4 C.sub.4.times.4.sup.T is considered as a sub-block unit. For V=[v.sub.ij ], B=[b.sub.ij ], U=[u.sub.ij ] and C=[c.sub.ij ], the multiplications of the three 4.times.4 matrices can be carried out by switching the order of the .SIGMA. and combining the b.sub.ik and c.sub.jl together such that each element of matrix V can be determined as: ##EQU76##

From this equation, one can see that each v.sub.ij, 0.ltoreq.i, j.ltoreq.3, is expressed as a sum of products of u.sub.kl.multidot.(b.sub.ik c.sub.jl) for 0.ltoreq.k, l.ltoreq.3, where u.sub.kl is a function of the input sequence X (see Eq. (3.22) and (3.45) above), and b.sub.ik c.sub.jl is a function of the coefficient matrix A (see Eq. (3.10) above) and can be pre-calculated as one of the .+-.w.sub.m w.sub.n, 1.ltoreq.m,n.ltoreq.7. Since the index of w.sub.m defined in Eq. (3.2) takes 7 possible values, there are a total of 28 different combinations (7.multidot.8/2 unordered pairs) for the pre-calculated constants w.sub.m w.sub.n, 1.ltoreq.m,n.ltoreq.7.
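The reordered summation can be written out as a small sketch in which the products b.sub.ik c.sub.jl are tabulated once and each output is then a sum of input elements times fixed constants. The function name `sub_block`, the orthonormal scaling of A, and the test input are assumptions made for illustration.

```python
import math
from itertools import combinations_with_replacement

def dct_matrix(n=8):
    """Orthonormal DCT-II matrix; the scaling is an assumption of this sketch."""
    rows = []
    for k in range(n):
        c_k = math.sqrt(0.5) if k == 0 else 1.0
        rows.append([math.sqrt(2.0 / n) * c_k *
                     math.cos((2 * m + 1) * k * math.pi / (2 * n))
                     for m in range(n)])
    return rows

A = dct_matrix()
E = [A[k][:4] for k in (0, 2, 4, 6)]
O = [A[k][:4] for k in (1, 3, 5, 7)]

def sub_block(b, u, c):
    """V = B U C^T computed as v_ij = sum_k sum_l u_kl * (b_ik * c_jl)."""
    # The constants depend only on the coefficient matrices, so they can be
    # tabulated ahead of time; only u varies from block to block.
    const = {(i, j, k, l): b[i][k] * c[j][l]
             for i in range(4) for j in range(4)
             for k in range(4) for l in range(4)}
    return [[sum(u[k][l] * const[i, j, k, l]
                 for k in range(4) for l in range(4))
             for j in range(4)] for i in range(4)]

# Unordered index pairs (m, n) with 1 <= m <= n <= 7.
pairs = list(combinations_with_replacement(range(1, 8), 2))
w = [math.cos(k * math.pi / 16) for k in range(8)]   # w[k] = cos(k*pi/16)

# Hypothetical input, compared against the plain triple product (E U) O^T.
u = [[4 * k + l for l in range(4)] for k in range(4)]
v = sub_block(E, u, O)
eu = [[sum(E[i][k] * u[k][l] for k in range(4)) for l in range(4)] for i in range(4)]
ref = [[sum(eu[i][l] * O[j][l] for l in range(4)) for j in range(4)] for i in range(4)]
max_err = max(abs(v[i][j] - ref[i][j]) for i in range(4) for j in range(4))
```

Here `len(pairs)` gives the count of 28 unordered index pairs. Under the scaling assumed in this sketch, every tabulated constant has magnitude (1/4)w.sub.m w.sub.n for one of those pairs.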

Since b.sub.ik c.sub.jl is a pre-calculated constant, the multiplication u.sub.kl.multidot.(b.sub.ik c.sub.jl) falls into the pattern x.multidot.d, which is a variable multiplied by a constant rather than a multiplication between two variables such as x.multidot.y. The multiplication between a variable and a constant can be implemented very easily by a group of hardwired adders, with no need for a real multiplier.
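A software analogue of multiplying by a fixed constant with shifts and additions is sketched below. The 12-bit fractional wordlength, the function names and the simple binary expansion of the constant are assumptions chosen for illustration; they are not taken from the hardware design described later.

```python
def to_fixed(value, frac_bits=12):
    """Quantize a real constant to an integer with frac_bits fractional bits."""
    return int(round(value * (1 << frac_bits)))

def shift_add_multiply(x, const_fixed):
    """Multiply integer x by an integer constant using only shifts and adds."""
    sign = -1 if const_fixed < 0 else 1
    c = abs(const_fixed)
    acc = 0
    shift = 0
    while c:
        if c & 1:
            # Each set bit of the constant contributes one shifted copy of x;
            # in hardware this would be a fixed wiring into an adder.
            acc += x << shift
        c >>= 1
        shift += 1
    return sign * acc

def multiply_by_constant(x, value, frac_bits=12):
    """Approximate x * value, truncating the fractional bits of the result."""
    return shift_add_multiply(x, to_fixed(value, frac_bits)) >> frac_bits

half = multiply_by_constant(100, 0.5)   # 0.5 quantizes exactly to 2048 / 4096
```

Because the constant is known in advance, the set of shifts is fixed, so the loop above collapses into a fixed arrangement of adders, one per set bit of the quantized constant in this simple scheme.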

Further reviewing of multiplication u.sub.kl.multidot.(b.sub.ik c.sub.jl) as a basic processing element (PE) of the proposed 2-D DCT/IDCT algorithm shows that because the computation of u.sub.kl.multidot.(b.sub.ik c.sub.jl) is exactly the same as computing x.multidot.d, this algorithm only suffers one coefficient quantization loss and one computation truncation loss when the X.sub.xx are used to directly compute the final 2-D DCT/IDCT output results in Eq. (3.52) (given there is no truncation loss for all additions and subtractions in Eq. (3.49) for 2-D IDCT). In contrast, the row-column decomposition method suffers at least two coefficient quantization losses and two computation truncation losses--one occurs when computing the 1-D column transform and another one occurs when computing the 1-D row transform, and it is prone to both accumulated errors and error propagation from the first 1-D to the second 1-D transform. Thus, much higher computation accuracy can be achieved by the proposed algorithm given that the same finite wordlength is adopted by both approaches.

Since each 4.times.4 sub-block can be implemented without multipliers, the proposed algorithms can be implemented with only adders and subtractors as the basic processing elements (PE). This results in a great reduction of the complexity and design cost of hardware implementations.

Besides, each 4.times.4 sub-block is totally independent of the other sub-blocks in the proposed 2-D DCT and 2-D IDCT algorithm. There are no communication interconnections among the four sub-blocks in either the 2-D DCT or 2-D IDCT computation, which means that the drawback of complex global communication interconnection associated with other existing direct 2-D DCT and 2-D IDCT approaches has been overcome, reducing the routing complexity of hardware implementations.

Another advantage brought by the localized interconnection is that a parallel architecture can be adopted to implement the four 4.times.4 sub-blocks independently. A parallel-data-in, parallel-data-out I/O scheme will guarantee that the system throughput can meet the requirements of current and future video applications.

3.5 Summary

In this section, an algorithm to compute 2-D 8.times.8 DCT/IDCT according to the present invention has been presented. Based on a direct 2-D coefficient matrix factorization approach, the 8.times.8 DCT/IDCT can be calculated through four 4.times.4 sub-blocks, which are only "half-size" of the original one.

Further simplification of the core component, the 4.times.4 sub-block, shows that this direct 2-D approach not only is computationally more efficient and requires a smaller number of multiplications, but also has localized interconnection and can be easily implemented with a parallel structure to accommodate four independent sub-blocks.

Moreover, each multiplication in this scheme has been confined to a variable multiplied by a constant instead of a general multiplication of two variables. Every multiplication operation can be very easily fulfilled by a group of hardwired adders, and the whole 2-D DCT/IDCT computation can be carried out by using only adders and subtractors. The higher computation accuracy of this scheme means that a shorter finite internal wordlength can be used in the hardware implementation of the algorithm while the same accuracy requirements for both 2-D DCT/IDCT can still be met. A shorter internal finite wordlength means that fewer registers and less complicated circuitry are required for the hardware implementation.

The simplified processing elements (just adders and subtractors, no multiplier is required), paralleled sub-block structure, localized interconnection and shorter finite internal wordlength associated with the proposed 2-D DCT/IDCT algorithm demonstrate that the proposed algorithm is a perfect candidate for VLSI implementation.

4.0 FINITE WORDLENGTH SIMULATIONS

In this section, finite wordlength simulations of a 2-D DCT/IDCT algorithm according to an embodiment of the present invention are carried out. The simulation results show that the algorithm can meet the JPEG 2-D IDCT specification with only a 16-bit finite internal wordlength for the arithmetic operations, which means that all additions, subtractions and multiplications required in this algorithm use no more than 16 bits.

4.1 Introduction

In the hardware implementation of any algorithm, there are tight trade-offs among various quantities like accuracy, speed and chip area, etc. For a 2-D DCT or IDCT algorithm, it is coefficient quantization and finite wordlength truncation that are the two major factors which decide the accuracy, speed and chip area.

To represent any cosine coefficient cos(i.pi./16), i=0, 1, . . . , 15, with finite wordlength introduces coefficient quantization (or coefficient approximation) errors. Furthermore, the implementation of any arithmetic operation with finite internal wordlength arithmetic (due to fixed register length) introduces truncation (or rounding) errors. To minimize the effects of quantization errors, more bits are needed to approximate the cosine coefficients cos(i.pi./16), which would require wider inputs for multipliers. To minimize the effect of truncation errors, wider registers are required for each arithmetic operation. Doing so, however, results in a slower critical path and a larger chip area for each execution unit. In fact, the optimal coefficient and register width can lead to a higher speed and a smaller chip area. However, both widths should be chosen to ensure the minimum accuracy criteria for 2-D DCT specified by ITU-T Recommendation H.261 and 2-D IDCT specified by the Joint CCITT/ISO committee (JPEG).

For a 2-D DCT, the final result is computed by using Eq. (3.52), where all other quantities can be precisely pre-computed. For a 2-D IDCT, the final result can be computed by using Eq. (3.52) and (3.49), where both coefficient quantization errors and finite wordlength truncation errors are still determined by Eq. (3.52). So the error analysis will be focused on all the arithmetic operations of Eq. (3.52).

In section 4.2, the coefficient quantization error effects for the algorithm according to the present invention are investigated. In section 4.3, the focus is shifted to the effects of arithmetic operations with different finite internal wordlengths. One optimal candidate for VLSI implementation, which combines optimal coefficient quantization errors and truncation errors, is presented in section 4.4.

4.2 Coefficient Quantization Error Effects

For the coefficients w.sub.i =(1/2)cos(i.pi./16) in Eq. (3.2), one can factor the constant "1/2" out of each w.sub.i. Using cos(i.pi./16) instead of w.sub.i in Eq. (3.3) and (3.28) scales the results by four, which can be compensated by shifting the final results right by two bits. The new coefficient parameters .OMEGA..sub.ij are defined as:

.OMEGA..sub.ij =cos(i.pi./16)cos(j.pi./16) (4.1)

Since 0<cos(i.pi./16)<1, the .OMEGA..sub.ij would still fall into the range 0<.OMEGA..sub.ij <1, i,j=0, 1, . . . , 7. In Eq. (4.1), the product cos(i.pi./16)cos(j.pi./16) can be precisely pre-computed so that there is no precision loss. The quantization error is thus greatly reduced, as only one approximation error instead of two approximation errors is associated with each multiplication unit (u.sub.kl.OMEGA..sub.ij).
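The difference between one and two approximation errors can be illustrated with a short Python sketch. It compares quantizing the pre-computed product cos(i.pi./16)cos(j.pi./16) once against multiplying two separately quantized cosines. Round-to-nearest quantization, the index range 1 to 7, and the function names are assumptions made for this illustration.

```python
import math


def quantize(value, bits):
    """Round value to the nearest multiple of 2**-bits."""
    scale = 1 << bits
    return round(value * scale) / scale


def max_errors(bits):
    """Worst-case error for (single quantization, product of two quantized cosines)."""
    direct = 0.0
    two_step = 0.0
    for i in range(1, 8):
        for j in range(1, 8):
            ci = math.cos(i * math.pi / 16)
            cj = math.cos(j * math.pi / 16)
            exact = ci * cj
            # One approximation: quantize the pre-computed product.
            direct = max(direct, abs(quantize(exact, bits) - exact))
            # Two approximations: quantize each cosine, then multiply.
            approx = quantize(ci, bits) * quantize(cj, bits)
            two_step = max(two_step, abs(approx - exact))
    return direct, two_step


for bits in (9, 12, 16):
    d, t = max_errors(bits)
    print(f"{bits}-bit: one quantization {d:.2e}, two quantizations {t:.2e}")
```

With a single quantization the error of each coefficient is bounded by half a least significant bit, whereas the product of two quantized factors can be off by up to about twice that amount.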

In the rest of section 4.2, the impact of coefficient quantization errors for the algorithm according to the present invention is investigated. The impact of truncation errors caused by finite wordlength can be overcome by using a total of 31-bit in Eq. (3.52) for both 2-D DCT and 2-D IDCT.

Table 4.1 shows a 16-bit representation for the coefficients .OMEGA..sub.ij, which is the highest quantization precision used in the simulation for the proposed algorithm. The maximum quantization error with 16-bit representation for all .OMEGA..sub.ij is 0.000007.

TABLE 4.1 16-bit Representation of Coefficient .OMEGA..sub.ij (Hex)

         j = 1   j = 2   j = 3   j = 4   j = 5   j = 6   j = 7
i = 1   0.F642  0.E7F8  0.D0C4  0.B18B  0.8B7E  0.6016  0.30FC
i = 2           0.DA83  0.C4A7  0.A73D  0.8366  0.5A82  0.2E24
i = 3                   0.B0FC  0.9683  0.7642  0.5175  0.2987
i = 4                           0.8000  0.6492  0.4546  0.2351
i = 5                                   0.4F04  0.366D  0.1BBF
i = 6                                           0.257E  0.131D
i = 7                                                   0.09BE
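The entries of Table 4.1 can be regenerated with a few lines of Python, which may help when checking a coefficient ROM. The sketch below assumes that .OMEGA..sub.ij is the product cos(i.pi./16)cos(j.pi./16) and that the 16-bit values are obtained by rounding to the nearest multiple of 2.sup.-16; the function name is hypothetical.

```python
import math


def omega_hex(i, j, bits=16):
    """Hex fraction of cos(i*pi/16)*cos(j*pi/16) rounded to `bits` fractional bits."""
    value = math.cos(i * math.pi / 16) * math.cos(j * math.pi / 16)
    code = round(value * (1 << bits))   # integer code of the fixed-point fraction
    width = bits // 4                   # number of hex digits
    return f"0.{code:0{width}X}"


# Print the upper triangle (the table is symmetric in i and j).
for i in range(1, 8):
    row = [omega_hex(i, j) if j >= i else " " * 6 for j in range(1, 8)]
    print(f"i = {i}  " + "  ".join(row))
```

The corner and centre entries (i=j=1, i=j=4, i=j=7) come out as 0.F642, 0.8000 and 0.09BE, in agreement with the table.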

4.2.1 2-D DCT Simulation Results

The simulation of quantization errors for an example of a 2-D DCT algorithm according to the present invention is carried out on 10,000 sets of 8.times.8 input data. Each input data value is randomly generated within the range of -256 to 255. The final 2-D DCT outputs are rounded to 12-bit integers.

The accuracy requirements for the 2-D DCT simulations are adopted from the H.261 Specification. Each of the 8.times.8 DCT output pixels should be in compliance with the specification for parameters like Peak Pixel Error, Peak Pixel Mean Square Error, Overall Mean Square Error, Peak Pixel Mean Error and Overall Mean Error for each of the 10,000 block data sets generated above. The reference data used in the statistical calculation are generated by the formula in Eq. (2.32). Additionally, the error of the DC component is analyzed since it is the most important parameter for 2-D DCT. The simulation results and accuracy requirements of H.261 for 2-D DCT are shown in Table 4.2.
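The five error parameters named above can be computed as in the following Python sketch. The text does not spell out the formulas, so the definitions used here (per-pixel statistics taken over all blocks, with "peak" meaning the worst pixel position) are an assumption that follows the usual convention of IDCT accuracy tests; the function name and dictionary keys are hypothetical.

```python
def error_statistics(reference_blocks, test_blocks):
    """Both arguments: equal-length lists of 8x8 nested lists of numbers."""
    n_blocks = len(reference_blocks)
    sum_err = [[0.0] * 8 for _ in range(8)]   # signed error sum per pixel position
    sum_sq = [[0.0] * 8 for _ in range(8)]    # squared error sum per pixel position
    peak = 0.0
    for ref, tst in zip(reference_blocks, test_blocks):
        for r in range(8):
            for c in range(8):
                e = tst[r][c] - ref[r][c]
                peak = max(peak, abs(e))
                sum_err[r][c] += e
                sum_sq[r][c] += e * e
    pixel_mse = [s / n_blocks for row in sum_sq for s in row]
    pixel_mean = [abs(s) / n_blocks for row in sum_err for s in row]
    total_err = sum(s for row in sum_err for s in row)
    return {
        "peak_pixel_error": peak,
        "peak_pixel_mse": max(pixel_mse),
        "overall_mse": sum(pixel_mse) / 64,
        "peak_pixel_mean_error": max(pixel_mean),
        "overall_mean_error": abs(total_err) / (64 * n_blocks),
    }
```

Feeding the function the reference outputs and the finite-precision outputs for all simulated blocks yields one value per column of the result tables.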

TABLE 4.2 Coefficient quantization effects for 2-D DCT

Quantization     Peak        Peak Pixel    Overall       Peak Pixel     Overall          Fixed-point
Length of        Pixel       Mean Square   Mean Square   Mean           Mean             DC
.OMEGA..sub.ij   Error       Error         Error         Error          Error            Error
H.261 Spec       .ltoreq.1   .ltoreq.0.06  .ltoreq.0.02  .ltoreq.0.015  .ltoreq.0.0015
8-bit            1.3425      0.073803      0.025681      0.005270       0.000109         0
9-bit            0.8204      0.039675      0.011384      0.003949       0.000024         0
10-bit           0.5130      0.010145      0.003495      0.001760       0.000125         0
11-bit           0.2457      0.002973      0.001227      0.001084       0.000080         0
12-bit           0.1137      0.000653      0.000445      0.000690       0.000014         0
13-bit           0.0563      0.000156      0.000116      0.000297       0.000025         0
14-bit           0.0400      0.000068      0.000041      0.000155       0.000003         0
15-bit           0.0221      0.000028      0.000011      0.000092       0.000013         0
16-bit           0.0084      0.000005      0.000002      0.000031       0.000001         0

From the above table, one can see that the computation accuracy of the proposed 2-D DCT algorithm drops gradually when the coefficient representations are reduced from 16-bit to 8-bit. At least a 9-bit coefficient representation for .OMEGA..sub.ij, which is equivalent to a 4.5-bit coefficient representation for each cos(i.pi./16), is required in order to meet the 2-D DCT accuracy requirements of H.261. The simulation results are also illustrated graphically in FIG. 9.
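The 9-bit conclusion can be checked mechanically against the H.261 limits. The Python sketch below copies the 8-bit, 9-bit and 10-bit rows of Table 4.2 (five error columns, without the fixed-point DC error) and finds the shortest coefficient length that satisfies every limit; the variable and function names are hypothetical.

```python
# Limits: peak pixel error, peak pixel MSE, overall MSE,
#         peak pixel mean error, overall mean error
H261_LIMITS = (1.0, 0.06, 0.02, 0.015, 0.0015)

# Rows copied from Table 4.2, keyed by coefficient length in bits.
TABLE_4_2 = {
    8: (1.3425, 0.073803, 0.025681, 0.005270, 0.000109),
    9: (0.8204, 0.039675, 0.011384, 0.003949, 0.000024),
    10: (0.5130, 0.010145, 0.003495, 0.001760, 0.000125),
}


def meets_spec(measured, limits=H261_LIMITS):
    """True when every measured statistic is within its limit."""
    return all(m <= lim for m, lim in zip(measured, limits))


shortest = min(bits for bits, row in TABLE_4_2.items() if meets_spec(row))
print(shortest)
```

The 8-bit row fails on the first three limits, while the 9-bit row passes all five.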

4.2.2 2-D IDCT Simulation Results

In contrast with the 2-D DCT, 2-D IDCT simulations of the proposed algorithm need to be carried out in a much more complicated way, which can be summarized as [JPEG, RY90]:
(1) Generate random integer data values in the range -L to +H. 10,000 block sets of 8.times.8 input data should be generated for each of (L=300, H=300), (L=256, H=255) and (L=5, H=5);
(2) For each 8.times.8 input data block, the 2-D DCT is performed with at least 64-bit floating point accuracy;
(3) For each block, the 8.times.8 transformed results are rounded to the nearest integer values and clipped to the range -2048 to +2047;
(4) The "reference" 2-D IDCT output results are computed with at least 64-bit floating point accuracy, in which the input data are the data generated in step (3) and the output data are clipped to the range -256 to +255;
(5) The proposed 2-D IDCT algorithm ("test") is used to compute 2-D IDCT output data with the same input data generated in step (3);
(6) For each of the 8.times.8 IDCT output pixels in the 10,000 block sets, measure the peak, mean, and mean square errors between the "reference" and "test" data.
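Steps (1) through (4) of the procedure above can be sketched in Python as follows. The sketch uses the standard orthonormal 8.times.8 DCT in double precision as the floating point transform; the helper names, the use of Python's random module and the seed are assumptions made for illustration, and the "test" IDCT of step (5) is left out.

```python
import math
import random

# Orthonormal 8-point DCT basis: C[u][x] = a(u) * cos((2x+1) * u * pi / 16)
C = [[(math.sqrt(0.125) if u == 0 else 0.5) * math.cos((2 * x + 1) * u * math.pi / 16)
      for x in range(8)] for u in range(8)]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(8)) for j in range(8)]
            for i in range(8)]


def transpose(m):
    return [list(row) for row in zip(*m)]


def dct2(block):
    """Forward 2-D DCT: C * X * C^T."""
    return matmul(matmul(C, block), transpose(C))


def idct2(coeffs):
    """Inverse 2-D DCT: C^T * Y * C."""
    return matmul(matmul(transpose(C), coeffs), C)


def clip(v, lo, hi):
    return max(lo, min(hi, v))


def reference_pair(rng, low, high):
    """Return (integer IDCT input from step 3, reference IDCT output from step 4)."""
    pixels = [[rng.randint(-low, high) for _ in range(8)] for _ in range(8)]   # step (1)
    coeffs = dct2(pixels)                                                        # step (2)
    coeffs_int = [[clip(round(v), -2048, 2047) for v in row] for row in coeffs]  # step (3)
    reference = [[clip(v, -256, 255) for v in row] for row in idct2(coeffs_int)] # step (4)
    return coeffs_int, reference


rng = random.Random(0)   # fixed seed, chosen only to make the run repeatable
idct_input, reference = reference_pair(rng, 256, 255)
```

The same `idct_input` block would then be passed to the finite-precision IDCT under test, and the two outputs compared pixel by pixel as in step (6).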

The simulations of quantization error effects for the proposed 2-D IDCT algorithm are also carried out on 10,000 sets of randomly generated 8.times.8 input blocks. The 2-D IDCT output data are rounded to 9-bit integers through saturation control with .+-.0.5 adjustment (based on the .+-.sign of each number). Several important parameters, such as Peak Pixel Error, Peak Pixel Mean Square Error, Overall Mean Square Error, Peak Pixel Mean Error and Overall Mean Error, are summarized in Table 4.3 for the simulation of all 10,000 sets. The results calculated with the 2-D IDCT formula in Eq. (2.33) have been used as "reference", and the input data range is from -256 to +255.
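The rounding with .+-.0.5 adjustment and saturation control described above can be read as "round half away from zero, then clamp to the 9-bit two's-complement range". The Python sketch below follows that reading, which is an interpretation of the text rather than a statement of the exact circuit; the function name is hypothetical.

```python
def round_and_saturate(value, bits=9):
    """Round to an integer using a sign-dependent 0.5 offset, then saturate."""
    lo = -(1 << (bits - 1))          # -256 for 9 bits
    hi = (1 << (bits - 1)) - 1       # +255 for 9 bits
    adjusted = value + 0.5 if value >= 0 else value - 0.5
    rounded = int(adjusted)          # int() truncates toward zero
    return max(lo, min(hi, rounded))


print(round_and_saturate(2.5), round_and_saturate(-2.5), round_and_saturate(300.2))
```

Adding the offset before truncation lets the rounding be done with a single adder, and the final clamp keeps out-of-range results from wrapping around.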

From Table 4.3, a conclusion similar to that for the proposed 2-D DCT algorithm can be reached: the computation accuracy of the proposed 2-D IDCT algorithm drops gradually when the coefficient quantization precision is reduced from 16-bit to 8-bit. At least a 9-bit coefficient representation for .OMEGA..sub.ij, which is equivalent to a 4.5-bit coefficient representation for each cos(i.pi./16), is required in order to meet the 2-D IDCT accuracy requirements of JPEG. The specification and the results for all required input data ranges are shown in Table 4.4. The simulation results are also illustrated graphically in FIG. 10.

TABLE 4.3 Coefficient quantization effects for 2-D IDCT

Quantization                 Peak Pixel    Overall
Length of        Peak Pixel  Mean Square   Mean Square   Peak Pixel     Overall
.OMEGA..sub.ij   Error       Error         Error         Mean Error     Mean Error
JPEG Spec        .ltoreq.1   .ltoreq.0.06  .ltoreq.0.02  .ltoreq.0.015  .ltoreq.0.0015
8-bit            1.7694      0.156273      0.151710      0.005913       0.000003
9-bit            0.4973      0.011707      0.011359      0.002877       0.000018
10-bit           0.2869      0.003702      0.003574      0.001670       0.000007
11-bit           0.1653      0.001346      0.001303      0.000811       0.000007
12-bit           0.1126      0.000545      0.000523      0.000600       0.000007
13-bit           0.0651      0.000204      0.000197      0.000360       0.000004
14-bit           0.0480      0.000127      0.000123      0.000256       0.000013
15-bit           0.0369      0.000095      0.000093      0.000228       0.000016
16-bit           0.0320      0.000086      0.000083      0.000261       0.000006

TABLE 4.4 Simulation results for 2-D IDCT with 9-bit .OMEGA..sub.ij quantization

Input Data                     Peak Pixel    Overall
Range             Peak Pixel   Mean Square   Mean Square   Peak Pixel     Overall
(-L to +H)        Error        Error         Error         Mean Error     Mean Error
JPEG Spec         .ltoreq.1    .ltoreq.0.06  .ltoreq.0.02  .ltoreq.0.015  .ltoreq.0.0015
L = 256, H = 255  0.4973       0.011707      0.011359      0.012877       0.000018
L = 300, H = 300  0.5642       0.016135      0.015625      0.012206       0.000010
L = 5, H = 5      0.0449       0.000236      0.000108      0.011574       0.000005

4.3 Truncation Error Effects

In order to determine the truncation errors for an example of an algorithm according to the present invention, different finite internal wordlengths are used in all arithmetic operations whereas the .OMEGA..sub.ij coefficient quantization is kept at a fixed 16-bit precision. For both the 2-D DCT and 2-D IDCT, the corresponding maximum finite wordlengths used in Eq. (3.52) are 30-bit.

4.3.1 2-D DCT Simulation Results

The finite wordlength simulations for the proposed 2-D DCT algorithm are carried out on 10,000 sets of randomly generated (in the range of -256 to 255) 8.times.8 input data. All the parameters used in this section are the same ones used in section 4.2.1. The simulation results and accuracy requirements of H.261 for 2-D DCT are shown in Table 4.5.

TABLE 4.5 Finite wordlength truncation effects for 2-D DCT

                 Peak        Peak Pixel    Overall       Peak Pixel     Overall          Fixed-point
Finite           Pixel       Mean Square   Mean Square   Mean           Mean             DC
Wordlengths      Error       Error         Error         Error          Error            Error
H.261 Spec       .ltoreq.1   .ltoreq.0.06  .ltoreq.0.02  .ltoreq.0.015  .ltoreq.0.0015
14-bit           1.8750      1.059886      0.097640      0.998788       0.15678          0
15-bit           0.6257      0.021530      0.019566      0.004038       0.000070         0
16-bit           0.3233      0.005382      0.004862      0.001518       0.000078         0
17-bit           0.1593      0.001347      0.001221      0.001022       0.000018         0
18-bit           0.0802      0.000338      0.000307      0.000476       0.000018         0
20-bit           0.0214      0.000025      0.000021      0.000127       0.000001         0
22-bit           0.0092      0.000006      0.000003      0.000089       0.000001         0
24-bit           0.0085      0.000005      0.000002      0.000081       0.000001         0
26-bit           0.0084      0.000005      0.000002      0.000085       0.000001         0
28-bit           0.0084      0.000005      0.000002      0.000062       0.000001         0
30-bit           0.0084      0.000005      0.000002      0.000031       0.000001         0

The simulation results clearly show that truncation errors do not have much effect until the finite wordlengths are reduced to less than 20-bit. Take the Peak Pixel Error parameter, for example: it has an inverse relation with the finite wordlengths when they are equal to or less than 20-bit (in Table 4.5 it roughly halves with each additional bit), but it hardly changes when the finite wordlengths are more than 20-bit. At least a 15-bit internal wordlength is required in order to meet the 2-D DCT accuracy requirements of H.261. The simulation results are also illustrated graphically in FIG. 4.3.
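The halving of the error with each additional bit is what one would expect from truncation of a single arithmetic result, as the following Python sketch illustrates. It only models the fractional part of a register; how the internal wordlength is divided between integer and fractional bits is not stated in the text, so the function names and the sampling scheme are assumptions.

```python
import math


def truncate(value, frac_bits):
    """Keep `frac_bits` fractional bits and discard the rest (floor)."""
    scale = 1 << frac_bits
    return math.floor(value * scale) / scale


def worst_truncation_error(frac_bits, samples=1000):
    """Largest truncation error over evenly spaced sample values in [0, 1)."""
    return max(k / samples - truncate(k / samples, frac_bits)
               for k in range(samples))


for frac_bits in range(4, 9):
    print(frac_bits, worst_truncation_error(frac_bits))
```

Each result stays below 2.sup.-n for n fractional bits, so every bit removed from the register doubles the bound on the error contributed by one operation.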

4.3.2 2-D IDCT Simulation Results

The finite wordlength simulations of the proposed 2-D IDCT algorithm are also carried out with 10,000 sets of randomly generated 8.times.8 input blocks. All the parameters used in this section are the same ones used in section 4.2.2. The input data range is from -256 to +255. The same "reference" and "test" results are used to calculate the error statistics. The simulation results and accuracy requirements of JPEG for 2-D IDCT are shown in Table 4.6.

TABLE 4.6 Finite wordlength truncation effects for 2-D IDCT

                             Peak Pixel    Overall
Finite           Peak Pixel  Mean Square   Mean Square   Peak Pixel     Overall
Wordlengths      Error       Error         Error         Mean Error     Mean Error
JPEG Spec        .ltoreq.1   .ltoreq.0.06  .ltoreq.0.02  .ltoreq.0.015  .ltoreq.0.0015
15-bit           1.2864      0.080420      0.078200      0.008168       0.000000
16-bit           0.6514      0.020007      0.019454      0.002817       0.000000
17-bit           0.3277      0.005132      0.004934      0.001865       0.000000
18-bit           0.1529      0.001261      0.001217      0.000626       0.000000
20-bit           0.0607      0.000175      0.000169      0.000406       0.000009
22-bit           0.0363      0.000091      0.000089      0.000341       0.000001
24-bit           0.0323      0.000086      0.000084      0.000329       0.000007
26-bit           0.0320      0.000086      0.000083      0.000319       0.000005
28-bit           0.0320      0.000086      0.000083      0.000229       0.000005
30-bit           0.0320      0.000086      0.000083      0.000261       0.000006

From Table 4.6, a conclusion similar to that for the 2-D DCT algorithm can be reached: the computational accuracy of the proposed 2-D IDCT algorithm drops in proportion to the finite wordlengths when they are equal to or less than 20-bit, as illustrated in FIG. 4.4. When the coefficient quantization precision is 16-bit, all the arithmetic operations for the proposed 2-D IDCT can have no more than 16-bit finite wordlength and the output results can still meet the JPEG 8.times.8 IDCT specification for all required input data ranges, as shown in Table 4.7.

TABLE 4.7 Simulation results for 2-D IDCT with 16-bit precision

Input Data                     Peak Pixel    Overall
Range             Peak Pixel   Mean Square   Mean Square   Peak Pixel     Overall
(-L to +H)        Error        Error         Error         Mean Error     Mean Error
JPEG Spec         .ltoreq.1    .ltoreq.0.06  .ltoreq.0.02  .ltoreq.0.015  .ltoreq.0.0015
L = 256, H = 255  0.6514       0.020007      0.019454      0.002817       0.000000
L = 300, H = 300  0.6871       0.019927      0.019489      0.003108       0.000000
L = 5, H = 5      0.6472       0.019110      0.018542      0.004196       0.000000

4.4 Combined Quantization and Truncation Error Effects

The simulations carried out in sections 4.2 and 4.3 show that the coefficient quantization and finite wordlength truncation have different impacts on the accuracy of the proposed algorithm. It can be seen in section 4.3 that arithmetic operations with 16-bit finite internal wordlength would keep the truncation errors low enough to meet both H.261 and JPEG's requirements, as long as the coefficient quantization errors are relatively small. Smaller finite wordlength means the