United States Patent5191548
Balkanski , ; et al.March 2, 1993

Title

System for compression and decompression of video data using discrete cosine transform and coding techniques

Abstract

A method and a structure provide discrete cosine transform (DCT) and its inverse (IDCT) using digital FIR filters in a filter bank. The filter bank of the present invention forms a structure of cascaded filters, in which data are communicated only between filters having "parent-child" relationships. Each filter in the filter bank is required only to communicate with at most two other filters in the filter bank. Consequently, in any implementation, both the communication overhead between filters in the filter bank, and the circuit size are minimized. Therefore, the filter bank is particularly suited for integrated circuit implementation. In one embodiment, the filter bank is implemented in an image compression and decompression integrated circuit using a structure which includes pipeline registers, adders and multipliers. In that embodiment, the filter bank provides an 8-point DCT in each of the two passes of a 2-dimensional DCT used in a data compression operation. The same structure also provides an 8-point IDCT in each of the two passes of the 2-dimensional IDCT used in an data decompression operation.


Inventors:Balkanski; Alexandre (Palo Alto, CA), Purcell; Steve  (Mountain View, CA), Kirkpatrick; James  (San Jose, CA)
Assignee:C-Cube Microsystems (Milpitas, CA)
Appl. No.:818403
Filed:January 3, 1992

Current U.S. Class:708/402 708/319 708/401 
Field of Search:364/724.01,724.16,725,726

U.S. Patent Documents
4152772May 1979Sperser et al.
4196448April 1980Whitehouse et al.
4293920October 1981Merola
4791598December 1988Liou et al.
4860097August 1989Hartnack et al.
Other References
Matsumura et al., "An Array Processor for 2-D Discrete Cosine Transforms", Signal Processing III: Theory and Applications, Elsevier Science Publishers B.V. (North Holland) 1986 pp. 1283-1286..~
Primary Examiner: Nguyen; Long T.
Attorney, Agent or Firm:Skjerven, Morrill, MacPherson, Franklin & Friel

Parent Case Text



This application is a continuation application of U.S. patent application, Ser. No. 07/495,583 filed on Mar. 16, 1990, entitled "System for Compression and Decompression of Video Data using Discrete Cosine Transform and Coding Techniques," now abandoned, which is a continuation-in-part application of U.S. patent application, Ser. No. 07/494,242 filed on Mar. 14, 1990, also entitled "System for Compression and Decompression of Video Data Using Discrete Cosine Transform and Coding Techniques."

Claims


We claim:
1. In an image processing apparatus, a finite impulse response digital filter bank for performing an N-point discrete cosine transform on N pixels of an image, where N is an integer, comprises:
a circuit for receiving signal samples representing said N pixels;
N FIR digital filters coupled to said circuit for performing said discrete cosine transform on said signal samples, each of said N FIR digital filter comprising registers and logic circuits for performing arithmetic operations, wherein the k.sup.th filter of said N FIR digital filters has, the Z-transform representation, a system function of the form, ##EQU18## where k=0, 1, 2 . . . , N-1, and wherein k.sup.th filter comprises a plurality of cascaded FIR digital filters, each cascaded FIR digital filter implementing at least one zero of said system function.

2. A FIR digital filter bank as in claim 1 for computing an N-point discrete cosine transform, wherein said plurality of cascaded FIR digital filters of said k.sup.th filter are each a node of a binary tree of FIR digital filters, said binary tree of FIR digital filters constituting said FIR digital filter bank.

3. A FIR digital filter bank as in claim 2 for computing an N-point discrete cosine transform, where said binary tree of FIR digital filters has 1+log.sub.2 N levels, such that the m.sup.th level of said 1+log.sub.2 N levels comprises 2.sup.m filters, said filters of said m.sup.th level being of order N/2.sup.m-1, where m is an integer.

4. A FIR digital filter bank as in claim 3, wherein the value of N is 8.

5. A FIR digital filter bank as in claim 4, wherein said binary tree of FIR digital filters comprises:
a first level of FIR filters having system functions H(z)=z.sup.8 +1 and H(z)=z.sup.8 -1;
a second level of FIR digital filters having system functions H(z)=z.sup.4 +1, H(z)=z.sup.4 -1, ##EQU19## a third level of FIR digital filters having system functions H(z)=z.sup.2 +1, H(z)=z.sup.2 -1, ##EQU20## a fourth level of FIR digital filters having system functions ##EQU21##

6. An apparatus for computing an 8-point discrete cosine transform for a sequence of signal samples x[0] . . . x[7], each sample representing a pixel of an image, comprising:
a first circuit for receiving said signal sample sequence;
means, coupled to said first circuit, for providing signals representing first quantities a[0] . . . a[3], such that
means, coupled to said first circuit, for providing signals representing second quantities b[0] . . . b[3], such that
means, coupled to receive said signals representing said first quantities, for providing signals representing third quantities c[0] and c[1], such that
means, coupled to receive said signals representing said first quantities, for providing signals representing fourth quantities d[0] and d[1], such that
means, coupled to receive said signals representing said second quantities, for providing signals representing fifth quantities e[0] and e[1], such that ##EQU22## means, coupled to receive said signals representing said second quantities, for providing signals representing sixth quantities f[0[ and f[1], such that ##EQU23## means, coupled to receive said signals representing said third quantities, for providing signals representing a seventh quantity g[0], such that g[0]=c[0]+c[1];
means, coupled to receive said signals representing said third quantities, for providing signals representing an eight quantity h[0], such that h[0]=c[0]-c[1];
means, coupled to receive said signals representing said fourth quantities, for providing signals representing a ninth quantity i[0], such that ##EQU24## means, coupled to receive said signals representing said fourth quantities, for providing signals representing a tenth quantity j[0], such that ##EQU25## means, coupled to receive said signals representing said fifth quantities, for providing signals representing an eleventh quantity l[0], such that ##EQU26## means, coupled to receive said signals representing said fifth quantities ad for providing signals representing a twelve quantity m[0], such that ##EQU27## means, coupled to receive said signals representing said sixth quantity, for providing signals representing a thirteenth quantity n[0], such that ##EQU28## means, coupled to receive said signals representing said sixth quantity, for providing a signals representing a fourteenth quantity o[0], such that ##EQU29## means, coupled to receive said signals representing said seventh, eight, ninth, tenth, eleventh, twelfth, thirteen ad fourteenth quantities, for providing output signals representing discrete cosine transform coefficients X[0] . . . X[8], such that ##EQU30## wherein each of said means for providing signals comprises registers and logic circuits for performing arithmetic operations.

7. A method for computing a 8-point discrete cosine transform for a signal sample sequence x[0] . . . x[7], each sample representing a pixel of an image, comprising the steps of:
receiving said signal sample sequence and providing a circuit for computing quantities a[0] . . . a[3], such that
providing a circuit for computing first quantities b[0] . . . b[3], such that
providing a circuit for computing second quantities c[0] and c[1], such that
providing a circuit for computing third quantities d[0] and d[1], such that
providing a circuit for computing fourth quantities e[0] and e[1], such that ##EQU31## providing a circuit for computing fifth quantities f[0] and f[1], such that ##EQU32## providing a circuit for computing a sixth quantity g[0], such that e[0]=c[0]+c[1];
providing a circuit for computing a seventh quantity h[0], such that h[0]=c[0]-c[1];
providing a circuit for computing an eight quantity i[0], such that ##EQU33## providing a circuit for computing a ninth quantity j[0], such that ##EQU34## providing a circuit for computing a tenth quantity l[0], such that ##EQU35## providing a circuit for computing a eleventh quantity m[0], such that ##EQU36## providing a circuit for computing a twelfth quantity n[0], such that ##EQU37## providing a circuit for computing a thirteenth quantity o[0], such that ##EQU38## and providing signals representing discrete cosine transform coefficients X[0] . . . X[8], such that ##EQU39## wherein each of said steps of providing a circuit provides a circuit comprising registers and logic circuits for arithmetic operations.

Description

INDEX

BACKGROUND OF THE INVENTION

DESCRIPTION OF THE PRIOR ART

SUMMARY OF THE INVENTION

BRIEF DESCRIPTION OF THE DRAWINGS

DETAILED DESCRIPTION

Theory

Filter Implementation

Overview of an Embodiment of the Present Invention

Structure and Operation of the Video Bus Controller Unit

Structure and Operation of the Block Memory Unit

Memory Access Modes in the Block Memory Unit

Data Flow in the Discrete Cosine Transform Units

Structure and Operation of the DCT Input Select Unit

Operation of the DCT Input Select Unit During Compression

Operation of the DCT Input Select During Decompression

Structure and Operation of the DCT Row Storage Unit

The In-Line Memory of the DCT Row Storage Unit

Structure and Operation of the DCT/IDCT Processor Unit

Operation of the DCT/IDCT Processor Unit During Compression

Operation of the DCT/IDCT Processor Unit During Decompression

Structure and Operation of the DCT Row/Column Separator Unit

Operation of the DCT Row/Column Separator Unit During Compression

Operation of the DCT Row/Column Separator Unit During Decompression

Structure and Operation of the Quantizer Unit

Structure and Operation of the Zig-Zag Unit

Structure and Operation of the Zero-packer/unpacker unit

Structure and Operation of the Coder/decoder Unit

The Coder Unit

The Decoder Unit

Structure and Operation of the FIFO/Huffman Code Bus Controller Unit

Structure and Operation of the Host Bus Interface Unit

An Application of the Present Invention

BACKGROUND OF THE INVENTION

This invention relates to the compression and decompression of data and in particular to the reduction in the amount of data necessary to be stored for use in reproducing a high quality video picture.

DESCRIPTION OF THE PRIOR ART

In order to store images and video on a computer, the images and video must be captured and digitized. Image capture can be performed by a wide range of input devices, including scanners and video digitizers.

A digitized image is a large two-dimensional array of picture elements, or pixels. The quality of the image is a function of its resolution, which is measured in the number of horizontal and vertical pixels. For example, a standard display of
640 by 480 has 640 pixels across (horizontally) and 480 from top to bottom (vertically). However, the resolution of an image is usually referred to in dots per inch (dpi). Dots per inch are quite literally the number of dots per inch of print capable of being used to make up an image measured both horizontally and vertically on, for example, either a monitor or a print medium. As more pixels are packed into smaller display area and more pixels are displayed on the screen, the detail of the image increases--as well as the amount of memory required to store the image.

A black and white image is an array of pixels that are either black or white, on or off. Each pixel requires only one bit of information. A black and white image is often referred to as a bi-level image. A gray scale image is one such that each pixel is usually represented using 8 bits of information. The number of shades of gray that can thus be represented is therefore equal to the number of permutations achievable on the 8 bits, given that each bit is either on or off, equal to 2.sup.8
or 256 shades of gray. In a color image, the number of possible colors that can be displayed is determined by the number of shades of each of the primary colors, Red, Green and Blue, and all their possible combinations. A color image is represented in full color with 24 bits per pixel. This means that each of the primary colors is assigned 8 bits, resulting in 2.sup.8 .times.2.sup.8 .times.2.sup.8 or 16.7 million colors possible in a single pixel.

In other words, a black and white image, also referred to as a bi-level image, is a two dimensional array of pixels, each of 1 bit. A continuous-tone image can be a gray scale or a color image. A gray scale image is an image where each pixel is allocated 8 -bits of information thereby displaying 256 shades of gray. A color image can be 8-bits per pixel, corresponding to 256 colors or 24-bits per pixel corresponding to 16.7 million colors. A 24-bit color image, often called a true-color image, can be represented in one of several coordinate systems, the Red, Green and Blue (RGB) component system being the most common.

The foremost problem with processing images and video in computers is the formidable storage, communication, and retrieval requirements.

A typical True Color (full color) video frame consists of over 300,000 pixels (the number of pixels on a 640 by 480 display), where each pixel is defined by one of 16.7 million colors (24-bit), requiring approximately a million bytes of memory. To achieve motion in, for example, an NTSC video application, on needs 30 frames per second or two gigabytes of memory to store one minute of video. Similarly, a full color standard still frame image (8.5 by 11 inches) that is scanned into a computer at
300 dpi requires in excess of 25 Megabytes of memory. Clearly these requirements are outside the realm of existing storage capabilities.

Furthermore, the rate at which the data need to be retrieved in order to display motion vastly exceeds the effective transfer rate of existing storage devices. Retrieving full color video for motion sequences as described above (30M bytes/sec) from current hard disk drives, assuming an effective disk transfer rate of about 1 Mbyte per second, is 30 times too slow; from a CD-ROM, assuming an effective transfer rate of 150 bytes per second, is about 200 times too slow.

Therefore, image compression techniques aimed at reducing the size of the data sets while retaining high levels of image quality have been developed.

Because images exhibit a high level of pixel to pixel correlation, mathematical techniques operating upon the spatial Fourier transform of an image allow a significant reduction of the amount of data that is required to represent an image; such reduction is achieved by eliminating information to which the eye is not very sensitive. For example, the human eye is significantly more sensitive to black and white detail than to color detail, so that much color information in a picture may be eliminated without degrading the picture quality.

There are two means of image compression: lossy and lossless. Lossless image compression allows the mathematically exact restoration of the image data. Lossless compression can reduce the image data set by about one-half. Lossy compression does not preserve all information but it can reduce the amount of data by a factor of about thirty (30) without affecting image quality detectable by the human eye.

In order to achieve high compression ratios and still maintain a high image quality, computationally intensive algorithms must be relied upon. And further, it is required to run these algorithms in real time for many applications.

In fact, a large spectrum of applications requires the following:

(i) the real-time threshold of 1/30th of a second, in order to process frames in a motion sequence; and

(ii) the human interactive threshold of under one (1) second, that can elapse between tasks without disrupting the workflow.

Since the processor capable of compressing a 1 Mbyte file in 1/30th of a second is also the processor capable of compressing a 25 Mbyte file--a single color still frame image--in less than a second, such a processor will make a broad range of image compression applications feasible.

Such a processor will also find application in high resolution printing. Since having such a processor in the printing device will allow compressed data to be sent from a computer to a printer without requiring the bandwidth needed for sending non-compressed data, the compressed data so sent may reside in an economically reasonable amount of local memory inside the printer, and printing may be accomplished by decompressing the data in the processor within a reasonable amount of time.

Numerous techniques have been proposed to reduce the amount of data required to be stored in order to reproduce a high quality picture particularly for use with video displays. Because of the high cost of memory, the ability to store a given quality picture with minimal data is not only important but also greatly enhances the utility of computer system utilizing video displays. Among the work done in this area is work by Dr. Wen Chen as disclosed in U.S. Pat. Nos. 4,302,775, 4,385,363,
4,394,774, 4,410,916, 4,698,672 and 4,704,628. One technique for the storage of data for use in reproducing a video image is to transform the data into the frequency domain and store only that information in the frequency domain which, when the inverse transform is taken, allows an acceptable quality reproduction of the space varying signals to reproduce the video picture. Dr. Herbert Lohscheller's work as described in European Patent Office Application No. 0283715 also describes an algorithm for providing data compression.

Dr. Chen's U.S. Pat. No. 4,704,628 alluded to in the above described data transmission/receiving system uses intraframe and interframe transform coding. In intraframe and interframe transform coding, rather than providing the actual transform coefficients as output, the output encoded data are block-to-block difference values (intraframe) and frame-to-frame difference values (interframe). While coding differences rather than actual coefficients reduce the bandwidth necessary for transmission, large amounts of memory for storage of prior blocks and prior frames are required during the compression and decompression processes. Such systems are expensive and difficult to implement, especially on an integrated circuit implementation where "real estate" is a premier concern.

U.S. Pat. No. 4,385,363 describes a discrete cosine transform processor for 16 pixel by 16 pixel blocks. The 5-stage pipeline implementation disclosed in the '363 patent is not readily usable for operation with 8 pixel by 8 pixel blocks. Furthermore, Chen's algorithm requires global shuffling at stages 1, 4 and 5.

Despite the prior art efforts, the information which must be stored to reproduce a video picture is still quite enormous. Therefore, substantial memory is required particularly if a computer system is to be used to generate a plurality of video images in sequence to replicate either changes in images or data. Furthermore, the prior art has also failed to provide a processor capable of processing video pictures in real time.

SUMMARY OF THE INVENTION

The present invention provides a data compression/decompression system capable of significant data compression of video or still images such that the compressed images may be stored in the mass storage media commonly found in conventional computers.

The present invention also provides

(i) a data compression/decompression system which will operate at real time speed, i.e. able to compress at least thirty frames of true color video per second, and to compress a full-color standard still frame (8.5".times.11.times. at 300 dpi) within one second;

(ii) a system adhering to an external standard so as to allow compatibility with other computation or video equipment;

(iii) a data compression/decompression system capable of being implemented in an integrated circuit chip so as to achieve the economic and portability advantages of such implementation.

In accordance with this invention, a data compression/decompression system using a discrete cosine transform is provided to generate a frequency domain representation of the spatial domain waveforms which represent the video image. The discrete cosine transform may be performed by finite impulse response (FIR) digital filters in a filter bank. In this case, the inverse transform is obtained by passing the stored frequency domain signals through FIR digital filters to reproduce in the spatial domain the waveforms comprising the video picture. Thus, the advantage of simplicity in hardware implementation of FIR digital filters is realized. The filter bank according to this invention possesses the advantages of linear complexity and local communication. This system also provides Huffman coding of the transform domain data to effectuate large data compression ratios. This system may be implemented as an integrated circuit an may communicate with a host computer using an industry standard bus provided in the data compression/decompression system according to the present invention. Accordingly, by combining in hardware a novel discrete cosine transform algorithm, quantization and coding steps, minimal data are required to be stored in real time for subsequent reproduction of a high quality replica of an original image.

This invention will be more fully understood in conjunction with the following detailed description taken together with the accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

FIGS. 1-1 and 1-2 form FIG. 1 which shows a block diagram of an embodiment of the present invention.

FIG. 2 shows a schematic diagram of the video bus controller unit 102 of the embodiment shown in FIG. 1.

FIGS. 3-1 and 3-2 form FIG. 3 which shows a block diagram of the block memory unit 103 of the embodiment shown in FIG. 1.

FIGS. 4a-1 and 4a-2 form FIG. 4a which shows a data flow diagram of the Discrete Cosine Transform (DCT) units, consisting of the units 103-107 of the embodiment shown in FIG. 1.

FIGS. 4b-1 to 4b-4 form FIG. 4b which shows the schedule of 4:1:1 data flow in the DCT units under compression condition.

FIGS. 4c-1 and 4c-2 form FIG. 4c which shows the schedule of 4:2:2 data flow in the DCT units under compression condition.

FIGS. 4d-1 to 4d-4 form FIG. 4d which shows the schedule of 4:1:1 data flow in the DCT units under decompression condition.

FIGS. 4e-1 and 4e-2 form FIG. 4e which shows the schedule of 4:2:2 data flow in the DCT units under decompression condition.

FIGS. 5a-1 to 5a-4 form FIG. 5a which shows a schematic diagram of the DCT input select unit 104 of the embodiment shown in FIG. 1.

FIGS. 5b-1 to 5b-3 form FIG. 5b which shows the schedule of control signals of the DCT input select unit 104 under compression condition, according to the clock phases.

FIGS. 5c-1 to 5c-4 form FIG. 5c which shows the schedule of control signals of the DCT input select unit 104 under decompression condition, according to the clock phases.

FIGS. 6a-1 and 6a-2 form FIG. 6a which shows a schematic diagram of the DCT row storage unit 105 of the embodiment shown in FIG. 1.

FIG. 6b shows a horizontal write pattern of the memory arrays 609 and 610 in the DCT row storage unit 105 of FIG. 6a.

FIG. 6c shows a vertical write pattern of the memory arrays 609 and 610 in the DCT row storage unit 105 of FIG. 6a.

FIGS. 7a-1 and 7a-2 form FIG. 7a which shows a schematic diagram of the DCT/IDCT processor unit 106 of the embodiment shown in FIG. 1.

FIG. 7b shows a flow diagram of the DCT computational algorithm used under compression condition in the DCT/IDCT processor unit 105 of FIG. 7a.

FIGS. 7c-1 to 7c-4 form FIG. 7c which shows the data flow schedule of the DCT computational algorithm used under compression condition in the DCT/IDCT processor unit 105 of FIG. 7a.

FIGS. 7d-1 to 7d-3 form FIG. 7d which shows the schedule of control signals of the DCT/IDCT processor unit 105 shown in FIG. 7a under compression condition.

FIG. 7e shows a flow diagram of the DCT computational algorithm used under decompression condition in the DCT/IDCT processor unit 105 of FIG. 7a.

FIGS. 7f-1 to 7f-4 form FIG. 7f which shows the data flow schedule of the DCT/IDCT processor unit 105 of FIG. 7a under decompression condition.

FIGS. 7g-1 to 7g-3 form FIG. 7g which shows the schedule of control signals of the DCT/IDCT processor unit shown in FIG. 7a under decompression condition.

FIGS. 8a-1 to 8a-3 form FIG. 8a which shows a schematic diagram of the DCT row/column separator unit 107 in the embodiment shown in FIG. 1.

FIGS. 8b-1 to 8b-6 form FIG. 8b which shows the schedule of control signals of the DCT row/column separator unit 107 under decompression condition.

FIGS. 8c-1 to 8c-6 form FIG. 8c which shows the schedule of control signals of the DCT row/column separator unit 107 shown in FIG. 7a under decompression condition.

FIGS. 9-1 and 9-2 form FIG. 9 which shows a schematic diagram of the quantizer unit 108 in the embodiment shown in FIG. 1.

FIG. 10 shows a schematic diagram of the zig-zag unit 109 in the embodiment shown in FIG. 1.

FIG. 11 shows a schematic diagram of the zero pack-unpack unit 110 in the embodiment shown in FIG. 1.

FIG. 12a shows a schematic diagram of the coder unit 11a of the coder/decoder unit 111 in the embodiment shown in FIG. 1.

FIGS. 12b-1 and 12b-2 form FIG. 12b which shows a block/diagram of the decoder unit 111b of the coder/decoder unit 111 in the embodiment shown in FIG. 1.

FIGS. 13a-1 to 13a-3 form FIG. 13a which shows a schematic diagram of the FIFO/Huffman code controller unit 112 shown in the embodiment shown in FIG. 1.

FIG. 13b shows the memory maps of the FIFO Memory 114 of the preferred embodiment in FIG. 1, under compression and decompression conditions.

FIGS. 14-1 to 14-3 form FIG. 14 which shows a schematic diagram of the host bus interface unit 113 in the embodiment shown in FIG. 1.

FIG. 15a shows a filter tree used to perform a 16-point discrete Fourier transform (DFT).

FIGS. 15b-1 to 15b-4 form FIG. 15b which shows the system functions of the filter tree shown in FIG. 15a.

FIGS. 15c-1 to 15c-4 form FIG. 15c which shows the steps of derivation from the system functions of the filter tree in FIG. 15a to a flow diagram representation of the algebraic operations of the FIR digital filter bank.

FIGS. 15d-1 and 15d-2 form FIG. 15d which shows the flow diagram resulting from the derivation shown in FIG. 15c.

FIGS. 15c-1 and 15c-2 form FIG. 15c which shows the flow diagram of the inverse discrete cosine transform, as a result of reversing the algebraic operations of the flow diagram of FIG. 15d.

FIG. 16 shows a scheme by which the speed of data compression and decompression achieved by the present invention may be used to provide image reproduction sending only compressed data over the communication channel.

DETAILED DESCRIPTION

Data compression for image processing may be achieved by (i) using a coding technique efficient in the number of bits required to represent a given image, (ii) by eliminating redundancy, and (iii) by eliminating portions of data deemed unnecessary to achieve a certain quality level of image reproduction. The first two approaches involve no loss of information, while the third approach is "lossy". The amount of information loss acceptable is dependent upon the intended application of the data. For reproduction of image data for viewing by humans, significant amounts of data may be eliminated before noticeable degradation of image quality results.

According to the present invention, data compression is achieved by use of Huffman coding (a coding technique) and by elimination of portions of data deemed unnecessary for acceptable image reproduction. Because sensitivities of human vision to spatial variations in color and image intensity have been studied extensively in cognitive science, these characteristics of human vision are available for data compression of images intended for human viewing. In order to reduce data based on spatial variations, it is more convenient to represent and operate on the image represented in the frequency domain.

This invention performs data compression of the input discrete spatial signals in the frequency domain. The present method transforms the discrete spatial signals into their frequency domain representations by a Discrete Cosine Transform (DCT). The discrete spatial signal can be restored by an inverse discrete cosine transform (IDCT).

Theory

A discrete spatial signal can be represented as a sequence of signal sample values written as:

x[n] denotes a signal represented by N signal sample values at N point in space. The N-point DCT of this spatial signal is defined as ##EQU1## a method of computing the DCT of x[n] is derived and illustrated in the following:

F1. The discrete spatial signal x[n] is shifted by 1/2 sample in the increasing n direction and mirrored about n=N to form to form the resulting signal x[n], written as: ##EQU2##

F2. A 2N-point discrete Fourier Transform (DFT) is applied to the signal x[n]. The transformed representation of x[n] is written as: ##EQU3##

P3. Because of relations (1) and (2), the DCT of x[n], i.e., X[k], is readily obtained by setting X[k] to zero for k.gtoreq.N (truncation), or ##EQU4## Furthermore, the frequency domain representation of x[n], i.e. X[k], has the following properties ##EQU5## Therefore, as will be shown below, despite truncation in step F3 the inverse transformation can be obtained using the information of (3), (4) and (5).

The inverse transformation, hence, follows the steps:

I1. The sequence X[k] is reconstructed from X[k] by a mirroring X[k] about k=N, and scaling appropriately, i.e. ##EQU6## (using relations (3), (4) and (5))

I2. The 2N-point inverse discrete Fourier transform (IDFT) is then applied to X[k]. ##EQU7##

I3. Finally, x[n] may be obtained by setting x[n] to zero for n.gtoreq.N and shifting the signal by 1/2 sample in the decreasing n direction, i.e. ##EQU8##

Filter Implementation

The Discrete Cosine Transform (DCT) and its inverse outlined in steps F1-F3 and I1-I3 steps discussed in the theory section above can be realized by a set of finite impulse response (FIR) digital filters. As discussed in the theory section above, DCT, and similarly IDCT, may be obtained through the use of a DFT or an inverse DFT at steps F2 and I2 respectively.

Because DFT, and similarly its inverse, can be seen as a system of linear equations of the form: ##EQU9## the transform can be seen as being accomplished by a bank of filters, one filter for each value of k (forward DFT) or n (inverse DFT). The system function (z-transform of a filter's unit sample response) of each filter may be generally written as,

(a) H.sub.k (Z) in the forward DFT, for the kth filter, ##EQU10## or equivalently, ##EQU11##

The last formulation (P1) specifically points out that the 2N-1 zeroes of the kth filter lie on the unit circle of the Z-plane, separated .pi./N radially, except for l=k which is not a zero of the filter.

(b) Similarly, the system function G.sub.n (Z) for the inverse DFT in the nth filter, ##EQU12## Again, it can be seen that the zeroes of the nth filter in the inverse DFT transform lie on the unit circle separated by .pi./N radially, except for the l=n. The structure of equations P1 and P2 suggests that both forward and inverse DFTs may be implemented by the same filter banks with proper scaling (noting that P1 and P2 has identical zeroes for any k=n).

The representation of P1 suggests a "recursive" implementation of the FIR filter, i.e. the FIR filter may be formed by cascading 2N-1 single-point filters, each having a zero at a different integral multiple of e.sup.j.pi.k/N or e.sup.j.pi.k/N. For example, we may rewrite the kth (forward) or nth (inverse) filter as ##EQU13## where R.sup.l is the l.sup.th zero, ##EQU14## Furthermore, we may write

where P.sub.mk (z) denotes a FIR filter having 2N-2 zeroes spaced .pi./N apart, except for l=k,m. Here, P.sub.k (z) is represented as a cascade of a 2N-2 point filter P.sub.mk (z) and a single point filter having a zero at R.sup.m.

In the same way, P.sub.k (z) may also be decomposed into a cascade of a 2N-3 point FIR filter P.sub.mnk (z) and a 2-point filter having zeros at R.sup.m and r.sup.n. P.sub.mnk (z) may itself be implemented by cascading lower order FIR filters.

A 16-point DFT may be implemented by the FIR filter tree 1500 shown in FIG. 15a by selectively grouping FIR filters.

The grouping of filters shown in FIG. 15a is designed to minimize the number of intermediate results necessary to complete the DFT. A filter is characterized by its system function, and referred to as an N-th order filter if the leading term of the polynomial representing the system function is of power N. As shown in FIG. 15b, the two filters 1501 and 1502 in the first filter level are 8th order filters, i.e. the leading term of the power series representing the system function is a multiple of z.sup.8. The four filters 1503-1506 in the second level of filters are 4th order filters, and the eight filters 1507-1514 in the third level of filters are 2nd order filters. In general, a N-point DFT may be implemented by this method using (1+log.sub.2 N) levels of filters with the kth level of filters having 2.sup.k filters, each being of order N/2.sup.k-1, and such that the impulse response of each filter possesses either odd or even symmetry. Under this grouping scheme, the number of arithmetic operations are minimized because many filter coefficients are zero, and many multiplications are trivial (involving 1, -1, or a limited number of constants cos.pi.l/N, where l is an integer). These properties lend to simplicity of circuit implementation. Furthermore, as will be shown in the following, computation at each level of filters involves only output data of the previous level, and, treating each filter as a node in a tree structure, specifically each child node depends only on output data of the immediate parent node. Therefore, no communication is required between data output of filters not in a "parent-child" relationship. This property results in "local connectivity" essential for area efficiency in an integrated circuit implementation. This filter tree 1500 has the following properties:

(i) all branches have the same number of zeros; and

(ii) all stages have the same number of zeros. These properties provide the advantages of locally connected filters ("local connectivity") and a maximum number of filters from which data must be supplied ("fan out") of two. The property of local connectivity, defined below, minimizes communication overhead. Minimum fan out of two allows a compact implementation in integrated circuits requiring high space efficiency.

In FIG. 15a, each rectangular box represents a filter having the zeroes W.sup.l, for the values of l shown inside the box. W is e.sup.j.pi.k/N or e.sup.-j.pi.n/N dependent upon whether DCT or IDCT is computed. Recalling that, in order to obtain DCT from DFT, at steps F3 and I3, the DFT results for k.gtoreq.N (forward) or n.gtoreq.N are set to zero. Hence, only the portions of this filter tree that yield DFT results for k<N (forward) and n<N (inverse) need be implemented. The required DFT results are each marked in FIG. 15a with a "check".

The system functions for the forward transform filters are shown in FIG. 15b. Because of the symmetry in the input sequence and in the system function of the FIR filters, tracking carefully the intermediate values and eliminating duplicate computation of the same value, the flow graph of FIG. 15c is obtained. FIG. 15c illustrates these tracking steps by following the computation of the first three stages in the filter tree 1500 shown in FIG. 15a. Recall that at step F1, the input sequence X[n] is mirrored about n=N to obtain the input sequence X[n] to the 16-point DFT. Therefore x[n] is x[0], x[1], x[2]. . . x[7], x[7], x[6], . . . , x[0]. This sequence is used to compute the 8-point DCT. As shown in FIG. 15c. the filter
1501 has system function H(Z)=Z.sup.8 +1; hence, the first eight output data a[0]. . . a[7] are each the sum of two samples of the input sequence, each sample being 8 unit "delays" apart, e.g. a[0]=x[0]+x[7]; a[1]=x[1]+x[6] etc. (These delays are not delays in time, but a distance in space since x[n] is a spatial sequence.) Because of the symmetry of the input sequence x[n], a[0]. . . a[7] are symmetrical about n=31/2. Therefore, when implementing this filter 1501, only the first four values a[0]. . . a[3] need actually be computed, a[4]. . . a[7] having values corresponding respectively to a[3]. . . a[1]. Computation of a[0]. . . a[3] is provided in the first four values of stage 2 shown in FIG. 15d. The operations to implement filter 1501
are shown in FIG. 15c.

The same procedure is followed for filter 1502. Filter 1502, however, possesses odd symmetry, i.e. b[0]=-b[7]; b[1]=-b[6] etc. For most implementations, including the embodiment described below, the algebraic sign of an intermediate value may be provided at a later stage when the value is used for a subsequent operation. Thus, in filter 1502, as in filter 1501, only the first four values b[0]. . . b[3] need actually be computed, since b[4]. . . b[7] may be obtained by a sign inversion of the values b[3]. . . b[0] respectively at a subsequent operation. The operations to implement 1502 are shown in FIG. 15c. Hence, the bottom four values at stage 2 shown in FIG. 15d are provided for computation of values b[0]. . . b[3].

Accordingly, by mechanically tracking the values computed at the previous stages, and noting the symmetry of each filter, the operations required to implement filters 1503-1514 are determined in the same manner as described above for filter 1501
and 1502, the result of the derivation is the flow diagram shown in FIG. 15d.

Finally, because of the symmetry of the output in filters 1507-1514, and the symmetry in filters 1515-1530, the required output data X[0]. . . X[7] are obtained by multiplying g[0], h[0], i[0]. . . o[0] by ##EQU15## respectively.

The inverse transform flow diagram FIG. 15e is obtained by reversing the algebraic operations of the forward transform flow diagram in FIG. 15d.

Thus intermediate results s1-s7 at stage 2 in FIG. 15e are given by reversing the algebraic operations for obtaining x(0)-x(7) at stage 8 of FIG. 15d. That is, ignoring for the moment a factor 1/2. ##EQU16##

(In general, the scale factors, such as the 1/2 above, may be ignored because they are recaptured by output scaling). The same process is repeated by reversing the intermediate results s1-s7 at stage 6 of FIG. 15d to derive intermediate results p1-p7 at stage 4 of FIG. 15e. The intermediate results z1-z7, y1-y7 are similarly derived and additional intermediate results are then derived until the final values x(0)-x(7) are derived. The process is summarized below: ##EQU17## computation algorithm may be measured in two dimensions: (i) computational complexity and (ii) communication requirements. According to the present invention, the computational complexity of the DCT, measured by the number of multiplication steps needed to accomplish the DCT, taking into consideration of the throughput rate, is of order N (i.e. linear), where N is the number of points in the DCT. As discussed above, the tree structure of the filter bank results in a maximum fan out of two, which allows all communication to be "local" (i.e. data flows from the root filters--in other words, highest order filters--and no communication is required between filters not having parent-child relationship in the tree structure as described above in conjunction with FIG. 15a).

Overview of An Embodiment of the Present Invention

An embodiment of the present invention implements the "baseline" algorithm of the JPEG standard. A concise description of the JPEG standard is attached as Appendix A. FIG. 1 shows the functional block diagram of this embodiment of the present invention. This embodiment is implemented in integrated circuit form; however, the use of other technologies to implement this architecture, such as by discrete components, or by software in a computer is also feasible.

The operation of this embodiment during data compression (i.e. to reduce the amount of data required to represent a given image) is first functionally described in conjunction with FIG. 1.

FIG. 1 shows, in schematic block diagram form, a data compression/decompression system in accordance with this invention.

The embodiment in FIG. 1 interfaces with external equipment generating the video input data via the Video Bus Interface unit 102. Because the present invention provides compression and decompression (playback) of video signals in real-time, synchronization circuits 102-1 and 113-2 are provided for receiving and providing respectively synchronization signals from and to the external video equipment (not shown).

Video Bus Interface unit (VBIU) 102 accepts 24 bits of input video signal every two clock periods via the data I/O lines 102-2. The VBIU 102 also provides a 13-bit address on address liens 102-3 for use with an external memory buffer, at the user's option which provides temporary storage of input (compression) or output (decompression) data in "natural" horizontal line-by-line video data format used by many in video equipment. During compression, the horizontal line-by-line video data is read in as 8.times.8 pixel blocks for input to VBIU via I/O bus 102-2 according to addresses generated by VBIU 102 on bus 102-3. During decompression, the horizontal line-by-line video data is made available to external video equipment by writing the
8.times.8 pixel blocks output from VBIU 102 on bus 102-2 into proper address locations for horizontal line-by-line output. Again, the address generator inside VBIU 102 provides the proper addresses.

VBIU 102 accepts four external video data formats: color format (RGB) and three luminance-chrominance (YUV) formats. The YUV formats are designated YUV 4:4:4, YUV 4:2:2, and YUV 4:1:1. The ratios indicate the ratios of the relative sampling frequencies in the luminance and the two chrominance components. In the RGB format, each pixel is represented by three intensities corresponding to the pixel's intensity in each of the primary colors red, green and blue. In the YUV representations, three numbers, Y, U and V represent respectively the luminance index (Y component) and two chrominance indices (U and V components) of the unit. In the JPEG standard, groups of 64 pixels, each expressed as an 8.times.8 matrix, are compressed or decompressed at a time. The 64 pixels in the RGB and YUV 4:4:4 formats occupy on the physical display an 8.times.8 area in the horizontal and vertical directions. Because human vision is less sensitive towards colors than intensity, it is adequate in some applications to provide in the U and V components of the YUV 4:2:2 and YUV 4:1:1 formats, U and V type data expressed as horizontally averaged values over areas of 16 pixels by 8 pixels and 32 pixels by 8 pixels respectively. An 8.times.3 matrix in the spatial domain is called a "pixel" matrix, and the counterpart 8.times.8 matrix in the transform domain is called a "frequency" matrix.

Although RGB and YUV 4:4:4 formats are accepted as input, they are immediately reduced to representations in YUV 4:2:2 format. RGB data is first transformed to YUV 4:4:4 format by a series of arithmetic operations on the RGB data. YUV 4:4:4
data are converted into YUV 4:2:2 data in the VBIU 102 by averaging neighboring pixels in the U, V components. This operation immediately reduces the amount of data to be processed by one-third. As a result, the circuit in this embodiment of the present invention needs only to process YUV 4:2:2 and YUV 4:1:1 formats. As mentioned hereinabove, the JPEG standard implements a "lossy" compression algorithm; the video information lost due to translation of the RGB and YUV 4:4:4 formats to the YUV
4:2:2 format is not considered significant for purposes under the JPEG standard. In the decompression mode, the YUV 4:4:4 format is restored by providing the average value in place of the sample value discarded in the compression operation. RGB format is restored from the YUV 4:4:4 format by a series of arithmetic operation on the YUV 4:4:4 data to be described below.

As a result of the processing in the VBIU unit 102, video data are supplied to the block memory unit 103, at 16 bits (two values) per clock period. The block memory unit 103 is a buffer for the incoming stream of 16-bit video data to be sorted into 8.times.8 blocks (matrices) of the same pixel type (Y, U or V). This buffering step is also essential because the discrete cosine transform (DCT) algorithm implemented herein is a 2-dimensional transform, requiring the video signal data to pass through the DCT/IDCT processor unit 106 twice, one for each spatial direction (horizontal and vertical). Intermediate data are obtained after the video input data pass through DCT/IDCT processor unit 106 once. Consequently, DCT/IDCT processor unit 106
must multiplex between video input data and the intermediate results after the first-pass DCT operation. To minimize the number of registers needed inside the DCT unit 106, and also to simplify the control signals within the DCT unit 106, the sequence in which the elements of the pixel matrix is processed is significant.

The sequencing of the input data, and of the intermediate data after first-pass of the 2-dimensional DCT, for DCT/IDCT processor unit 106 is performed by the DCT input select unit 104. DCT input select unit 104 alternatively selects, in predetermined order, either two 8-bit words from the block memory unit 103 or two 16-bit words from the DCT row storage unit 105. The DCT row storage unit 105 contains the intermediate results after the first pass of the data through the 2-dimensional DCT. The data selected by DCT input select unit 104 is processed by the DCT/IDCT processor unit 106. The results are either, in the case of data which completed the 2-dimensional DCT, forwarded to the quantizer unit 108, or, in the case of first-pass DCT data, recycled via DCT row storage unit 105 for the second pass of the 2-dimensional DCT. This separation of data to supply either DCT row storage unit 105 or quantizer unit 108 is achieved in the DCT row/column separator unit 107. The result of the DCT operation yields two 16-bit data every clock period. A double-buffering scheme in the DCT row/column separator 107 provides a continuous stream i.e. 16 bits each clock cycle of 16-bit output data from DCT row/column separator data 107 into the quantizer data 108.

The output data from the 2-dimensional DCT is organized as an 8 by 8 matrix, called a "frequency" matrix, corresponding to the spatial frequency coefficients of the original 8 by 8 pixel matrix. Each pixel matrix has a corresponding frequency matrix in the transform (frequency) domain as a result of the 2-dimensional DCT operation. According to its position in the frequency matrix, each element is multiplied in the quantizer 108 by a corresponding quantization constant taken from the YUV quantization table 108-1. Quantization constants are obtained from an international standard body, i.e. JPEG; or, alternatively, obtained from a customized image processing function supplied by a host computer to be applied on the present set of data. The quantizer unit 108 contains a 16-bit by 16-bit multiplier for multiplying the 16-bit input from the row/column separator unit 107 to the 16-bit quantization constant from the YUV quantization table 108-1. The result is a 32-bit value with bit 31 as the most significant bit and bit 0 as the least significant bit. In this embodiment, to meet the dual goals of allowing a reasonable dynamic range, and of minimizing the number of significant bits for simpler hardware implementation, only 8 bits in the mid-range are preserved. Therefore, a 1 is added at position bit 15 in order to round up the number represented by bits 31 through 16. The eight most significant bits, and the sixteen least significant bits of this 32-bit multiplication result are then discarded. The net result is an 8-bit value which is passed to the zig-zag unit 109, to be described below. Because the quantization step tends to set the higher frequency components of the frequency matrix to zero, the quantization unit 108 acts as a low-pass digital filter. Because of the DCT algorithm, the lower frequency coefficients of the luminance (Y) or chrominance (U, V) in the original image are represented in the lower elements of the respected frequency matrices, i.e. element A.sub.ij represents higher frequency coefficients of the original image than element A.sub.mn, in both horizontal and vertical directions, if i>m and j>n.

The zig-zag unit 109 thus receives an 8-bit datum every clock period. Each datum is a quantized element of the 8 by 8 frequency matrix. As the data come in, they are individually written into a location of a 64-location memory array each location representing an element of the frequency matrix. As soon as the memory array is filled, it is read out in a manner corresponding to reading an 8 by 8 matrix in a zig-zag manner starting from the 00 position (i.e., in the order A.sub.00, A.sub.10, A.sub.01, A.sub.02, A.sub.11, A.sub.20, A.sub.30, A.sub.21, A.sub.12, A.sub.03, etc.). Because the quantization steps tend to zero higher frequency coefficients, this method of reading the 8 by 8 frequency matrix is most likely to result in long runs of zeroed frequency coefficients, providing a convenient means of compressing the data sequence by representing a long run of zeroes as a run length rather than individual values of zero. The run length is encoded in the zero packer/unpacker unit of 110.

Because of double-buffering in the zig-zag unit 109 providing for accumulation of the current 64 8-bit values and simultaneous reading out of the prior 64 8-bit values in run-length format, a continuous stream of 8-bit data is made available to the zero packer/unpacker unit 110. This data stream is packed into a format of the pattern: DC-AC-RL-AC-RL . . . , which represents in order the sequence: a DC coefficient, an AC coefficient, a run of zeroes, an AC coefficient, a run of zeroes, etc. (Element A.sub.00 of matrix A is the DC coefficient, all other entries are referred to as AC coefficients). This data stream is then stored in a first-in, first-out (FIFO) memory array 114 for the next step of encoding into a compressed data representation. The compressed data representation in this instance is Huffman codes. This memory array 114 provides temporary storage, which content is to be retrieved by the coder/decoder unit 111 under direction of a host computer through the host interface 113. In addition to storage of data to be encoded, the FIFO memory 114 also contains the translation look-up tables for the encoding. The temporary storage in FIFO memory 114 is necessary because, unlike the previous signal processing step on the incoming video signal (which is provided to the VBIU 102 continuously and which must be processed in real time) by functional units 102 through 110, the coding step is performed under the control of an external host computer, which interacts with this embodiment of the present invention asynchronously through the host bus interface 113.

Writing and reading out of the FIFO memory 114 is controlled by the FIFO/Huffman code bus controller unit 112. In addition to controlling reading and writing of zero-packed video data into FIFO memory 114, the FIFO/Huffman code bus controller
112 accesses the FIFO memory 114 for Huffman code translation tables during compression, and Huffman decoding tables during decompression. The use of Huffman code is to conform to the JPEG standard of data compression. Other coding schemes may be used at the expense of compatibility with other data compression devices using the JPEG standard.

The FIFO/Huffman code bus controller unit 112 services requests of access to the FIFO memory 114 from the zero packer/unpacker unit 110, and from coder/decoder unit 111. Data are transferred into and out of FIFO memory 114 via an internal bus
116. Because of the need to service in real time a synchronous continuous stream of video signals coming in through the VBIU 102 during compression, or the corresponding outgoing synchronous stream during decompression, the zero packer/unpacker unit 110
is always given highest priority into the FIFO memory 114 over requests from the coder/decoder unit 111 and the host computer.

Besides requesting the FIFO/Huffman code bus controller unit 112 to read the zero-packed data from the FIFO memory 114, the coder/decoder unit 111 also translates the zero-packed data into Huffman codes by looking up the Huffman code table retrieved from FIFO memory 114. The Huffman-coded data is then sent through the host interface 113 to a host computer (not shown) for storage in mass storage media. The host computer may communicate directly with various modules of the system, including the quantizer 108 and the DCT block memory 103, through the host bus 115 (FIG. 6a). This host bus 115 implements a subset of the nubus standard to be discussed at a later section in conjunction with the host bus interface 113. This host bus
115 is not to be confused with internal bus 116. Internal bus 116 is under the control of the FIFO/Huffman code bus controller unit 112. Internal bus 116 provides access to data stored in the FIFO memory 114.

The architecture of the present embodiment is of the type which may be described as a heavily "pipe-lined" processor. One prominent feature of such processor is that a functional block at any given time is operating on a set of data related to the set of data operated on by another functional block by a fixed "latency" relationship, i.e. delay in time. To provide synchronization among functional blocks, a set of configuration registers are provided. Besides maintaining proper latency among functional blocks, these configuration registers also contain other configuration information.

Decompression of the video signal is accomplished substantially in the reverse manner of compression.

Structure and Operation of the Video Bus Controller Unit

The Video Bus Controller Unit 102 provides the external interface to as video input device, such as a video camera with digitized output or to a video display. The Video Bus Controller Unit 102 further provides conversion of RGB or YUV 4:4:4
formats to YUV 4:2:2 format suitable for processing with this embodiment of the present invention during compression, and provides RGB or YUV 4:4:4 formats when required for output during decompression. Hence, this embodiment of the present invention allows interface to a wide variety of video equipment.

FIG. 2 is a block diagram of the video bus controller unit (VBIU) 102 of the embodiment discussed above. As mentioned before, RGB or YUV 4:4:4 video signals come into the embodiment as 64 24-bit values, representing an 8-pixel by 8-pixel area of the digitized image. Each pixel is represented by three components, the value of each component being represented by eight (8) bits. In the RGB format each component represents the intensity of one of three primary colors. In the YUV format, the Y component represent an index of luminance and the U and V components represent two indices of chrominance. Dependent upon the mode selected, the incoming video signals in RGB or YUV 4:4:4 formats are reduced by the VBIU 102 to 64 16-bit values: 4:4:4
YUV video data and RGB data are reduced to 4:2:2 YUV data. Incoming 4:2:2 and 4:1:1 YUV data are not reduced. The process of reducing RGB data to 4:4:4 YUV data follows the formulae:

In order to perform the 4:4:4 YUV to 4:2:2 YUV format conversion, successive values of the U and V type data are averages (see below), so that effectively the U and V data are sampled at half the frequency as the Y data .

During compression mode, the 24-bit external video data representing each pixel comes into the VBIU 102 via the data I/O bus 102-2. The 24-bit video data are latched into register 201, the latched video data are either transmitted by multiplexor
203, or sampled by the RGB/YUV converter circuit 202.

During compression mode, the RGB/YUV converter circuit 202 converts 24-bit RGB data into 24-bit YUV 4:4:4 data. The output data of RGB/YUV converter circuit 202 is forwarded to multiplexor 203. Dependent upon the data format chosen, multiplexor
203 selects either raw input data (any of 4:4:4, 4:2:2, or 4:1:1 YUV formats), or YUV 4:4:4 format data (converted from RGB format) from the RGB/YUV converter circuit 202.

The input pixel data formats under compression mode are as follows: in RGB and YUV 4:4:4 formats, pixel data are written at the data I/O bus 102-2 at 24 bits per two clock periods, in the sequence (R,G,B) (R,G,B) . . . or (Y,U,V) (Y,U,V) . . . , i.e. 8 bits for each of the data types Y, U or V in YUV format, and R,G, or B in RGB format, in 4:2:2 YUV format, pixel data are written in 16 bits per two clock periods, in the sequence (Y,U) (Y,V) (Y,U) . . . ; and, in the 4:1:1 YUV format data are written in 12 bits per two clock periods, in the sequence (Y, LSB's U), (Y, MSB's U) (Y, LSB's V) (Y, MSB's V) (Y, LSB's U) . . . [MSB and LSB are respectively "most significant bits" and "least significant bits"].

The output data from multiplexor 203 is forwarded to the YUV/DCT converter unit 204, which converts the 24-bit input video data into 16-bit format for block memory unit 103. The 16-bit block storage format requires that each 16-bit datum be one of (Y,Y), (U,U), (V,V), i.e. two 8-bit data of the same type is packed in a 16-bit datum.

Therefore, the (Y,U,V) . . . (Y,U,V) format for the YUV 4:4:4 format data is repacked from 24-bit data sequence Y0U0V0, Y1U1V1, Y2U2V2, Y3U3V3, . . . Y7U7V7 to 16 -bit data sequence Y0Y1, U01U23, Y2Y3, V01V23, Y4Y5, etc., where Umn denotes the
8-bit average of U.sub.m and U.sub.n 8-bit data. Because each element of the U, V matrices under YUV 4:2:2 representation is an average value, in the horizontal direction of two neighboring pixels, the 64-value 8.times.8 matrix is assembled from an area of 16 pixel by 8 pixel in the video image. The YUV 4:2:2 representation, as discussed above, may have originated from input data either YUV 4:4:4, RGB, or YUV 4:2:2 formats.

The (Y,U), (Y,V), (Y,U), (Y,V) . . . format for the YUV 4:2:2 format is repacked from 16-bit data sequence Y0U0, Y1V0, Y2U2, Y3V2, . . . Y7V6 to Y0Y1, U0U2, Y2Y3, V0V2 etc.

Similarly, the (Y, LSB's U), (Y, MSB's U), (Y, LSB's V), (Y, MSB's V) . . . format for YUV 4:1:1 format is repacked from 12-bit data sequence Y0U0L, Y1U0H, Y2V0L, Y3V0H, Y4U4L, etc. to 16-bit data sequence Y0Y1, Y2Y3, Y4Y5, U0U4, Y6Y7, V0V4 (for pixels in the even lines of the image) or from 12-bit data sequence Y0V0L, Y1V0H, Y2U0L, Y3U0H, Y4V4L . . . to 16-bit data sequence Y0Y1, Y2Y3, Y4Y5, V0V4, Y6Y7, U0U4 (for pixels in the odd liens of the image).

During decompression, data from the block memory unit 103 are read by VBIU 102 as 16-bit words. The block memory format data are translated into the 24-bits RGB, YUV 4:4:4, or 16-bit 4:2:2, or 12-bit 4:1:1 formats as required. The translation from the 16-bit representation to the various YUV representations is performed by DCT/YUV converter 205. If RGB data is the specified output format, the DCT/YUV converter 205 outputs 24-bit YUV 4:4:4 format data for the RGB/YUV converter 202 to convert into RGB format.

Either the output data of the RGB/YUV converter 202, or the output data of the DCT/YUV converter 205 are selected by multiplexor 208 for output onto data I/O bus 102-2.

Clock circuits in sync. generator 102-1 generate the display timing signals Hsync and Vsync (horizontal synchronization signal and vertical synchronization signal, respectively) if required by the external display. The external memory address generator 207 provides the addresses on address bus 102-3 for loading the video data into an external display's buffer memory, if required. This external memory provides conversion of horizontal line-by-line "natural" video data into 8.times.8 blocks of pixel data for input during compression, and conversion of 8.times.8 blocks output pixel data into horizontal line-by-line output pixel data during decompression using addresses provided by the external memory address generator 207. Hence, the external memory address generator 207 provides compatibility with a wide variety of video equipment.

Structure and Operation of Block Memory Unit

The block memory unit (BMU) 103 assembles the stream of Y U and V interleaved pixel data into 8.times.8 blocks of pixel data of the same type (Y, U, or V).

In addition, BMU 103 acts as a data buffer between the video bus interface unit (VBIU) 102 and the DCT input select unit 104 during data compression and, between VBIU 102 and DCT row/column separator unit 107 during decompression operations.

During data compression, VBIU 102 will output pixels every clock period in the sequence YUYV - - - YUYV - - - , if a 4:2:2 format is required (each Y, U, V is a 16-bit datum containing information of two pixels); or in a sequence of YXYX - - - YUYV - - - , if a 4:1:1 format is used. ("-" indicates no output data from VBIU 102 and "X" indicates output data are of the "don't-care" type.) Since DCT input select unit 104 requires all 64 pixels (8.times.8 matrix) in a block to be available during its two-pass operation, BMU 103 must be able to accumulate a full matrix of 64 pixels of the same kind from VBIU 102 before output data can be made available to DCT input select unit 104.

During data decompression, a reverse operation takes place. The DCT row/column separator 107 outputs 64 pixels of the same kind serially to BMU 103; the pixels are temporarily stored in BMU 103 until four complete matrices of Y type pixels and one complete matrix each of U and V type pixels have been accumulated so that VBIU 102 may reconstitute the required video data for output to an external display device.

FIG. 3 shows a block diagram of BMU 103. BMU 103 consists of two parts: the control circuit 300a, and a memory core 300b. The memory core 300b is divided into three regions: Y.sub.-- region 311, U.sub.-- region 312, and V.sub.-- region 313. Each region stores one specific type of pixel data and may contain several 64 -value blocks. In this embodiment, Y.sub.-- region 311 has a capacity of five blocks and contains Y pixels only. The U.sub.-- region 312 has a capacity of more than one block, but less than two blocks and contains U type pixels only. Similarly, the V.sub.-- region has a capacity of more than one block, but less than two blocks and contains V type pixels only. This arrangement is optimized for 4:1:1 format decompression, with extra storage in each of Y, U, or V type data to allow memory write while allowing a continuous output data stream to VBIU 102. Because data are transferred into and out of the block memory unit 103 at a rate of two values every clock period, a memory structure is constructed using address aliasing (described below) which allows successive read and write operations to the same address. Since data must be output to VBIU 102 in interleaved pixel format, and since data arrive from the DCT units 104-107 in matrices each of elements of the same pixel type (Y, U or V), there are instances when elements of the next U or V matrix arrive before the corresponding elements in the U or V matrix being currently output are provided to VBIU
102. During such time periods, the elements of the next U or V matrix is allocated memory locations not overlapping the current matrix being output. Hence, the physical memory allocated for U, V blocks must necessarily be greater than one block to allow for such situations. In practice, an extra one-quarter of a block is found to be sufficient for the data formats YUV 4:2:2 and YUV 4:1:1 handled in this embodiment. The starting addresses of the regions 311, 312 and 313 are designated 0, 256 and
320 respectively. While the data transaction between BMU 103 and VBIU 102 is in units of pixels, the transaction between BMU 103 and DCT input select 104 or DCT row/column separator 107 is in units of 64-value blocks.

Memory Access Modes in the Block Memory Unit

Another aspect of this embodiment is the aliasing of the memory core addresses in the memory core 300b. Aliasing is the practice of having more than one logical address pointing to the same physical memory location. Although aliasing of memory core addresses is not necessary for the practice of the present invention, address aliasing reduces the physical size of memory core 300b and saves significant chip area by allowing sharing of physical memory locations by two 64-value blocks. This sharing is discussed in detail next.

During compression or decompression operations, data flow from respectively the VBIU 102, through BMU 103 to DCT input select unit 104, or from DCT row/column separator 107, through BMU 103, to VBIU 102. Some pats of a block might have been read and will not be accessed again, while other parts of the block remain to be read. Therefore, the physical locations in the memory core 300b which contain the parts of a block that have been read may be written over before the entire block is completely read. The management of the address mapping to allow reuse of memory locations in this manner is known as address-aliasing or "in-line" memory. In this embodiment, address aliasing logic 310 performs such mapping. A set of six registers 304 to 309
generates the logical address of a datum which is mapped into a physical address by address aliasing logic 310. Accordingly, YW address counter 304, UW address counter 305 and VW address counter 306 provide the logical addresses for a write operation in regions Y.sub.-- region 311, U.sub.-- region 312, and V.sub.-- region respectively. Similarly, YR address counter 307, UR address counter 308 and VR address counter 309 provide the read logical addresses for a read operation is Y.sub.-- region 311, U.sub.-- region 312, and V.sub.-- region 313 respectively.

The address generation logic 300a in BMU 103 mainly consists of a state counter 301, a region counter 302 and the six address counters 304 through 309 described above. Depending upon the format chosen and the mode of operation, the memory core access will follow the pattern:

A. 4:2:2 compression sequence--YUYVRRRR YUYVRRRR

B. 4:1:1 compression sequence--YXYXRRRR YUYVRRRR

C. 4:2:2 decompressions sequence--WWWWYUYV WWWWYUYV

D. 4:1:1 decompression sequence--WWWWYUYV WWWWYUYV

where the Y, U or V in compression sequence indicates a Y, U or V data is written from the VBIU 102 into BMU 103. The "R" in the compression sequence indicates a datum is to be read from BMU 103 to DCT input select unit 104. The Y, U or V in the decompression mode indicates a Y, U or V datum is to be read from BMU 103 into VBIU 102. The "W" in a decompression sequence indicates that a datum is to be written from DCT row/column separator 107 into BMU 103. Because the sequences repeat themselves every 16 clock periods, a 4-bit state counter 301 is sufficient to sequence the operation of the BMU 103.

The region counter 302 is used to indicate which region, among Y.sub.-- region 311, U.sub.-- region 312, and V.sub.-- region 313, the read or write operation is to take place. The region counter 302 output sequences in blocks for the several modes of operation are as follows:

4:2:2 compression: YYUV YYUV

4:1:1 compression: YY--YYUV

4:2:2 decompression: YYUVYYUV

4:1:1 decompression: YY--YYUV

Data Flow in the Discrete Cosine Transform Units

The Discrete Cosine Transform (DCT) function in the embodiment described above in conjunction with FIG. 1 involves five functional units: the block memory unit 103, the DCT input select unit 104, the DCT row storage unit 105, the DCT/IDCT processor 106, and the DCT row/column separator 107. The DCT function is performed in two passes, first in the row direction and then in the column direction.

FIG. 4a shows a data flow diagram of the DCT units. The input video image in a 64-value pixel matrix is first processed two values at a time in the DCT/IDCT processor 106, row by row, shown as the horizontal rows row0-row7 in FIG. 4a. The row-processed data are serially stored temporarily into the DCT row storage unit 105, again two values at a time. The row-processed data are then fed into the DCT/IDCT processor 106 for processing in the column direction col10-col17 in the second pass of the 2-dimensional DCT. The DCT row/column separator 107 streams the row-processed data into the DCT row storage unit 105, and the data after the second pass (i.e., representation in transform space) into the quantizer unit 108.

FIG. 4b shows the data flow schedule of the 4:1:1 data input into the DCT units 103-107 (FIG. 1) under compression mode. In FIG. 4b, the time axis runs from left to right, with each timing mark denoting four clock periods. In the vertical direction, this diagram in FIG. 4b is separated into upper and lower portions, respectively labelled "input data" and "DCT data." The input data portion shows the input data stream under the 4:1:1 format, and the DCT data portion shows the sequence in which data are selected from block memory unit 103 to be processed by the DCT/IDCT processor unit 106.

As described above in conjunction with VBIU 102, under the 4:1:1 YUV data format, the Y data come into the DCT units 103-107 at 8 bits per two clock periods, and the U, V data come in at 4 bits per two clock periods, with "don't-care" type data being sent by VBIU 102 50% of the time. Hence, for a 64-value 8 pixels by 8 pixels matrix, the U and V matrices each requires 512 clock periods to receive; during the same period of time, four 64-value Y matrices are received at DCT units 103-107. This
512-clock period of input data is shown in the top portion of FIG. 4b.

Under compression mode, as described above, the input data are assembled into 8.times.8 matrices of like-type pixels in the block memory unit 103. The DCT input select unit 104 selects alternatively DCT row storage unit 105 and the block memory unit 103 for input data into the DCT/IDCT processor unit 106. The input data sequence into the DCT/IDCT processor 106 is shown in the lower portion of FIG. 4b, marked "DCT data."

In FIG. 4b, first-pass YUV data (from block memory unit 103) coming into the DCT/IDCT processor unit 106 are designated Y.sub.-- row, U.sub.--, and V.sub.-- row, the second-pass data (from DCT row storage unit 105) coming into the DCT/IDCT processor 105 are designated Y.sub.-- col, U.sub.-- col, and V.sub.-- col. Between the time marked 401b and the time marked 403b, the DCT/IDCT processor unit 106 processes first-pass and second-pass data alternately. The first-pass and second-pass data during this period from 401b to 403b are data from a previous 64-value pixel matrix due to the lag time between the input data and the data being processed at DCT units 103-107. Because of the buffering mechanism described above in the block memory unit
103, pixel data coming in between the times marked 401b and 409b in FIG. 4b are stored in the block memory unit 103, while the pixel data stored in the last 512 clock periods are processed in the DCT units 104-107. The data from the last 512 clock periods are processed beginning at time marked 404b, and completes after the first 128 clock periods (identical to time period marked between 401b and 403b) of the next 512 clock periods.

The time period between marks 403b and 404b is "idle" in the DCT/IDCT processor 106 because the pipelines in DCT/IDCT processor unit 106 are optimized for YUV 4:2:2 data. Since the YUV 4:1:1 type data contain only half as much U and V information as contained in YUV 4:2:2 type data, during some clock periods the DCT/IDCT processor unit 106 must wait until a full matrix of 64 values is accumulated in block memory unit 103. In practice, no special mechanism is provided in the DCT/IDCT processor unit 106 for waiting on the input data. The output data of DCT/IDCT processor unit 106 during this period are simply discarded by the zero packer/unpacker unit 110 according to its control sequence. The control structures for DCT input select unit 104 and DCT row/column separator units 107 will be discussed in detail below.

FIG. 4c shows the data flow schedule for YUV 4:2:2 type data under compression mode. Under this input data format, as discussed above, an 8-bit U or V type value is received at the DCT units 103-107 every two clock periods; so that it requires
256 clock periods to receive both 64 8-bit U and V matrices. During this 256-cycle period, two 64-value Y matrices are received at DCT units 103-107. This 256-clock period is shown in FIG. 4c. There are no idle cycles under the YUV 4:2:2 type data. Again, because of the buffering scheme in the block memory unit 103, the DCT/IDCT processor 106 processes the data from the last 256-clock period, while the current incoming data are being buffered at the block memory unit 103.

Under decompression, the basic input data pattern to the DCT units 103-107 are: a) under YUV 4:1:1 format, two 64 16-bit values Y matrices, followed by the U and V matrices of 64 16-bit values each, and then two 64 16-bit values Y matrices; b) under YUV 4:2:2 format, two 64 16-bit values Y matrices, followed by the first U and V matrices of 64 16-bit values each, and then two 64 16-bit values Y matrices, followed by the second U and V matrices.

FIG. 4d shows the data flow schedule for the YUV 4:1:1 data format under decompression mode.

Since the decompression operation is substantially the reverse of the compression operation, the input data stream for decompression comes from the quantizer unit 108. The DCT input select unit 104, hence, alternately selects input data between DCT row storage unit 105 and the quantizer unit 108. Since the data stream must synchronize with timing of the external display, idle periods analogous to the period between the times marked 403b and 404b in FIG. 4b are present. An example of an idle period under YUV 4:1:1 format is the period between 404d and 405d in FIG. 4d. Instead of .sub.-- row and .sub.-- col designation under compression mode, FIGS. 4d uses .sub.-- 1st and .sub.-- 2nd designation to highlight that the data being processed in the DCT/IDCT units 103-107 are values in the transform (frequency) domain.

Similarly, FIG. 4e shows the data flow schedule for the YUV 4:2:2 data format under decompression. Again, because the design in the DCT/IDCT processor 106 is optimized for YUV 4:2:2 data, there are no idle cycles for data in this input format.

Structure and Operation of the DCT Input Select Unit

The implementation of the DCT input select unit 104 is next described in conjunction with FIGS. 5a, 5b and 5c.

The DCT Input Select Unit directs two streams of pixel data into the DCT/IDCT processor unit 106. The first stream of pixel data is the first-pass pixel data from either DCT block memory unit 103 or quantizer 108, dependent upon whether compression or decompression is required. This first stream of pixel data is designated for the first-pass of DCT or IDCT. The second stream of pixel data is streamed from the DCT row storage unit 105; the second stream of pixel data represents intermediate results of the first-pass DCT or IDCT. This second stream of pixel data needs to be further processed in a second-pass of the DCT or IDCT. By having the same DCT/IDCT processor unit 106 to perform the two passes of DCT or IDCT, utilization of resource is maximized. The DCT Input Select Unit 104 provides continuous input data stream into the DCT/IDCT processor unit 106 without idle cycle under YUV 4:2:2 format.

FIG. 5a is a schematic diagram of the DCT input select unit 104. As discussed above, the DCT input select unit 104 takes input data alternately from the quantizer unit 108 and DCT row storage unit 105 during decompression. During compression, input data to the DCT input select unit 104 are taken alternately from the block memory unit 103 and the DCT row storage unit 105.

During compression, when input data are taken from the block memory unit 103, two streams of 8-bit input data are presented on the 518a and 518b data busses. As shown in FIG. 5a, these two streams of data are then latched successively into one pair of the four pairs of latches (top-bot): 501c and 505c, 502c and 506c, 503c and 507c, 504c and 508c by the control signals blk.sub.-- load 4, blk.sub.-- load 5, blk.sub.-- load 6, and blk.sub.-- load 7 respectively. Each pair of latches consists of a top latch and a bottom ("bot") latch. The control signal (e.g. blk.sub.-- load7) associated with a latch pair loads both the top and bottom latches. Latches 501c to 508c temporarily store data so that this can be properly sequenced into the DCT unit
106.

A set of four 2-to-1 8-bit multiplexors 512c, 513c, 514c and 515c (called block multiplexors) each selects either the top or bottom output datum from one of the four pairs of latches 501c-505c, 502c-506c, 503c-507c and 504c-508c, for input to another set of four 2:1 multiplexors 516a, 516b, 516c, and 516d (called block/quantizer multiplexors). The output datum selected by the block multiplexors from the pairs of latches 501c-505c and 502c-506c are denoted "block top data", and the output data selected from the pair of latches 503c-507c and 504c-508c are denoted "block bot data". The block/quantizer multiplexors 516a-d are 16-bit wide, and select between the output data of block multiplexors 512c to 515c, and the quantizer multiplexors
511a and 511b, in a manner to be discussed below.

During compression, the bloc/quantizer multiplexors 516-d are set to select the output data of the block multiplexors 512c to 515c, since there is no output from the quantizer 108. The output data of the block/quantizer multiplexors 516a and
516c are denoted "block/quantizer top data"; being selected between block top data and quantizer top data (selected by multiplexer 511a, discussed below); the output data of the block/quantizer multiplexors 516b and 516d are denoted "block/quantizer bot data", being selected between block bot data and quantizer bot data (selected by multiplexor 511b, discussed below). Since the block multiplexors 512c-515c are each 8-bit wide, eight zero bits are appended to the least significant bits of each output datum of the block multiplexors 512c-515c to form a 16-bit word at the block/quantizer multiplexors 516a-d. The most significant bit of this 16-bit word is inverted to offset the resulting value by -2.sup.15, to obtain a value in the appropriate range suitable for subsequent computation.

Two streams of input data, each 16-bit wide, are taken from the DCT row storage unit 105. The data flow path of the DCT row data in DCT row storage unit 105 to the DCT/IDCT processor unit 106 is very similar to the data flow path of the input data from the block memory storage unit 103 to the DCT/IDCT processor unit 106 described above. Four pairs of latches (top-bot): 501d-505d, 502d-506d, 503d-507d, and 504d-508d are controlled by control signals row.sub.-- load 0, row.sub.-- load 1, row.sub.-- load 2, and row.sub.-- load 3 respectively. A set of four 4:1 multiplexors 512d, 513d, 514d and 515d (called DCT row multiplexors) selects the output data (called DCT row top data) of two latches from the two pairs controlled by signals row.sub.-- load0 and row.sub.-- load1 (i.e. the two pairs 501d--505d and 502d--506d), and the output data (called DCT row bot data) of two latches from the two pairs controlled by signals row.sub.-- load 2 and row.sub.-- load 3 (i.e. the two pairs
503d-507d, and 504d-508d).

During decompression, as discussed above, data into the DCT/IDCT processor unit 106 (FIG. 1) are taken alternately from the DCT row storage unit 105 and the quantizer 108. Hence, during decompression, the block/quantizer multiplexors (516a-d) are set to select from the quantizer multiplexors (511a-b), rather than the block multiplexors.

A single stream of 16-bit data flows from the quantizer unit 108 (FIG. 1) on the bus 519. A 16-bit datum can be latched into any one of 16 latches assigned in two banks: 501a-508a (bank 0), or 501b-508b (bank1), each latch is controlled by one of the control signals load0-load15. A set of four 4:1 multiplexors: 509a (called quantizer bank 0 top multiplexor), 510a (called quantizer bank 0 bot multiplexor), 509b (called quantizer bank 1 top multiplexor), and 510b (called quantizer bank 1 bot multiplexor) selects four data items, each from a separate group of four latches in response to signals to be described later. Quantizer bank 0 top multiplexor 509a selects one output datum from the latches 501a502a, 505a, and 506a. Quantizer bank 0
bot multiplexor 510a selects one output datum from the latches 503a, 504a, 507a and 508a. Quantizer bank 1 top multiplexor 509b selects one output datum from the latches 501b, 502b, 505b, and 506b. Quantizer bank 1 bot multiplexor 510b selects one output datum from the latches 503b, 504b, 507b, and 508b.

A set of two 2:1 multiplexors 511a and 511b (quantizer multiplexors) then selects a quantizer top data item and a quantizer bot data item respectively. Quantizer top data item is selected from the output data items of the quantizer bank 0 and bank 1 top data items (output data of multiplexors 509a and 509b) ; and likewise, quantizer bot data item is selected from the output data items of the quantizer bank 0 and bank 1 bot data items (output data of multiplexors 510a and 510b). The quantizer top and bot data items are provided at the block/quantizer multiplexors 516a-516d, which are set to select the quantizer top and bot data items (output data of multiplexors 511a and 511b) during decompression.

Finally, a set of four 2:1 multiplexors 517a-d selects between the DCT row top and bot data (output data of multiplexors 512d-515d) and the block/quantizer top and bot data (output data of multiplexors 516a-516d) to provide the input data into the DCT/IDCT processor unit 106 (FIG. 1). Multiplexor 517a selects between one set of block/quantizer multiplexor top data 516a and DCT row storage top data 514d to provide "A" register top data 517a; multiplexor 517c selects from the other set of block/quantizer multiplexor top data 516c and row storage top data 512d to provide "B" register top data. The two sets of quantizer multiplexor top data 516b and 516d and DCT storage bot data 515d and 513d provide the "A" register bot data 517b, and "B" register bot data 517d, respectively.

Operation of DCT Input Select Unit During Compression

Having described the structure of DCT input select unit 104, the operation of the DCT input select unit 104 is next discussed.

FIG. 5b shows the control signal and data flow of the DCT input select unit 104 during compression mode. The DCT input select unit 104 can be viewed as having sixteen internal states sequenced by the sixteen successive clock periods. FIG. 5b shows sixteen clock periods, corresponding to one cycle through the sixteen internal states. For compression mode, the internal states of the DCT units 104-107 for clock periods 0 through 7 are identical to the internal states of the DCT units 104-107
for clock periods 8 through 15. FIG. 5b shows the operations of the DCT input select unit 104 (FIG. 1) with respect to one row of data from the DCT row storage unit 105 and one row of input data from the block memory unit 103.

The first four clock periods illustrated (i.e. clock periods 0, 1, 2 and 3) are the loading phase of data on busses 518c and 518d into the latches 501d-508d from the DCT row storage unit 105. These first four clock periods are also the processing phase of the data from the block memory unit 103 loaded into latches 501c-508c in the last four clock periods. The processing of the block memory data stored in latches 501c-508c will be described below using an example, in conjunction with discussion of clock periods 8 through 11, after the loading of block memory data from block memory unit 103 is discussed in conjunction with clock periods 4 through 7.

During the first four clock periods (0-3), a row of data from DCT row storage unit 105 is loaded in the order Y(0), Y(1) . . . Y(7) in pairs of two into latch pairs 501d-505d, 502d-506d, 503d-407d and 504d-508d by successive assertion of control signals row.sub.-- load 0 through row.sub.-- load 3.

In the next our clock periods 4 through 7, the DCT input select unit 104 (FIG. 1) forwards to the DCT/IDCT processor 106 the data loaded from the DCT row storage unit 105 in the last four clock periods 0-3, and at the same time, loads data from the block memory unit 103. The multiplexors 517a through 517d are set to select DCT row storage data in latches 501d-508d. The DCT row storage multiplexors 512d through 515d are activated in the next four clock periods to select, at clock period 4 and
5 elements Y(2) and Y(5) to appear as output data of multiplexors 517a and 517b respectively ("A" register top and bot multiplexors), and Y(1) and Y(6) to appear as output data of 517c and 517d ("B" register top and bot multiplexors) respectively. At clock periods 6 and 7, Y(3) and Y(4) appear as the output data of multiplexors 517a and 517b respectively, and Y*0) and Y(7) appear as output data of multiplexors 517c and 517d respectively. During this time, multiplexors 517a through 517d are selecting DCT row storage data in latches 501d-508d.

During clock periods 4 through 7, a row of block memory data x(0) x(1) . . . x(7) are latched into latches 501c through 508c by control signals blk.sub.-- load 4 through blk.sub.-- load7 in the same manner as the latching of DCT row storage data into latches 501d-508d during clock periods 0 through 3.

During the next four clock periods 8 thorough 11, the DCT input select unit 104 is successively in the same states as it is during clock periods 0 through 3; namely, loading from DCT row storage unit 105 and forwarding to DCT/IDCT processor unit
106 the data X(0) . . . x(7) loaded in latches 501c-508c from block memory unit 103 during the last four clock periods 4-7.

In clock periods 8 through 11, multiplexors 517a through 517d select data from the block/quantizer multiplexors 516a through 516d, which in turn are set to select data from the block memory multiplexors 512c through 515c. The block memory multiplexors 512c through 515c are set such that during clock periods 8 through 9, x(2) and x(5) are available at multiplexors 517a and 517b, respectively; and during the same clock periods 8 through 9, x(1) and x(6) are available at multiplexors 517c and 517d respectively.

Operation of DCT Input Select Unit During Decompression

The operation of DCT input select unit 104 during decompression mode is next discussed in conjunction with FIG. 5c.

FIG. 5c shows the control and data flow of the DCT input select unit 104 during decompression mode. As mentioned above, the DCT input select unit 104 may be viewed as having 16 internal states. As shown in FIG. 5c, during the 16 clock periods 0
to 15, two rows of data from DCT row storage unit 105 (clock periods 0-3 and 8-11) and two columns of data from the quantizer unit 108 are forwarded as input data to the DCT/IDCT processor unit 106 (clock periods 0-15).

As shown in FIG. 5c, a continuous stream of 16-bit data is provided by the quantizer unit 108 to the DCT input select unit 104 at one datum per clock period. A double-buffering scheme provides that when latches in bank 0 (latches 501a through
508a) are being loaded, the data in bank 1 (latches 501b through 508b) are being selected for input to the DCT/IDCT processor unit 106. The latches are loaded, beginning at 501a through 508a in bank 0 by control signals load0 through load7 respectively (at clock periods 0 through 7), and then switching over to bank 1 to load latches 501b through 508b by control signals load8 through load15 respectively (clock periods 8 through 15). During clock periods 8 through 11, while bank 1 is being loaded, the data in bank 0 x(0) . . . x(7) (loaded during clock periods 0 through 7) are being selected for input into the DCT/IDCT processor unit 106. The order of selection is shown in FIG. 5c in the sequence (top-bot): x(1 )-x(7) in clock period 8, x(3)-x(5) in clock period 9, x(2)-x(6) for clock period 10, and x(0-x(4) in clock period 11. The same top data appear in both DCT "A" register top data and DCT "B" register top data. The bot data for the bot registers of "A" and "B" are the same as well. During clock periods 0 through 3 in the four clock periods following clock period 15 shown in FIG. 5c (analogous to clock periods 0 through 3 shown), the new data in latches 501b through 508b are selected in similar order for input to the DCT/IDCT processor unit 106.

Loading and processing of the data from the DCT row storage unit 105 follow the same pattern as in the compression mode: i.e. four clock periods during which the latch pairs in 501d through 508d are loaded by control signals row.sub.-- load0
through row.sub.-- load3 respectively at one pair of two 16-bit data per clock period. (The latches pair are 501d-505d, 502d-506d, 503d-507d and 504d-508d). For example, during clock periods 0 through 3, the latches are loaded with a row of 16-bit data Y(0) . . . Y(7) from DCT row storage. In the next four clock periods, 4 through 7, 16-bit data Y(0) . . . Y(7) in the latches 501d through 508d are provided as input to DCT/IDCT processor unit 106 in the sequence ("A" register top, "A" register bot, "B" register top, "B" register bot): (Y(7), Y(1), Y(7)), at clock period 4, (Y(3 ), Y(5), Y(3), Y(5)) at clock period 5, (Y(2), Y(6), Y(2), Y(6)) at clock period 6, and (Y(0), Y(4), Y(0), Y(4)) at clock period 7.

Analogous loading and processing phases are provided at clock periods 8 through 15. Data in the latches 501d through 508d (DCT row storage data) are alternately selected every 4 clock periods with the data from the quantizer unit 108 for input to DCT/IDCT processor unit 106. For example, during clock periods 0 through 3, and 8 through 11, data from the quantizer unit 108 is provided for input to DCT/IDCT processor unit 106 and during clock periods 4 through 7, and 12 thorough 15, DCT row storage data are provided for input to DCT/IDCT processor unit 106.

Structure and Operation of the DCT Row Storage Unit

The structure and operation of DCT row storage unit 105 (FIG. 1) is next described in conjunction with FIGS. 6a-c.

FIG. 6a is a schematic diagram of the DCT row storage unit 105.

The storage in DCT row storage unit 105 is implemented by two 32.times.16-bit static random access memory (SRAM) arrays 609 and 610, organized as "even" and "odd" planes. 2:1 multiplexors 611 and 612 forward to DCT input select unit 104 the output data read respectively from the odd and even planes of the memory arrays 609 and 610.

Configuration register 608 contains configuration information, such as latency value (for either compression or decompression) to synchronize output from the DCT row/column separator into DCT row storage 105, so that, according to the configuration information in the configuration register 608, the address generator 607 generates a sequence of addresses for the SRAM arrays 610 and 609.

The memory arrays 609 and 610 can be read or written by a host computer via the bus 115 (FIG. 6a). 2:1 multiplexors 605, 606 select the input address provided by the host computer on bus 613 when the host computer requests access to SRAM arrays
609 and 610.

Incoming data from the DCT row/column separator unit 107 arrive at DCT row storage unit 105 on two 16-bit buses 618 and 619. As described above, a host computer may also write into the SRAM arrays 609 and 610. The data from the host computer are latched into the SRAM arrays 609 and 610 from the 16-bit BUS 615. Alternatively, a set of 2:1 multiplexors 601-604 multiplex the data from DCT/IDCT processor unit 106 on buses 618, 619 to be written into either SRAM array 609 or 610 according to the memory access schemes to be described below.

Two 16-bit outgoing data words are placed on busses 616 and 617, transmitting to output data from the SRAM arrays 610 and 609, respectively. 2:1 multiplexors 611 and 612 select the data on busses 616 or 617 to place on busses 626 and 627, two
16-bit data words per clock period, in the order required by the DCT/IDCT algorithms implemented in the DCT/IDCT processor unit 106, already described in conjunction with DCT input select unit 104.

Alternatively, output data from the SRAM arrays 609 and 610 on busses 616 and 617 may be output on bus 614 under direction of a host computer (not shown). The host computer (not shown) would be connected onto host bus 115 as described in the IEEE standard attached hereto as Appendix B.

The In-Line Memory of the DCT Row Storage Unit

Because two 16-bit values are written into or read from DCT row storage unit 105 per clock period, and because of the order in which DCT or IDCT first-pass data is accessed, an efficient scheme of reading and writing the SRAM arrays 609 and 610
is provided, such that the same memory locations may be written into with a row of data in the incoming 8.times.8 matrix after a column of data is read from the last 8.times.8 matrix. In this manner, an "in-line" memory access scheme is implemented, which requires 50% less storage than a comparable double-buffering scheme.

In order to achieve the "in-line" memory advantage, the SRAM arrays 609 and 610 are written and read under the "horizontal" and "vertical" access pattern alternately. Memory maps (called "write patterns") are shown in FIG. 6b and 6c for the horizontal and vertical access patterns respectively.

FIG. 6b shows the content of the SRAM arrays 609 and 610 with an 8.times.8 first pass result matrix completely written. For example, even and odd portions of logical memory location 0, 0e and 0o, contain elements respectively X0(0) and X0(1) of row X0; 0e and 0o correspond to address 0 in the E-plane (SRAM array 609) and O-plane (SRAM array 610) respectively. Because of their independent input and output capabilities, an E-plane datum and an O-plane datum may be accessed simultaneously during the same clock period. There are 32 memory locations in each of the E-plane and O-plane of the SRAM arrays 609 and 610; the "e" addresses are found in the E-plane, and the "o" addresses are found in the O-plane. Thus a total of 64 data words can be stored in the even and odd plane taken together.

During compression, the use of the words "row" and "column" refer to the rows and columns of the pixel matrix, while during decompression, "rows" and "columns" refer to the "rows" and "columns" of the frequency matrix.

During any clock period, either two 16-bit data arrive from DCT row/column separator unit 107 on busses 618 and 619 (input mode), or two 16-bit data go to the DCT input select unit 104 via busses 626 and 627 (output mode). The period of horizontal access pattern consists of 64 clock periods, during which there are eight (8) cycles each of four clock periods of read memory access followed by four clock periods of write memory access. In the horizontal access pattern, during compression, the outgoing data are provided to DCT input selected unit 104 column by column "horizontally," and the incoming data are written into the SRAM arrays 609 and 610 row by row "horizontally." During decompression, the outgoing data are provided to DCT input select unit 104 row by row horizontally, and the incoming data are written column by column horizontally.

The following description is based on the data flow during compression only. During decompression, the incoming data into the DCT row storage unit 105 are columns of a matrix and the outgoing data into DCT input select unit 104 are rows of a matrix, but the principles of horizontal and vertical accesses are the same.

FIG. 6b shows a 8.times.8 matrix X with rows X0-X7 completely written horizontally into the SRAM arrays 609 and 610. FIG. 6b is the map of SRAM arrays 609 and 610 at the instant in time after the last two 16-bit data from the previous matrix are read, and the last two 16-bit data of the current matrix X(X7(6) and X7(7) are written into the SRAM arrays 609 and 610.

Because the second pass of the 2-dimensional DCT requires data to be read in pairs, and in column order, i.e. in the order X0(0)-X1(0), X2(0)-X3(0), . . . X6(0)-X7(0), X0(1)-X1(1) . . . X6(7-X7(7), after a column (for example, X0(0), X1(0) . . . X7(0), is read, the memory location Oe, 4o, 8e, 12o, . . . , 28o previously occupied by the column X9(0) . . . X7(0) are now available for storage of the incoming row y0 with elements Y0(0) . . . Y0(7).

After the first column X0(0. . . X7(0is read and replaced by row Y0(0) . . . Y0(7), the second column X0(1) . . . X7(1) is read and replaced by row Y1(0) . . . Y1(7). This process is repeated until all of matrix X is read and replaced by all of matrix Y, as shown in FIG. 6c. Since during this period, data are read and written "vertically," this access pattern is called vertical access pattern.

The output of matrix Y will be column by column to DCT input select unit 104. Because these columns are located "horizontally" in the SRAM array 609 and 610, the writing of the next incoming matrix row by row will be horizontally also, i.e., to constitute the horizontal access pattern.

In order to allow data to be written vertically and accessed horizontally, or vice versa, each row's first element, e.g., X0(0), X1(0) etc. must be alternately written in the E-plane and O-plane, as shown in FIGS. 6b and 6c, since adjacent 16-bit data in the same column must be accessed in pairs at the same time.

In this manner, an "in-line" memory is implemented resulting in a 50% saving of storage space over a double buffering scheme.

Structure and Operation of the DCT/IDCT Processor Unit

Input data for the DCT/IDCT processor unit 106 are selected by the multiplexors 517a through 517d in the DCT input select unit 104. The input data to the DCT/IDCT processor 106 are four 16-bit words latched by the latches 701t and 701b (FIG.
7a). The DCT/IDCT processor unit 106 calculates the discrete cosine transform or DCT during compression mode, and calculates the inverse discrete cosine transform IDCT during decompression mode.

According to the present invention, the DCT and IDCT algorithms are implemented as two eight-stage pipelines, in accordance with the flow diagrams in FIGS. 7b and 7e. During compression the flow diagram in FIG. 7b is the same as FIG 15d, except for the last multiplication step involving g[0], h[0] . . . i[0 ] (FIG. 15d). Because the quantization step involves a multiplication, the last multiplication of the DCT is deferred to be performed with the quantization step in the quantizer 108, i.e., the quantization coefficient actually employed is the product of the default JPEG standard quantization coefficient and the two deferred DCT multiplicands, one from each pass through the DCT/IDCT processor unit 106. During IDCT, multiplicands are premultiplied in the dequantization step. This deferment or premultiplication is possible because during DCT, all elements in a column have the same scale factor, and during IDCT all elements in a row have the same scale factor. By deferring these multiplication steps until the quantization step, two multiplies per pixel are saved. In the flow diagrams of FIGS. 7b and 7e, input data flows from left to right. A circle indicates a latch or register, and a line joining a left circle with a right circle indicates an arithmetic operation performed as a datum flow from the left latch (previous stage) to the right latch (next stage). A constant placed on a line joining a left latch to a right latch indicates that the value of the datum at the left latch is scaled (multiplied) by the constant as the datum flows to the right latch; otherwise, if no constant appears on the joining line, the datum on the left latch is not scaled. For example, in FIG. 7b, r3 in stage 6 is derived by having p3 scaled by 2cos(pi/4), and r2 is derived by having p2 scaled by 1 (unscaled). A latch having more than one line converging on it, and each line originating from the left, indicates summation at the right latch of the values in each originating left latch, and according to the sign shown on the line. For example, in FIG. 7b, y5 is the sum of x(3) and -x(4).

As shown in FIG. 7b, for the forward transform (DCT) algorithm, between stages 1 and 2 is a shuffle-and-add network, with each datum at stage 2 involving exactly two values from stage 1. Between the stages 2 and 3 are scaling operations involving either constants 1 or 2cos(pi/4). Stage 4 is either an unscaled stage 3 or a shuffle-and-add requiring a value at stage 2 and a value at stage 3. Between stages 4 and 5 is another shuffle-and-add network, and again each datum at stage 5 is the result of exactly two data items at stage 4. Stage 6 is a scaled version of stage 5, involving scaling constants 2cos(pi/4), 2cos(pi/8), 2cos(3pi/8) and 1. Stage 7 data are composed of scaled stage 6 data and summations requiring reference to stage
5 data. Finally, between stage 8 and stage 7 is another shuffle-and-add network, each datum at stage 8 is the result of summation of two data items at stage 7.

According to the present invention as shown in FIG. 7e, the algorithm for the inverse transform (IDCT) follows closely an 8-stage flow network as in the forward transform, except that scaling between stages 2 and 3 involves additionally the constants 2cos(pi/8) and 2cos(3pi/8), and the shuffle-and-add results at stages 4 and 7 involve values from their respective immediately previous stage, rather than requiring reference to two stages. Hence, with accommodation for the differences noted in the above, it is feasible to implement the forward and inverse algorithms with the same 8-stage processor.

Because no shuffle-and-add in the data flow involves more than two values from the previous stage, these algorithms may be implemented in two 8-stage pipelines with cross-over points where shuffle-and-add operations are required.

FIG. 7a shows the hardware implementation of the flow diagrams in FIGS. 15Zd and 15e derived above in the discussion of filter implementation. The two 8-stage pipelines shown in FIG. 7a implement, during compression, the filter tree of FIG. 15b in the following manner: operations between stages 1 and 2 implement the first level filters 1501 and 1502; operations between stages 2-8 implement the second level filters 1503-1506; and, between stages 5-8 implement the third level filters 1507-1514. As explained above, the operation of each of the filters 1515-1530 corresponds to the last multiplication step in each pixel. This last multiplication step is performed inside the quantizer 108 (FIG. 1).

The DCT/IDCT processor unit 106 is implemented by two data paths 700a and 700b, shown respectively in the upper and lower portions of FIG. 7a. Data may be transferred from one data path to the other via multiplexors such as 709, 711t, 722t,
722b, 7321t, or 733t. Adders 735t and 735b also combine input data from one data path with input data in the other data path. Control signals in the data path are data-independent, providing proper sequencing of data in accordance with the DCT or IDCT algorithms shown in FIGS. 7b and 7e. All operations in the DCT/IDCT processor 106 shown in FIG. 7a involve 16-bit data. Adders in the DCT/IDCT processor unit 106 perform both additions and subtractions.

The two pairs of 16-bit input data are first latched into latches 701t ("A" register) and 701b ("B" Register). The adders 702t and 702b combine the respective 16-bit data in the A and B registers. The "A" and "B" latches each holds two 16-bit data words. The A and B registers are the stage 1 latches shown in FIGS. 7b and 7e. The results of the additions in adders 702t and 702b are latched respectively into the latches 703t and 703b (stage 2 latches). The datum in latch 703t is simultaneously latched by latch 707t, and multiplied by multiplier 706 with a constant stored in latch 705, which is selected by multiplexor 704. The constant in latch 705 is either 1, 2cos(pi/4), 2cos(3pi/8) or 2cos(pi/8). The result of the multiplication is latched in latch 708t (a stage 3 latch).

Alternatively, the datum in latch 703t may be latched by latch 707t to be then selected by multiplexor 709 for transferring the datum into data path 700b. 2:1 Multiplexor 709 may alternatively select the datum in latch 708t for the transfer. The datum in 703b is delayed by latch 707b before being latched into 708b (a stage 3 latch). This datum in 708b may either be added in adder 710 to the datum selected from the data path 700a by multiplexor 709 and then latched into latch 712b through multiplexor 711b or be passed into data path 700a through 2:1 multiplexor 711t and be latched by latch 712t (a stage 4 latch), or be directly latched into 712b (a stage 4 latch) through multiplexor 711b.

The datum in latch 708t may be selected by multiplexer 711t to be latched into latch 712t, or as indicated above, passed into data path 700b through multiplexor 709. The data in latches 712t and 712b may each pass over to the opposite data path,
700b and 700 a respectively, selected by 2:1 multiplexors 713t and 713b into latches 714t or 714b respectively. Alternatively, the data in latches 712t and 712b may be latched in their respective data path 700a and 700b into latches 714t or 714b through multiplexors 713t and 713b.

A series of latches, 715t through 720t in data path 700a, and 715b to 719b in data path 700b, are provided for temporary storage. Data in these latches are advanced one latch every clock cycle, with the content of latches 720t and 719b discarded, as data in 719t and 718b advance into latches 720t and 719b. In data path 700a, the 5:1 multiplexor 721t may select any one of the data in the latches 715t through 718t, or from 714t, as an input operand of adder 723t. 5:1 multiplexor 722t selects a datum in any one of 714t, 716t through 718t or 720t as an input operand into adder 723b in data path 700b. Similarly, in data path 700b, 3:1 multiplexor 722b selects from latches 716b, 717b, and 719b an input operand into adder 723 t in data path 700a. 5:1 multiplexor 721b selects one datum from the latches 715b through 719b, as an input operand to adder 723b.

The results of the summations in adders 723t and 723b are latched into latches 724t and 724b (stage 5 latches) respectively. The datum in latch 724t may be multiplied by multiplier 727 to a constant in latch 726, which is selected by 4:1
multiplexor 725, from among the constants 1, 2cos(ps/8, 2cos(3pi/8, or 2cos(pi/4). Alternatively, the datum in latch 724t may be latched into latch 730 after a delay at latch 728t. The result of the multiplication is stored in latch 729t (a stage 6
latch). The 2:1 multiplexor 731t may channel either the datum in latch 729t or in latch 730 as an input operand of adder 732 in data path 700b. The datum in latch 729t can also be passed to latch 734t (a stage 7 latch) through 2:1 multiplexor 733t.

The datum in latch 724b is passed to latch 728b, which is then either passed to adder 732 through 2:1 multiplexor 731b, to be added to the datum selected by 2:1 multiplexor 731t, or passed to latch 729b (a stage 6 latch). The datum in latch 729b may be passed to data path 700a by 2:1 multiplexor 733t, or passed as operand to adder 732 through 2:1 multiplexor 731b, to be added to the datum selected by 2:1 multiplexor 731t, or be passed to latch 734b (stage 7 latch through 2:1 multiplexor 733b.

Adders 735t and 735b each add the data in latches 734t and 734b, and deliver the results of the summation to latches 736t and 736b (both stage 8 latches) respectively. The data in latches 736t and 736b leave the DCT/IDCT processor 106 through latches 738t and 738b respectively, after one clock delay at latches 737t and 737b respectively.

Multipliers 706 and 727 each require two clock periods to complete a multiplication. Each multiplier is provided an internal latch for storage of an intermediate result at the end of the first clock period, so that the input multiplicand need only be stable during the first clock period at the input terminals of the multiplier. Both during compression and decompression, every four clock periods a new row or a column of data (eight values) are supplied to the DCT/IDCT Processor Unit 106 two values at a time. Hence, the control signals inside the DCT/IDCT Processor Unit 106 repeats every four clock periods.

Operation of DCT/IDCT Processor Unit During Compression

Having described the structure of the DCT/IDCT processor unit 106, the algorithms implements are next described in conjunction with FIGS. 7b, 7c and 7d for compression mode, and in conjunction with FIGS. 7e, 7f and 7g for decompression mode.

The DCT/IDCT processor unit 106 calculates a 1-dimensional discrete cosine transform for one row (eight values) of pixel data during compression, and calculates a 1-dimensional inverse discrete cosine transform for one column (eight values) of pixel data during decompression.

FIG. 7b is a flow diagram representation of the DCT algorithm for a row of input data during compression mode. FIG. 7c shows the implementation of the DCT algorithm shown in FIG. 7b in accordance with the present invention. FIG. 7d shows the timing of the control signals for implementing the algorithm as illustrated in FIG. 7b.

The input data entering the DCT/IDCT processor 106 (FIG. 1) are either selected from the block memory unit 103, or from DCT row storage unit 105; the sequence in which a row of data from either source is presented to the DCT/IDCT processor 106 is described above in conjunction with the description of DCT input select unit 104.

Accordingly, at clock period 0, elements x(2) and x(5) are latched into latch 701t, and elements x(1) and x(6) are latched into latch 702b.

At the next clock period 1, the results of the sum y3=x(2)+x(5), and the difference y7=x(1)-x(6), are latched into latches 703t and 703b respectively.

At clock period 2, elements x(3and x(4), x(0) and x(7) are latched into latches 701t and 701b respectively. At the same time, data y3 and y7 are advanced to latches 707t and 707b, y3 and y7 are replaced at latches 703t and 703b by the difference y6=x(2)-x(5), and the sum y2=x(1)+x(6) respectively.

At clock period 3, data y3 and y7 are advanced to latches 708t and 708b as data w3 and w7 respectively. At the same time, data y6 and y2 are advanced to latches 707t and 707b. Latches 703t and 703b now contains respectively, the sum y4=x(3)+x(4), and the difference y8=x(0)-x(7), resulting from operations at adders 702t and 702b respectively.

At clock period 4, data y4 and y8 advance to latches 707t and 707b, while latches 703t and 703b now contain the difference y5=x(3)-x(4), and the sum y1=x(0)+x(7). Multiplier 706 multiplies constant 2cos(pi/4) to datum y6 to form datum w6 to be latched by latch 708t, and datum y2 advances to latch 708b as w2. Datum w3 advances to latch 712t and is renamed z3. At the same time, the difference z7 =w7-y6 is latched into 712b.

It should be noted that the data is continuously being brought into the DCT/IDCT processor unit 106. Although FIG. 7c, and likewise FIG. 7f, shows no data for clock periods 4-16 residing in latches 701t and 701b, it is so shown for clear presentation to the reader. In fact, a new row or column (eight values) is brought into the DCT/IDCT processor 105 every four clock cycles. These rows or columns are alternatively selected from either DCT row storage unit 105 or block memory unit 103. For example, if the data brought into DCT/IDCT processor unit 106 during clock periods 0-3 are selected from block memory unit 103, the data brought into DCT/IDCT processor unit 106 during clock period 4-7 is from the DCT row storage unit 105. In other words, the pipelines are always filled.

At clock period 5, data y5 and y1 advance to 707t and 707b; data y4 and y8 advance to latches 708t and 708b to become w4 and w8 respectively; data z3 and z7 advance to latches 714t and 714b respectively; and, data w6 and w2 advance to latches
712t and 712b respectively to become z6 and z2.

At clock period 6, data z3 and z7 advance to latches 715t and 715b respectively; data z6 and z2 advance to latches 714t and 714b respectively; datum w4 advance to latch 712t and becomes z4, and z8=w8-y5 is latched into 712b as a result of subtraction at adder 710. At the same time, datum y1 is latched at latch 708b as w1, datum y5 has completed multiplication at multiplier 706 with the constant 2cos(pi/4) and latched at latch 708t.

At clock period 7, all data advance to the next latch in their respective data paths, to result in data z4, z6 and z3 in latches 714t, 715t and 716t respectively, and z8, z2 and z7 in latches 714b, 715b, and 716b respectively. The data w5 and w1
advance to latches 712t and 712b as data z5 and z1 respectively.

At clock period 8, all data advance one latch in their respective data path, so that data z1 thorough z8 are each stored in one of the temporary latches 714t through 720t in the 700 a data pat