United States Patent 6097842
Suzuki, et al. August 1, 2000

Title

Picture encoding and/or decoding apparatus and method for providing scalability of a video object whose position changes with time and a recording medium having the same recorded thereon

Abstract

An apparatus and method for obtaining scalability of a video object (VO) whose position and/or size changes with time. The position of an upper layer picture and that of a lower layer picture in an absolute coordinate system are determined so that corresponding pixels in an enlarged picture and in the upper layer picture may be arranged at the same positions in the absolute coordinate system.


Inventors: Suzuki; Teruhiko (Chiba, JP), Yagasaki; Yoichi (Kanagawa, JP)
Assignee: Sony Corporation (Tokyo, JP)
Appl. No.: 924778
Filed: September 5, 1997
Foreign Application Priority Data

Sep 09, 1996 [JP] 8-260312
Sep 20, 1996 [JP] 8-271512

Current U.S. Class: 382/232
Field of Search: 348/403,404,405,412,413,414,415,390,399 358/426,427,428,429,430,431,432,433,457,260,261,262 382/232,233,234,235,236,237,238,239,250

U.S. Patent Documents
5708732    January 1998      Marhav et al.
5751358    May 1998          Suzuki et al.
5757968    May 1998          Ando
5767986    June 1998         Kondo et al.
5805914    September 1998    Wise et al.
5832121    November 1998     Ando
5881301    March 1999        Robbins
5883672    March 1999        Suzuki et al.
5886794    March 1999        Kondo et al.
5905845    May 1999          Okada et al.
5912708    June 1999         Kondo et al.
5923869    July 1999         Kashiwagi et al.
5926224    July 1999         Nagasawa
5937138    August 1999       Fukuda et al.
5959672    September 1999    Sasaki
6028634    February 2000     Yamaguchi et al.
Primary Examiner: Tadayon; Bijan
Assistant Examiner: Alavi; Amir
Attorney, Agent or Firm:Frommer Lawrence & Haug, LLP. Frommer; William S. Smid; Dennis M.

Claims


What is claimed is:
1. A picture encoding device for encoding a first picture using a second picture different in resolution from the first picture, said picture encoding device comprises:
enlarging/contracting means for enlarging or contracting said second picture based on the difference in resolution between the first and second pictures;
first picture encoding means for predictive coding said first picture using an output of said enlarging/contracting means as a reference picture;
second picture encoding means for encoding said second picture;
position setting means for setting positions of said first picture and said second picture in a pre-set absolute coordinate system and for outputting the first position information or the second position information of the position of said first or second picture, respectively; and
multiplexing means for multiplexing outputs of said first picture encoding means, said second picture encoding means, and said position setting means;
in which said first picture encoding means recognizes the position of said first picture based on said first position information and converts said second position information in response to an enlarging ratio or a contracting ratio by which said enlarging/contracting means has enlarged or contracted said second picture to obtain a position of said reference picture so as to perform predictive coding.

2. A picture encoding method for encoding a first picture using a second picture different in resolution from the first picture, said picture encoding method comprising the steps of:
enlarging or contracting said second picture based on the difference in resolution between the first and second pictures by using an enlarging/contracting device;
predictive coding said first picture using an output of said enlarging/contracting device as a reference picture by utilizing a first picture encoding device;
encoding said second picture by utilizing a second picture encoding device;
setting the positions of said first picture and said second picture in a pre-set absolute coordinate system and outputting the first position information or the second position information on the position of said first or second picture, respectively, by use of a position setting device; and
multiplexing outputs of said first picture encoding device, said second picture encoding device, and said position setting device;
in which said first picture encoding device is caused to recognize the position of said first picture based on said first position information and convert said second position information in response to an enlarging ratio or a contracting ratio by which said enlarging/contracting device has enlarged or contracted said second picture to obtain a position of said reference picture so as to perform predictive coding.

3. A picture decoding device for decoding encoded data obtained on predictive encoding of a first picture using a second picture different in resolution from said first picture, said picture decoding device comprises:
second picture decoding means for decoding said second picture;
enlarging/contracting means for enlarging/contracting said second picture decoded by said second picture decoding means based on the difference in resolution between said first and second pictures; and
first picture decoding means for decoding said first picture using an output of said enlarging/contracting means as a reference picture;
in which said encoded data includes first or second position information pertaining to the position of said first picture or said second picture in a pre-set absolute coordinate system; and
in which said first picture decoding means recognizes the position of said first picture based on said first position information and converts said second position information in response to an enlarging ratio or a contracting ratio by which said enlarging/contracting means has enlarged or contracted said second picture to obtain a position of said reference picture so as to decode said first picture.

4. The picture decoding device as in claim 3, further comprising display means for displaying the decoding results of said first picture decoding means.

5. A picture decoding method for decoding encoded data obtained on predictive encoding of a first picture using a second picture different in resolution from said first picture, said picture decoding method comprising the steps of:
decoding said second picture by using a second picture decoding device;
enlarging/contracting said second picture decoded by said second picture decoding device based on the difference in resolution between said first and second pictures by using an enlarging/contracting device; and
decoding said first picture using an output of said enlarging/contracting device as a reference picture by utilizing a first picture decoding device;
in which said encoded data includes first or second position information pertaining to the position of said first picture or said second picture in a pre-set absolute coordinate system; and
in which said first picture decoding device is caused to recognize the position of said first picture based on said first position information and convert said second position information in response to an enlarging ratio or a contracting ratio by which said enlarging/contracting device has enlarged or contracted said second picture to obtain a position of said reference picture so as to decode said first picture.

6. The picture decoding method as in claim 5, wherein the decoding results of said first picture decoding device are displayed.

7. A recording medium having recorded thereon encoded data obtained on encoding a first picture using a second picture different in resolution from the first picture said encoded data including at least first data obtained on predictive encoding said first picture using as a reference picture enlarged or contracted results obtained on enlarging or contracting said second picture based on the difference in resolution between said first and second pictures, second data obtained on encoding said second picture, and first or second position information obtained on setting positions of said first and second pictures in a pre-set absolute coordinate system; in which the position of said first picture is recognized based on said first position information and said second position information is converted in response to an enlarging ratio or contracting ratio by which said second picture has been enlarged or contracted to obtain a position of said reference picture so as to perform predictive coding.

8. A recording method for recording encoded data obtained on encoding a first picture using a second picture different in resolution from the first picture, in which said encoded data includes at least first data obtained on predictive encoding said first picture using as a reference picture enlarged or contracted results obtained on enlarging or contracting said second picture based on the difference in resolution between said first and second pictures, second data obtained on encoding said second picture, and first or second position information obtained on setting positions of said first and second pictures in a pre-set absolute coordinate system; wherein the position of said first picture is recognized based on said first position information and said second position information is converted in response to an enlarging ratio or contracting ratio by which said second picture has been enlarged or contracted to obtain a position of said reference picture so as to perform predictive coding.

9. A picture encoding device for encoding a first picture using a second picture different in resolution from the first picture, said picture encoding device comprises:
enlarging/contracting means for enlarging or contracting said second picture based on the difference in resolution between the first and second pictures;
first picture encoding means for predictive coding of said first picture using an output of said enlarging/contracting means as a reference picture;
second picture encoding means for encoding said second picture;
position setting means for setting positions of said first picture and said second picture in a pre-set absolute coordinate system and for outputting the first position information or the second position information of the position of said first or second picture, respectively; and
multiplexing means for multiplexing outputs of said first picture encoding means, said second picture encoding means, and said position setting means;
in which said position setting means sets the positions of said first and second pictures so that a position of said reference picture in said pre-set absolute coordinate system will be coincident with a pre-set position; and
in which said first picture encoding means recognizes the position of said first picture based on the first position information and recognizes the pre-set position to obtain a position of said reference picture so as to perform predictive coding.

10. A picture encoding method for encoding a first picture using a second picture different in resolution from the first picture, said picture encoding method comprising the steps of:
enlarging or contracting said second picture based on the difference in resolution between the first and second pictures by using an enlarging/contracting device;
predictive coding of said first picture using an output of said enlarging/contracting device as a reference picture by utilizing a first picture encoding device;
encoding said second picture by using a second picture encoding device;
setting the positions of said first picture and said second picture in a pre-set absolute coordinate system and outputting the first position information or the second position information on the position of said first or second picture, respectively, by use of a position setting device; and
multiplexing outputs of said first picture encoding device, said second picture encoding device, and said position setting device;
in which said position setting device is caused to set the positions of said first and second pictures so that a position of said reference picture in said pre-set absolute coordinate system will be coincident with the pre-set position; and
in which said first picture encoding device is caused to recognize the position of said first picture based on said first position information and to recognize said pre-set position to obtain a position of said reference picture so as to perform predictive coding.

11. A picture decoding device for decoding encoded data obtained on predictive encoding of a first picture using a second picture different in resolution from said first picture, said picture decoding device comprises:
second picture decoding means for decoding said second picture;
enlarging/contracting means for enlarging/contracting said second picture decoded by said second picture decoding means based on the difference in resolution between said first and second pictures; and
first picture decoding means for decoding said first picture using an output of said enlarging/contracting means as a reference picture;
in which said encoded data includes first or second position information pertaining to the position of said first picture or said second picture, respectively, in a pre-set absolute coordinate system;
in which the position of said reference picture in said pre-set absolute coordinate system has been set so as to be coincident with a pre-set position; and
in which said first picture decoding means recognizes the position of said first picture based on said first position information and recognizes the pre-set position to obtain a position of said reference picture so as to decode said first picture.

12. The picture decoding device as in claim 11, further comprising display means for displaying the decoding results of said first picture decoding means.

13. A picture decoding method for decoding encoded data obtained on predictive encoding of a first picture using a second picture different in resolution from said first picture, said picture decoding method comprising the steps of:
decoding said second picture by using a second picture decoding device;
enlarging/contracting said second picture decoded by said second picture decoding device based on the difference in resolution between said first and second pictures by using an enlarging/contracting device; and
decoding said first picture using an output of said enlarging/contracting device as a reference picture by utilizing a first picture decoding device;
in which said encoded data includes first or second position information pertaining to the position of said first picture or said second picture in a pre-set absolute coordinate system;
in which the position of said reference picture in said pre-set coordinate system has been set so as to coincide with a pre-set position; and
in which said first picture decoding device is caused to recognize the position of said first picture based on the first position information and to recognize the pre-set position to obtain a position of said reference picture so as to decode said first picture.

14. The picture decoding method as in claim 13, wherein the decoding results of said first picture decoding device are displayed.

15. The picture encoding device as in claim 1, wherein said multiplexing means multiplexes difference values obtained between values of the first position information and values of the second position information.

16. The picture encoding device as in claim 1, wherein if said first picture or said second picture is changed in size, said multiplexing means multiplexes first size information of said first picture or second size information of said second picture.

17. The picture encoding device as in claim 1, wherein said multiplexing means multiplexes difference values obtained between values of first size information of said first picture and values of the second size information of said second picture.

18. The picture decoding device as in claim 3, wherein said encoded data includes difference values obtained between values of first size information of said first picture and values of second size information of said second picture.

19. The picture decoding device as in claim 3, wherein if said first picture or said second picture is changed in size, said encoded data includes the first size information of said first picture and the second size information of said second picture.

20. The picture decoding device as in claim 19, wherein said encoded data includes difference values obtained between values of the first size information and values of the second size information.

21. A recording medium as in claim 7, wherein said encoded data includes difference values obtained between values of the first position information and values of the second position information.

22. A recording medium as in claim 7, wherein if said first picture or said second picture is changed in size, said encoded data includes the first size information of said first picture or the second size information of said second picture, respectively.

23. The recording medium as in claim 22, wherein said encoded data includes difference values obtained between values of first size information of said first picture and values of second size information of said second picture.

24. The picture encoding device as in claim 9, wherein said multiplexing means multiplexes difference values obtained between values of first size information and values of the second size information.

25. The picture encoding device as in claim 9, wherein if said first picture or said second picture is changed in size, said multiplexing means multiplexes first size information of said first picture or second size information of said second picture.

26. The picture encoding device as in claim 25, wherein said multiplexing means multiplexes difference values obtained between values of the first size information and values of the second size information.

27. The picture decoding device as in claim 11, wherein said encoded data includes difference values obtained between values of first size information of said first picture and values of second size information of said second picture.

28. The picture decoding device as in claim 11, wherein if said first picture or said second picture is changed in size, said encoded data includes the first size information of said first picture and the second size information of said second picture.

29. The picture decoding device as in claim 28, wherein said encoded data includes difference values obtained between values of the first size information and values of the second size information.

Description

BACKGROUND OF THE INVENTION

The present invention relates to a picture encoding and decoding technique, a picture processing technique, a recording technique, and a recording medium and, more particularly, to such techniques and recording medium for use in recording moving picture data onto a recording medium, such as a magneto-optical disc or a magnetic tape, reproducing the recorded data for display on a display system, or transmitting the moving picture data over a transmission channel from a transmitter to a receiver and receiving and displaying the transmitted data by the receiver or editing the received data for recording, as in a teleconferencing system, video telephone system, broadcast equipment, or in a multi-media database retrieving system.

In a system for transmitting moving picture data to a remote place, as in a teleconferencing system or video telephone system, picture data may be encoded (compressed) by exploiting line correlation and inter-frame correlation. A high-efficiency encoding system for moving pictures has been proposed by the Moving Picture Experts Group (MPEG). This system was proposed as a draft standard after discussions in ISO-IEC/JTC1/SC2/WG11, and is a hybrid system combining motion compensation predictive coding and the discrete cosine transform (DCT).

In MPEG, several profiles and levels are defined for coping with various types of applications and functions. The most basic is the main profile at main level (MP@ML).

FIG. 1 illustrates an MP@ML encoding unit in an MPEG system. In such encoding unit, picture data to be encoded is supplied to a frame memory 31 for transient storage therein. A motion vector detector 32 reads out picture data stored in the frame memory 31 on a 16×16 pixel macro-block basis so as to detect its motion vector. The motion vector detector 32 processes picture data of each frame as an I-picture, a P-picture, or a B-picture. Each of the pictures of the sequentially entered frames is processed as one of the I-, P- or B-pictures in a pre-set manner, such as in a sequence of I, B, P, B, P, . . . , B, P. That is, the motion vector detector 32 refers to a pre-set reference frame in a series of pictures stored in the frame memory 31 and detects the motion vector of a macro-block, that is, a small block of 16 pixels by 16 lines of the frame being encoded, by pattern matching (block matching) between the macro-block and the reference frame.
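As an illustration only, the block matching described above can be sketched as an exhaustive search over candidate displacements. The matching criterion (sum of absolute differences), the search range, and the names sad and find_motion_vector are assumptions made for this sketch; the text does not specify them.

```python
def sad(cur, ref, bx, by, dx, dy, n):
    """Sum of absolute differences between the n x n block of `cur` at
    (bx, by) and the block of `ref` displaced by (dx, dy)."""
    total = 0
    for y in range(n):
        for x in range(n):
            total += abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
    return total


def find_motion_vector(cur, ref, bx, by, n=16, search=4):
    """Exhaustive block matching: return ((dx, dy), error) for the best match.

    The SAD criterion and the +/- `search` pixel range are assumptions."""
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            # Skip candidates that fall outside the reference frame.
            if not (0 <= bx + dx and bx + dx + n <= width):
                continue
            if not (0 <= by + dy and by + dy + n <= height):
                continue
            err = sad(cur, ref, bx, by, dx, dy, n)
            if best is None or err < best[1]:
                best = ((dx, dy), err)
    return best
```

The returned error is the prediction error of the best candidate, which is the quantity a mode decision would then compare against a measure of the macro-block itself.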

In MPEG, there are four picture prediction modes, that is, intra-coding (intra-frame coding), forward predictive coding, backward predictive coding, and bidirectional predictive coding. An I-picture is an intra-coded picture, a P-picture is an intra-coded or forward predictive coded picture, and a B-picture is an intra-coded, forward predictive coded, backward predictive coded, or bidirectional predictive coded picture.

Returning to FIG. 1, the motion vector detector 32 performs forward prediction on a P-picture to detect its motion vector. The motion vector detector 32 compares the prediction error produced by performing forward prediction to, for example, the variance of the macro-block being encoded (macro-block of the P-picture). If the variance of the macro-block is smaller than the prediction error, the intra-coding mode is set as the prediction mode and outputted to a variable length coding (VLC) unit 36 and to a motion compensator 42. On the other hand, if the prediction error generated by the forward prediction coding is smaller, the motion vector detector 32 sets the forward predictive coding mode as the prediction mode and outputs the set mode to the VLC unit 36 and the motion compensator 42 along with the detected motion vector. Additionally, the motion vector detector 32 performs forward prediction, backward prediction, and bidirectional prediction for a B-picture to detect the respective motion vectors. The motion vector detector 32 detects the smallest prediction error of forward prediction, backward prediction, and bidirectional prediction (referred to herein as the minimum prediction error) and compares the minimum prediction error to, for example, the variance of the macro-block being encoded (macro-block of the B-picture). If, as a result of such comparison, the variance of the macro-block is smaller than the minimum prediction error, the motion vector detector 32 sets the intra-coding mode as the prediction mode, and outputs the set mode to the VLC unit 36 and the motion compensator 42. If, on the other hand, the minimum prediction error is smaller, the motion vector detector 32 sets the prediction mode for which the minimum prediction error has been obtained, and outputs the prediction mode thus set to the VLC unit 36 and the motion compensator 42 along with the associated motion vector.
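The mode decision described above may be sketched as follows. This is a minimal illustration, not the circuit of FIG. 1: the function names, the dictionary of candidate errors, and the use of the sum of absolute deviations from the block mean as the "variance" measure are all assumptions introduced here.

```python
def activity(block):
    """Sum of absolute deviations from the block mean, used here as a
    stand-in for the 'variance' of the macro-block (an assumption)."""
    pixels = [p for row in block for p in row]
    mean = sum(pixels) / len(pixels)
    return sum(abs(p - mean) for p in pixels)


def select_mode(block, errors):
    """Pick the prediction mode for one macro-block.

    `errors` maps each candidate mode to its prediction error, for example
    {"forward": e} for a P-picture, or forward, backward and bidirectional
    entries for a B-picture."""
    best_mode = min(errors, key=errors.get)   # minimum prediction error
    if activity(block) < errors[best_mode]:
        return "intra"                        # block is cheaper to code alone
    return best_mode
```

For a P-picture the dictionary holds a single forward entry, so the same function covers both picture types.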

Upon receiving the prediction mode and the motion vector from the motion vector detector 32, the motion compensator 42 may read out encoded and already locally decoded picture data stored in the frame memory 41 in accordance with the prediction mode and the motion vector and may supply the read-out data as a prediction picture to arithmetic units 33 and 40. The arithmetic unit 33 also receives the same macro-block as the picture data read out by the motion vector detector 32 from the frame memory 31 and calculates the difference between the macro-block and the prediction picture from the motion compensator 42. Such difference value is supplied to a discrete cosine transform (DCT) unit 34.

If only the prediction mode is received from the motion vector detector 32, that is, if the prediction mode is the intra-coding mode, the motion compensator 42 may not output a prediction picture. In such situation, the arithmetic unit 33 may not perform the above-described processing, but instead may directly output the macro-block read out from the frame memory 31 to the DCT unit 34. Also, in such situation, the arithmetic unit 40 may perform in a similar manner.

The DCT unit 34 performs DCT processing on the output signal from the arithmetic unit 33 so as to obtain DCT coefficients which are supplied to a quantizer 35. The quantizer 35 sets a quantization step (quantization scale) in accordance with the data storage quantity in a buffer 37 (data volume stored in the buffer 37) received as a buffer feedback and quantizes the DCT coefficients from the DCT unit 34 using the quantization step. The quantized DCT coefficients (sometimes referred to herein as quantization coefficients) are supplied to the VLC unit 36 along with the set quantization step.

The VLC unit 36 converts the quantization coefficients supplied from the quantizer 35 into a variable length code, such as a Huffman code, in accordance with the quantization step supplied from the quantizer 35. The resulting converted quantization coefficients are outputted to the buffer 37. The VLC unit 36 also variable length encodes the quantization step from the quantizer 35, the prediction mode from the motion vector detector 32, and the motion vector from the motion vector detector 32, and outputs the encoded data to the buffer 37. It should be noted that the prediction mode is a mode specifying which of the intra-coding, forward predictive coding, backward predictive coding, or bidirectional predictive coding has been set.

The buffer 37 transiently stores data from the VLC unit 36 and smooths out the data volume so as to enable smoothed data to be outputted therefrom and supplied to a transmission channel or to be recorded on a recording medium or the like. The buffer 37 may also supply the stored data volume to the quantizer 35 which sets the quantization step in accordance therewith. As such, in the case of impending overflow of the buffer 37, the quantizer 35 increases the quantization step size so as to decrease the data volume of the quantization coefficients. Conversely, in the case of impending underflow of the buffer 37, the quantizer 35 decreases the quantization step size so as to increase the data volume of the quantization coefficients. As is to be appreciated, this procedure may prevent overflow and underflow of the buffer 37.
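The buffer feedback described above can be illustrated with a small sketch. The occupancy thresholds, the unit increment, and the step range of 1 to 31 are assumptions chosen for the example; the text only states the direction in which the step changes.

```python
def next_quantization_step(step, fullness, capacity, low=0.25, high=0.75,
                           min_step=1, max_step=31):
    """Adjust the quantization step from the buffer occupancy.

    Thresholds, increment and step range are illustrative assumptions."""
    occupancy = fullness / capacity
    if occupancy > high:      # impending overflow: quantize more coarsely
        step += 1
    elif occupancy < low:     # impending underflow: quantize more finely
        step -= 1
    return max(min_step, min(max_step, step))
```

A larger step yields fewer and smaller quantization coefficients, so the buffer drains; a smaller step has the opposite effect.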

The quantization coefficients and the quantization step outputted by the quantizer 35 are supplied not only to the VLC unit 36, but also to a dequantizer 38 which dequantizes the quantization coefficients in accordance with the quantization step so as to convert the same to DCT coefficients. Such DCT coefficients are supplied to an IDCT (inverse DCT) unit 39 which performs inverse DCT on the DCT coefficients. The obtained inverse DCTed coefficients are supplied to the arithmetic unit 40.
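As a rough sketch of the quantizer 35 and dequantizer 38 pair, assuming plain uniform quantization with a single step for all coefficients (the text does not give the exact quantization rule, and the function names are hypothetical):

```python
def quantize(coeffs, step):
    """Map each DCT coefficient to the nearest integer multiple of `step`."""
    return [[round(c / step) for c in row] for row in coeffs]


def dequantize(levels, step):
    """Scale the levels back; the rounding loss is not recoverable."""
    return [[q * step for q in row] for row in levels]
```

Under this assumption, dequantizing returns each coefficient to within half a step of its original value, which is why the encoder reconstructs from the dequantized data rather than from the original coefficients.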

The arithmetic unit 40 receives the inverse DCT coefficients from the IDCT unit 39 and data from the motion compensator 42 which are the same as the prediction picture sent to the arithmetic unit 33. The arithmetic unit 40 adds the signal (prediction residuals) from the IDCT unit 39 to the prediction picture from the motion compensator 42 to locally decode the original picture. However, if the prediction mode indicates intra-coding, the output of the IDCT unit 39 may be fed directly to the frame memory 41. The decoded picture (locally decoded picture) obtained by the arithmetic unit 40 is sent to and stored in the frame memory 41 so as to be used later as a reference picture for an inter-coded picture, forward predictive coded picture, backward predictive coded picture, or bidirectional predictive coded picture.
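The local decoding step can be sketched as below. Clipping to the 8-bit range 0 to 255 and the function name reconstruct are assumptions of this sketch, not details given in the text.

```python
def reconstruct(residual, prediction=None, lo=0, hi=255):
    """Locally decode one block.

    With a prediction, the residual is added to it; without one (intra-coding)
    the IDCT output is used directly. Clipping to [lo, hi] is an assumption."""
    rows = []
    for y, row in enumerate(residual):
        out = []
        for x, r in enumerate(row):
            p = prediction[y][x] if prediction is not None else 0
            out.append(max(lo, min(hi, r + p)))
        rows.append(out)
    return rows
```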

The decoded picture obtained from the arithmetic unit 40 is the same as that which may be obtained from a receiver or decoding unit (not shown in FIG. 1).

FIG. 2 illustrates an MP@ML decoder in an MPEG system for decoding encoded data such as that outputted by the encoder of FIG. 1. In such decoder, encoded data transmitted via a transmission path may be received by a receiver (not shown) or encoded data recorded on a recording medium may be reproduced by a reproducing device (not shown) and supplied to a buffer 101 and stored thereat. An IVLC unit (inverse VLC unit) 102 reads out encoded data stored in the buffer 101 and variable length decodes the same so as to separate the encoded data into a motion vector, prediction mode, quantization step and quantization coefficients. Of these, the motion vector and the prediction mode are supplied to a motion compensator 107, while the quantization step and quantization coefficients are supplied to a dequantizer 103. The dequantizer 103 dequantizes the quantization coefficients in accordance with the quantization step so as to obtain DCT coefficients which are supplied to an IDCT (inverse DCT) unit 104. The IDCT unit 104 performs an inverse DCT operation on the received DCT coefficients and supplies the resulting signal to an arithmetic unit 105. In addition to the output of the IDCT unit 104, the arithmetic unit 105 also receives an output from the motion compensator 107. That is, the motion compensator 107 reads out a previously decoded picture stored in a frame memory 106 in accordance with the prediction mode and the motion vector from the IVLC unit 102 in a manner similar to that of the motion compensator 42 of FIG. 1 and supplies the read-out decoded picture as a prediction picture to the arithmetic unit 105. The arithmetic unit 105 adds the signal from the IDCT unit 104 (prediction residuals) to the prediction picture from the motion compensator 107 so as to decode the original picture. If the output of the IDCT unit 104 is intra-coded, such output may be directly supplied to and stored in the frame memory 106. The decoded picture stored in the frame memory 106 may be used as a reference picture for subsequently decoded pictures, and also may be read out and supplied to a display (not shown) so as to be displayed thereon. However, if the decoded picture is a B-picture, such B-picture is not stored in the frame memories 41 (FIG. 1) or 106 (FIG. 2) in the encoding unit or decoder, since a B-picture is not used as a reference picture in MPEG1 and MPEG2.

In MPEG, a variety of profiles and levels as well as a variety of tools are defined in addition to the above-described MP@ML. An example of an MPEG tool is scalability. More specifically, MPEG adopts a scalable encoding system for coping with different picture sizes or different frame sizes. In spatial scalability, if only a lower-layer bitstream is decoded, for example, only a picture with a small picture size is obtained, whereas, if both lower-layer and upper-layer bitstreams are decoded, a picture with a large picture size is obtained.

FIG. 3 illustrates an encoding unit for providing spatial scalability. In spatial scalability, the lower and upper layers are associated with picture signals of a small picture size and those with a large picture size, respectively. The upper-layer encoding unit 201 may receive an upper-layer picture for encoding, whereas, the lower-layer encoding unit 202 may receive a picture resulting from a thinning out process for reducing the number of pixels (hence a picture lowered in resolution for diminishing its size) as a lower-layer picture. The lower-layer encoding unit 202 predictively encodes a lower-layer picture in a manner similar to that of FIG. 1 so as to form and output a lower-layer bitstream. The lower-layer encoding unit 202 also generates a picture corresponding to the locally decoded lower-layer picture enlarged to the same size as the upper-layer picture size (occasionally referred to herein as an enlarged picture). This enlarged picture is supplied to the upper-layer encoding unit 201. The upper-layer encoding unit 201 predictively encodes an upper-layer picture in a manner similar to that of FIG. 1 so as to form and output an upper-layer bitstream. The upper layer encoding unit 201 also uses the enlarged picture received from the lower-layer encoding unit 202 as a reference picture for executing predictive coding. The upper layer bitstream and the lower layer bitstream are multiplexed to form encoded data which is outputted.
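The thinning out process that produces the lower-layer picture might look like the following sketch, which simply keeps every second pixel in each direction. Decimation without any pre-filtering, the factor of 2, and the name thin_out are assumptions made for illustration.

```python
def thin_out(picture, factor=2):
    """Reduce resolution by keeping every `factor`-th pixel and line.

    Plain decimation with no low-pass filter is an assumption of this sketch."""
    return [row[::factor] for row in picture[::factor]]
```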

FIG. 4 illustrates an example of the lower layer encoding unit 202 of FIG. 3. Such lower layer encoding unit 202 is similarly constructed to the encoder of FIG. 1 except for an upsampling unit 211. Accordingly, in FIG. 4, parts or components corresponding to those shown in FIG. 1 are depicted by the same reference numerals. The upsampling unit 211 upsamples (interpolates) a locally decoded lower-layer picture outputted by the arithmetic unit 40 so as to enlarge the picture to the same size as the upper layer picture size and supplies the resulting enlarged picture to the upper layer encoding unit 201.
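The thinning out process of FIG. 3 and the upsampling performed by the upsampling unit 211 can be illustrated by the following minimal sketch. It assumes an integer scaling factor, simple pixel dropping for thinning out, and pixel replication for enlargement; the text above does not specify the interpolation filter, so these choices and the names thin_out and upsample_nearest are illustrative assumptions only.

```python
def thin_out(picture, factor):
    """Reduce a 2-D list of pixels by keeping every `factor`-th row and column.

    Assumption: plain pixel dropping with no low-pass filtering.
    """
    if factor < 1:
        raise ValueError("factor must be a positive integer")
    return [row[::factor] for row in picture[::factor]]


def upsample_nearest(picture, factor):
    """Enlarge a 2-D list of pixels by replicating each pixel `factor` times
    horizontally and vertically.

    Assumption: pixel replication stands in for the unspecified interpolation.
    """
    if factor < 1:
        raise ValueError("factor must be a positive integer")
    enlarged = []
    for row in picture:
        # Repeat every pixel of the row `factor` times horizontally.
        wide_row = [pixel for pixel in row for _ in range(factor)]
        # Repeat the widened row `factor` times vertically.
        for _ in range(factor):
            enlarged.append(list(wide_row))
    return enlarged


# Hypothetical 2x2 lower-layer picture enlarged to a 4x4 upper-layer size.
lower = [[10, 20], [30, 40]]
enlarged = upsample_nearest(lower, 2)
```

With these particular choices, thinning out the enlarged picture by the same factor returns the original lower-layer picture; a real encoder would typically use a smoother interpolation filter.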

FIG. 5 illustrates an example of the upper layer encoding unit 201 of FIG. 3. Such upper layer encoding unit 201 is similarly constructed to the encoder of FIG. 1 except for weighing addition units 221, 222 and an arithmetic unit 223. Accordingly, in FIG. 5, parts or components corresponding to those of FIG. 1 are denoted by the same reference numerals. The weighing addition unit 221 multiplies a prediction picture outputted by the motion compensator 42 by a weight W and outputs the resulting product to the arithmetic unit 223. The weighing addition unit 222 multiplies the enlarged picture supplied from the lower layer encoding unit 202 by a weight (1-W) and supplies the resulting product to the arithmetic unit 223. The arithmetic unit 223 sums the received outputs from the weighing addition units 221, 222 and outputs the resulting sum to the arithmetic units 33, 40 as a predicted picture. The weighing W used in the weighing addition unit 221 is pre-set, as is the weighing (1-W) used in the weighing addition unit 222. The weighing W is supplied to the VLC unit 36 for variable length encoding. In other respects, the upper layer encoding unit 201 performs processing similar to that of FIG. 1.

Thus the upper layer encoding unit 201 performs predictive encoding using not only the upper layer picture, but also the enlarged picture from the lower layer encoding unit 202, that is, a lower layer picture, as a reference picture.
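The weighted summation carried out by the weighing addition units 221, 222 and the arithmetic unit 223 can be written per pixel as W times the motion-compensated prediction plus (1-W) times the enlarged picture. The following is a minimal sketch of that sum, assuming both pictures are 2-D lists of equal size; the function name weighted_prediction is hypothetical.

```python
def weighted_prediction(motion_compensated, enlarged, w):
    """Return w * motion_compensated + (1 - w) * enlarged, pixel by pixel.

    `motion_compensated` plays the role of the prediction picture from the
    motion compensator, and `enlarged` the enlarged lower-layer picture.
    Assumption: both are 2-D lists with identical dimensions.
    """
    return [
        [w * p + (1.0 - w) * e for p, e in zip(row_p, row_e)]
        for row_p, row_e in zip(motion_compensated, enlarged)
    ]


# Hypothetical pixel values for illustration only.
upper_prediction = [[100, 200]]
enlarged_picture = [[50, 100]]
predicted = weighted_prediction(upper_prediction, enlarged_picture, 0.25)
```

With W equal to 1 the prediction relies only on the upper layer, and with W equal to 0 it relies only on the enlarged lower-layer picture. Because W is transmitted in the bitstream, a decoder can repeat the same sum.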

FIG. 6 illustrates an example of a decoder for implementing spatial scalability. Output encoded data from the encoder of FIG. 3 is separated into an upper layer bitstream and a lower layer bitstream which are supplied to an upper layer decoding unit 231 and to a lower layer decoding unit 232, respectively. The lower layer decoding unit 232 decodes the lower layer bitstream as in FIG. 2 and outputs the resulting decoded picture of the lower layer. In addition, the lower layer decoding unit 232 enlarges the lower layer decoded picture to the same size as the upper layer picture to generate an enlarged picture and supplies the same to the upper layer decoding unit 231. The upper layer decoding unit 231 similarly decodes the upper layer bitstream, as in FIG. 2. However, the upper layer decoding unit 231 decodes the bitstream using the enlarged picture from the lower layer decoding unit 232 as a reference picture.

FIG. 7 illustrates an example of the lower layer decoding unit 232. The lower layer decoding unit 232 is similarly constructed to the decoder of FIG. 2 except for an upsampling unit 241. Accordingly, in FIG. 7, parts or components corresponding to those of FIG. 2 are depicted by the same reference numerals. The upsampling unit 241 upsamples (interpolates) the decoded lower layer picture outputted by the arithmetic unit 105 so as to enlarge the lower layer picture to the same size as the upper layer picture size and outputs the enlarged picture to the upper layer decoder 231.

FIG. 8 illustrates an example of the upper layer decoding unit 231 of FIG. 6. The upper layer decoding unit 231 is similarly constructed to the decoder of FIG. 2 except for weighing addition units 251, 252 and an arithmetic unit 253. Accordingly, in FIG. 8, parts or components corresponding to those of FIG. 2 are depicted by the same reference numerals. In addition to performing the processing explained with reference to FIG. 2, the IVLC unit 102 extracts the weighing W from the encoded data and outputs the extracted weighing W to the weighing addition units 251, 252. The weighing addition unit 251 multiplies the prediction picture outputted by the motion compensator 107 by the weighing W and outputs the resulting product to the arithmetic unit 253. The arithmetic unit 253 also receives an output from the weighing addition unit 252. Such output is obtained by multiplying the enlarged picture supplied from the lower layer decoding unit 232 by the weighing (1-W). The arithmetic unit 253 sums the outputs of the weighing addition units 251, 252 and supplies the summed output as a prediction picture to the arithmetic unit 105. Therefore, the arithmetic unit 253 uses the upper layer picture and the enlarged picture from the lower layer decoding unit 232, that is, the lower layer picture, as reference pictures, for decoding. Such processing is performed on both luminance signals and chroma signals. The motion vector for the chroma signals may be one-half as large as the motion vector for the luminance signals.

In addition to the above-described MPEG system, a variety of high-efficiency encoding systems have been standardized for moving pictures. In ITU-T, for example, systems such as H.261 or H.263 have been prescribed mainly as encoding systems for communication. Similar to the MPEG system, these H.261 and H.263 systems basically involve a combination of motion compensation prediction encoding and DCT encoding. Specifically, the H.261 and H.263 systems may be basically similar in structure to the encoder or the decoder of the MPEG system, although differences in the structure thereof or in the details such as header information may exist.

In a picture synthesis system for constituting a picture by synthesizing plural pictures, a so-called chroma key technique may be used. This technique photographs an object in front of a background of a specified uniform color, such as blue, extracts an area other than the blue therefrom, and synthesizes the extracted area to another picture. The signal specifying the extracted area is termed a key signal.

FIG. 9 illustrates a method for synthesizing a picture where F1 is a background picture and F2 is a foreground picture. The picture F2 is obtained by photographing an object, herein a person, in front of a background of a specified uniform color and extracting an area other than this color. The key signal K1 specifies the extracted area. In the picture synthesis system, the background picture F1 and the foreground picture F2 are synthesized in accordance with the key signal K1 to generate a synthesized picture F3. This synthesized picture is encoded, such as by an MPEG technique, and transmitted.

If the synthesized picture F3 is encoded and transmitted as described above, only the encoded data on the synthesized picture F3 is transmitted, so that the information such as the key signal K1 may be lost. As such, picture re-editing or re-synthesis for keeping the foreground F2 intact and changing only the background F1 becomes difficult to perform on the receiving side.

Consider a method in which the pictures F1, F2 and the key signal K1 are separately encoded and the resulting respective bitstreams are multiplexed as shown, for example, in FIG. 10. In such case, the receiving side demultiplexes the multiplexed data to decode the respective bitstreams and produce the pictures F1, F2 and the key signal K1. The decoded results of the pictures F1, F2 and the key signal K1 may be synthesized so as to generate the synthesized picture F3. In such case, the receiving side may perform picture re-editing or re-synthesis such that the foreground F2 is kept intact and only the background F1 is changed.
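The synthesis of F3 from F1, F2 and K1 on the receiving side can be illustrated by the following minimal sketch. It assumes a binary key signal (1 inside the extracted area, 0 elsewhere) and pictures held as 2-D lists of equal size; the function name synthesize and the sample values are hypothetical.

```python
def synthesize(background, foreground, key):
    """Build a synthesized picture: take the foreground pixel where the key
    signal is set and the background pixel elsewhere.

    Assumption: `key` is binary and all three inputs share the same dimensions.
    """
    return [
        [fg if k else bg for bg, fg, k in zip(row_bg, row_fg, row_k)]
        for row_bg, row_fg, row_k in zip(background, foreground, key)
    ]


# Hypothetical 2x2 pictures: F1 (background), F2 (foreground), K1 (key signal).
f1 = [[1, 1], [1, 1]]
f2 = [[9, 9], [9, 9]]
k1 = [[0, 1], [1, 0]]
f3 = synthesize(f1, f2, k1)

# Re-synthesis with a different background while the foreground is kept intact.
new_background = [[5, 5], [5, 5]]
f3_new = synthesize(new_background, f2, k1)
```

Because F2 and K1 are available separately, the second call changes only the background, which is the re-editing that is not possible when only F3 is transmitted.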

Therefore, the synthesized picture F3 is made up of the pictures F1 and F2. In a similar manner, any picture may be thought of as being made up of plural pictures or objects. If units that go to make up a picture are termed video objects (VOs), an operation for standardizing a VO based encoding system is underway in ISO-IEC/JTC1/SC29/WG11 as MPEG 4. However, at present, a method for efficiently encoding a VO or encoding key signals has not yet been established and is in a pending state. In any event, although MPEG 4 prescribes the function of scalability, there has not been proposed a specified technique for realization of scalability for a VO in which the position and size thereof change with time. As an example, if the VO is a person approaching from a distant place, the position and the size of the VO change with time. Therefore, if a picture of a lower layer is used as a reference picture in predictive encoding of the upper layer picture, it may be necessary to clarify the relative position between the picture of the upper layer and the lower layer picture used as a reference picture. On the other hand, in using VO-based scalability, the condition for a skip macro-block of the lower layer is not necessarily directly applicable to that for a skip macro-block of the upper layer.

OBJECTS AND SUMMARY OF THE INVENTION

It is therefore an object of the present invention to provide a technique which enables VO-based encoding to be easily achieved.

In accordance with an aspect of the present invention, a picture encoding device is provided which includes enlarging/contracting means for enlarging or contracting a second picture based on the difference in resolution between first and second pictures (such as a resolution converter 24 shown in FIG. 15), first picture encoding means for predictive coding the first picture using an output of the enlarging/contracting means as a reference picture (such as an upper layer encoding unit 23 shown in FIG. 15), second picture encoding means for encoding the second picture (such as a lower layer encoding unit 25), position setting means for setting the positions of the first picture and the second picture in a pre-set absolute coordinate system and outputting first or second position information on the position of the first or second picture, respectively (such as a picture layering unit 21 shown in FIG. 15), and multiplexing means for multiplexing outputs of the first picture encoding means, second picture encoding means, and the position setting means (such as a multiplexer 26 shown in FIG. 15). The first picture encoding means recognizes the position of the first picture based on the first position information and converts the second position information in response to an enlarging ratio or a contracting ratio by which the enlarging/contracting means has enlarged or contracted the second picture. The first picture encoding means also recognizes the position corresponding to the results of conversion as the position of the reference picture in order to perform predictive coding.

In accordance with another aspect of the present invention, a picture encoding device for encoding is provided which includes enlarging/contracting means for enlarging or contracting a second picture based on the difference in resolution between first and second pictures (such as the resolution converter 24 shown in FIG. 15), first picture encoding means for predictive coding the first picture using an output of the enlarging/contracting means as a reference picture (such as the upper layer encoding unit 23 shown in FIG. 15), second picture encoding means for encoding the second picture (such as the lower layer encoding unit 25), position setting means for setting the positions of the first picture and the second picture in a pre-set absolute coordinate system and outputting first or second position information on the position of the first or second picture, respectively (such as the picture layering unit 21 shown in FIG. 15), and multiplexing means for multiplexing outputs of the first picture encoding means, second picture encoding means, and the position setting means (such as the multiplexer 26 shown in FIG. 15). The first picture encoding means is caused to recognize the position of the first picture based on the first position information and to convert the second position information in response to an enlarging ratio or a contracting ratio by which the enlarging/contracting means has enlarged or contracted the second picture. The first picture encoding means recognizes the position corresponding to the results of conversion as the position of the reference picture in order to perform predictive coding.

In accordance with the above picture encoding device and a picture encoding method, the enlarging/contracting means enlarges or contracts the second picture based on the difference in resolution between the first and second pictures, while the first picture encoding means predictively encodes the first picture using an output of the enlarging/contracting means as a reference picture. The position setting means sets the positions of the first picture and the second picture in a pre-set absolute coordinate system and outputs the first position information or the second position information on the position of the first or second picture, respectively. The first picture encoding means recognizes the position of the first picture, based on the first position information, and converts the second position information responsive to an enlarging ratio or a contracting ratio by which the enlarging/contracting means has enlarged or contracted the second picture. The first picture encoding means recognizes the position corresponding to the results of conversion as the position of the reference picture in order to perform predictive coding.
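The conversion of the second position information can be illustrated by the following minimal sketch. It assumes that positions are (x, y) offsets of the top-left corner of each picture in the absolute coordinate system and that the conversion is a multiplication by the enlarging ratio; the names convert_position and reference_offset, and the sample values, are hypothetical.

```python
def convert_position(second_position, ratio):
    """Scale the second (lower layer) position by the enlarging or contracting
    ratio to obtain the position of the reference picture.

    Assumption: the conversion is a plain multiplication of both coordinates.
    """
    x, y = second_position
    return (x * ratio, y * ratio)


def reference_offset(first_position, second_position, ratio):
    """Return the displacement of the reference picture relative to the first
    (upper layer) picture, both expressed in the absolute coordinate system."""
    ref_x, ref_y = convert_position(second_position, ratio)
    return (ref_x - first_position[0], ref_y - first_position[1])


# Hypothetical positions: lower layer at (8, 4), enlarged by a ratio of 2.
reference_position = convert_position((8, 4), 2)
aligned = reference_offset((16, 8), (8, 4), 2)
shifted = reference_offset((12, 8), (8, 4), 2)
```

An offset of (0, 0) means that corresponding pixels of the enlarged picture and the first picture lie at the same absolute positions; a non-zero offset tells the encoder where the reference picture lies relative to the first picture.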

In accordance with another aspect of the present invention, a picture decoding device is provided which includes second picture decoding means for decoding a second picture (such as a lower layer decoding unit 95), enlarging/contracting means for enlarging/contracting the second picture decoded by the second picture decoding means based on the difference in resolution between first and second pictures (such as a resolution converter 94 shown in FIG. 29), and first picture decoding means for decoding the first picture using an output of the enlarging/contracting means as a reference picture (such as an upper layer decoding unit 93 shown in FIG. 29). The encoded data includes first or second position information on the position of the first and second picture, respectively, in a pre-set absolute coordinate system. The first picture decoding means recognizes the position of the first picture based on the first position information and converts the second position information in response to an enlarging ratio or a contracting ratio by which the enlarging/contracting means has enlarged or contracted the second picture. The first picture decoding means also recognizes the position corresponding to the results of conversion as the position of the reference picture in order to decode the first picture.

The above picture decoding device may include a display for displaying decoding results of the first picture decoding means (such as a monitor 74 shown in FIG. 27).

In accordance with another aspect of the present invention, a picture decoding device is provided which includes second picture decoding means for decoding a second picture (such as a lower layer decoding unit 95 shown in FIG. 29), enlarging/contracting means for enlarging/contracting the second picture decoded by the second picture decoding means based on the difference in resolution between first and second pictures (such as a resolution converter 94 shown in FIG. 29), and first picture decoding means for decoding the first picture using an output of the enlarging/contracting means as a reference picture (such as an upper layer decoding unit 93). The encoded data includes first and second position information on the position of the first and the second picture, respectively, in a pre-set absolute coordinate system. The first picture decoding means is caused to recognize the position of the first picture based on the first position information and to convert the second position information in response to an enlarging ratio or a contracting ratio by which the enlarging/contracting means has enlarged or contracted the second picture. The first picture decoding means recognizes the position corresponding to the results of conversion as the position of the reference picture in order to decode the first picture.

In accordance with the above picture decoding device and a picture decoding method, the enlarging/contracting means enlarges or contracts the second picture decoded by the second picture decoding means based on the difference in resolution between the first and second pictures. The first picture decoding means decodes the first picture using an output of the enlarging/contracting means as a reference picture. If the encoded data includes the first position information or the second position information on the position of the first picture and on the position of the second picture, respectively, in a pre-set absolute coordinate system, the first picture decoding means recognizes the position of the first picture, based on the first position information, and converts the second position information responsive to an enlarging ratio or a contracting ratio by which the enlarging/contracting means has enlarged or contracted the second picture. The first picture decoding means recognizes the position corresponding to the results of conversion as the position of the reference picture, in order to decode the first picture.

In accordance with another aspect of the present invention, a recording medium is provided which has recorded thereon encoded data including first data obtained on predictive encoding a first picture using, as a reference picture, the enlarged or contracted results obtained on enlarging or contracting a second picture based on the difference in resolution between the first and second pictures, second data obtained on encoding the second picture, and first position information or second position information obtained on setting the positions of the first and second pictures in a pre-set absolute coordinate system. The first data is obtained on recognizing the position of the first picture based on the first position information, converting the second position information in response to the enlarging ratio or contracting ratio by which the second picture has been enlarged or contracted, and on recognizing the position corresponding to the results of conversion as the position of the reference picture in order to perform predictive coding.

In accordance with another aspect of the present invention, a method for recording encoded data is provided wherein, the encoded data includes first data obtained on predictive encoding a first picture using, as a reference picture, the enlarged or contracted results obtained on enlarging or contracting a second picture based on the difference in resolution between the first and second pictures, second data obtained on encoding the second picture, and first position information or second position information obtained on setting the positions of the first and second pictures in a pre-set absolute coordinate system. The first data is obtained on recognizing the position of the first picture based on the first position information, converting the second position information in response to the enlarging ratio or contracting ratio by which the second picture has been enlarged or contracted and on recognizing the position corresponding to the results of conversion as the position of the reference picture in order to perform predictive coding.

In accordance with another aspect of the present invention, a picture encoding device is provided which includes enlarging/contracting means for enlarging or contracting a second picture based on the difference in resolution between first and second pictures (such as the resolution converter 24 shown in FIG. 15), first picture encoding means for predictive coding the first picture using an output of the enlarging/contracting means as a reference picture (such as the upper layer encoding unit 23 shown in FIG. 15), second picture encoding means for encoding the second picture (such as the lower layer encoding unit 25 shown in FIG. 15), position setting means for setting the positions of the first picture and the second picture in a pre-set absolute coordinate system and outputting the first position information or the second position information on the position of the first or second picture, respectively (such as a picture layering unit 21 shown in FIG. 15), and multiplexing means for multiplexing outputs of the first picture encoding means, second picture encoding means, and the position setting means (such as the multiplexer 26 shown in FIG. 15). The position setting means sets the positions of the first and second pictures so that the position of the reference picture in a pre-set absolute coordinate system will be coincident with a pre-set position. The first picture encoding means recognizes the position of the first picture based on the first position information and also recognizes the pre-set position as the position of the reference picture in order to perform predictive coding.

In accordance with another aspect of the present invention, a picture encoding device for performing picture encoding is provided which includes enlarging/contracting means for enlarging or contracting a second picture based on the difference in resolution between first and second pictures (such as the resolution converter 24 shown in FIG. 15), first picture encoding means for predictive coding of the first picture using an output of the enlarging/contracting means as a reference picture (such as the upper layer encoding unit 23 shown in FIG. 15), second picture encoding means for encoding the second picture (such as the lower layer encoding unit 25 shown in FIG. 15), position setting means for setting the positions of the first picture and the second picture in a pre-set absolute coordinate system and outputting first position information or second position information on the position of the first or second picture, respectively (such as a picture layering unit 21 shown in FIG. 15), and multiplexing means for multiplexing outputs of the first picture encoding means, second picture encoding means, and the position setting means (such as the multiplexer 26 shown in FIG. 15). The position setting means causes the positions of the first and second pictures to be set so that the position of the reference picture in a pre-set absolute coordinate system will be coincident with the pre-set position. The first picture encoding means is caused to recognize the position of the first picture based on the first position information and to recognize the pre-set position as the position of the reference picture in order to perform predictive coding.

In accordance with the above picture encoding device and picture encoding method, the enlarging/contracting means enlarges or contracts the second picture based on the difference in resolution between the first and second pictures, while the first picture encoding means predictively encodes the first picture using an output of the enlarging/contracting means as a reference picture. The position setting means sets the positions of the first picture and the second picture in a pre-set absolute coordinate system and outputs the first position information or the second position information on the position of the first or second picture, respectively. The position setting means sets the positions of the first and second pictures so that the position of the reference picture in the pre-set absolute coordinate system will be coincident with a pre-set position. The first picture encoding means recognizes the position of the first picture based on the first position information and recognizes the pre-set position as the position of the reference picture in order to perform predictive coding.

In accordance with another aspect of the present invention, a picture decoding device for decoding encoded data is provided which includes second picture decoding means for decoding a second picture (such as the lower layer decoding unit 95 shown in FIG. 29), enlarging/contracting means for enlarging/contracting the second picture decoded by the second picture decoding means based on the difference in resolution between the first and second pictures (such as the resolution converter 94 shown in FIG. 29), and first picture decoding means for decoding the first picture using an output of the enlarging/contracting means as a reference picture (such as the upper layer decoding unit 93 shown in FIG. 29). The encoded data includes first position information or second position information on the position of the first picture or the position of the second picture, respectively, in a pre-set absolute coordinate system, in which the position of the reference picture in the pre-set absolute coordinate system has been set so as to be coincident with a pre-set position. The first picture decoding means recognizes the position of the first picture based on the first position information and recognizes the pre-set position as the position of the reference picture in order to decode the first picture.

The above picture decoding device may include a display for displaying decoding results of the first picture decoding means (such as the monitor 74 shown in FIG. 27).

In accordance with another aspect of the present invention, a picture decoding device is provided which includes second picture decoding means for decoding a second picture (such as the lower layer decoding unit 95 shown in FIG. 29), enlarging/contracting means for enlarging/contracting the second picture decoded by the second picture decoding means based on the difference in resolution between first and second pictures (such as the resolution converter 94 shown in FIG. 29), and first picture decoding means for decoding the first picture using an output of the enlarging/contracting means as a reference picture (such as the upper layer decoding unit 93 shown in FIG. 29). The encoded data includes first position information or second position information on the position of the first picture or the position of the second picture in a pre-set absolute coordinate system in which the position of the reference picture in the pre-set coordinate system has been set so as to coincide with a pre-set position. The first picture decoding means is caused to recognize the position of the first picture based on the first position information and to recognize the pre-set position as the position of the reference picture in order to decode the first picture.

In accordance with the above picture decoding device and picture decoding method, the enlarging/contracting means enlarges or contracts the second picture decoded by the second picture decoding means based on the difference in resolution between the first and second pictures. If the encoded data includes the first position information or the second position information on the position of the first picture or on the position of the second picture, respectively, in a pre-set absolute coordinate system, in which the position of the reference picture in the pre-set absolute coordinate system has been set so as to be coincident with a pre-set position, the first picture decoding means recognizes the position of the first picture, based on the first position information, and recognizes the pre-set position as the position of the reference picture, in order to decode the first picture.

In accordance with another aspect of the present invention, a recording medium is provided which has recorded thereon encoded data including first data obtained on predictive encoding a first picture using, as a reference picture, enlarged or contracted results obtained on enlarging or contracting a second picture based on the difference in resolution between the first and second pictures, second data obtained on encoding the second picture, and first position information or second position information obtained on setting the positions of the first and second pictures in a pre-set absolute coordinate system. The first position information and the second position information have been set so that the position of the reference picture in the pre-set coordinate system will be coincident with a pre-set position.

In accordance with another aspect of the present invention, a recording method is provided for recording encoded data in which the encoded data includes first data obtained on predictive encoding a first picture using, as a reference picture, enlarged or contracted results obtained on enlarging or contracting a second picture based on the difference in resolution between the first and second pictures, second data obtained on encoding the second picture, and first position information or second position information obtained on setting the positions of the first and second pictures in a pre-set absolute coordinate system. The first position information and the second position information have been set so that the position of the reference picture in the pre-set absolute coordinate system will be coincident with a pre-set position.

In accordance with another aspect of the present invention, a picture encoding device is provided which includes first predictive coding means for predictive coding a picture (such as the lower layer encoding unit 25 shown in FIG. 15), local decoding means for locally decoding the results of predictive coding by the first predictive coding means (such as the lower layer encoding unit 25), second predictive coding means for predictive coding the picture using a locally decoded picture outputted by the local decoding means as a reference picture (such as the upper layer encoding unit 23 shown in FIG. 15), and multiplexing means for multiplexing the results of predictive coding by the first and second predictive coding means with only the motion vector used by the first predictive coding means in performing predictive coding (such as the multiplexer 26 shown in FIG. 15).

In accordance with another aspect of the present invention, a picture encoding method is provided which includes predictive coding a picture for outputting first encoded data, locally decoding the first encoded data, predictive coding the picture using a locally decoded picture obtained as a result of local decoding to output second encoded data, and multiplexing the first encoded data and the second encoded data only with the motion vector used for obtaining the first encoded data.

In accordance with the above picture encoding device and picture encoding method, a picture is predictively encoded to output first encoded data, the first encoded data is locally decoded and the picture is predictively encoded using, as a reference picture, a locally decoded picture obtained on local decoding to output second encoded data. The first and second encoded data are multiplexed using only the motion vector used for obtaining the first encoded data.

In accordance with another aspect of the present invention, a picture decoding device for decoding encoded data is provided which includes separating means for separating first and second data from the encoded data (such as a demultiplexer 91 shown in FIG. 29), first decoding means for decoding the first data (such as the lower layer decoding unit 95 shown in FIG. 29), and second decoding means for decoding the second data using an output of the first decoding means as a reference picture (such as the upper layer decoding unit 93 shown in FIG. 29). The encoded data includes only the motion vector used in predictive coding the first data. The second decoding means decodes the second data in accordance with the motion vector used in predictive coding the first data.

In accordance with another aspect of the present invention, a picture decoding device for decoding encoded data is provided which includes separating means for separating first and second data from the encoded data (such as the demultiplexer 91 shown in FIG. 29), first decoding means for decoding the first data (such as the lower layer decoding unit 95 shown in FIG. 29), and second decoding means for decoding the second data using an output of the first decoding means as a reference picture (such as the upper layer decoding unit 93 shown in FIG. 29). If the encoded data includes only the motion vector used in predictive coding the first data, the second decoding means is caused to decode the second data in accordance with the motion vector used in predictive coding the first data.

In accordance with the above picture decoding device and picture decoding method, the first decoding means decodes the first data and the second decoding means decodes the second data using an output of the first decoding means as a reference picture. If the encoded data includes only the motion vector used in predictive coding the first data, the second decoding means decodes the second data in accordance with the motion vector used in predictive coding the first data.

In accordance with another aspect of the present invention, a recording medium is provided which has recorded thereon encoded data which is obtained on predictive coding a picture for outputting first encoded data, locally decoding the first encoded data, predictive coding the picture using a locally decoded picture obtained as a result of local decoding to output second encoded data, and multiplexing the first encoded data and the second encoded data only with the motion vector used for obtaining the first encoded data.

In accordance with another aspect of the present invention, a method for recording encoded data is provided in which the encoded data is obtained on predictive coding a picture and outputting first encoded data, locally decoding the first encoded data, predictive coding the picture using a locally decoded picture obtained as a result of local decoding to output second encoded data, and multiplexing the first encoded data and the second encoded data only with the motion vector used for obtaining the first encoded data.

In accordance with another aspect of the present invention, a picture encoding device is provided wherein whether or not a macro-block is a skip macro-block is determined based on reference picture information specifying a reference picture used in encoding a macro-block of a B-picture by one of forward predictive coding, backward predictive coding or bidirectionally predictive coding.

In accordance with another aspect of the present invention, a picture encoding method is provided wherein whether or not a macro-block is a skip macro-block is determined based on reference picture information specifying a reference picture used in encoding a macro-block of a B-picture by one of forward predictive coding, backward predictive coding or bidirectionally predictive coding.

In accordance with another aspect of the present invention, a picture decoding device is provided wherein whether or not a macro-block is a skip macro-block is determined based on reference picture information specifying a reference picture used in encoding a macro-block of a B-picture by one of the forward predictive coding, backward predictive coding, or bidirectionally predictive coding.

In accordance with another aspect of the present invention, a picture decoding method is provided wherein whether or not a macro-block is a skip macro-block is determined based on reference picture information specifying a reference picture used in encoding a macro-block of a B-picture by one of the forward predictive coding, backward predictive coding, or bidirectionally predictive coding.

In accordance with another aspect of the present invention, a recording medium having recorded thereon encoded data is provided wherein a macro-block is a skip macro-block based on reference picture information specifying a reference picture used in encoding a macro-block of a B-picture by one of forward predictive coding, backward predictive coding, or bidirectionally predictive coding.

In accordance with another aspect of the present invention, a recording method for recording encoded data is provided in which a macro-block is a skip macro-block based on reference picture information specifying a reference picture used in encoding a macro-block of a B-picture by one of forward predictive coding, backward predictive coding or bidirectionally predictive coding.

In accordance with another aspect of the present invention, a picture processing device is provided in which a pre-set table used for variable length encoding or variable length decoding is modified in keeping with changes in size of a picture.

In accordance with another aspect of the present invention, a picture processing method is provided in which it is judged whether or not a picture is changed in size and a pre-set table used for variable length encoding or variable length decoding is modified in keeping with changes in size of the picture.

In accordance with another aspect of the present invention, a picture processing device is provided in which a pre-set table used for variable length encoding or variable length decoding is modified according to whether or not a picture of a layer different from, and of the same timing as, the picture being encoded has been used as a reference picture.

In accordance with another aspect of the present invention, a picture processing method is provided in which a pre-set table used for variable length encoding or variable length decoding is modified according to whether or not a picture of a layer different from, and of the same timing as, the picture being encoded has been used as a reference picture.

In accordance with another aspect of the present invention, a picture encoding device is provided in which a pre-set quantization step is quantized only if all of the results of quantization of pixel values in a pre-set block of a picture are not all of the same value.

The picture encoding device above for at least quantizing a picture by a pre-set quantization step includes multiplexing means for multiplexing the results of quantization of the picture and the pre-set quantization step (such as VLC unit 11 shown in FIGS. 22 and 23).
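
By way of illustration only, the conditional handling of the quantization step may be sketched in Python as follows. The sketch reads the condition in the sense used for the decoding aspects, namely that the encoded data contains the quantization step only if the quantization results of a block are not all of the same value. The function name multiplex_block, the dictionary field names, and the representation of a block as a flat list are assumptions made for this sketch and are not taken from the embodiment.

```python
def multiplex_block(quantized_block, quant_step):
    """Return the fields written for one block (hypothetical field names).

    The quantization step is included only when the quantization
    results in the block are not all of the same value.
    """
    fields = {"coefficients": list(quantized_block)}
    if len(set(quantized_block)) > 1:  # results are not all the same value
        fields["quant_step"] = quant_step
    return fields
```

For a block whose quantization results are all equal, the returned fields carry no quantization step, so no bits would be spent on it.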

In accordance with another aspect of the present invention, a picture encoding method is provided in which a pre-set quantization step is quantized only if all of the results of quantization of pixel values in a pre-set block of a picture are not all of the same value.

In accordance with another aspect of the present invention, a picture decoding device for decoding encoded data is provided in which the encoded data contains a pre-set quantization step only if all of the results of quantization of pixel values in a pre-set block of a picture are not all of the same value.

In accordance with another aspect of the present invention, a picture decoding method for decoding encoded data is provided in which the encoded data contains a pre-set quantization step only if all of the results of quantization of pixel values in a pre-set block of a picture are not all of the same value.

In accordance with another aspect of the present invention, a recording medium having encoded data recorded thereon is provided in which the encoded data contains a pre-set quantization step only if all of the results of quantization of pixel values in a pre-set block of a picture are not all of the same value.

In accordance with another aspect of the present invention, a recording method for recording encoded data is provided in which the encoded data contains a pre-set quantization step only if all of the results of quantization of pixel values in a pre-set block of a picture are not all of the same value.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a diagram of a conventional encoder;

FIG. 2 is a diagram of a conventional decoder;

FIG. 3 is a diagram of an example of an encoder for carrying out conventional scalable encoding;

FIG. 4 is a diagram of an illustrative structure of a lower layer encoding unit 202 of FIG. 3;

FIG. 5 is a diagram of an illustrative structure of an upper layer encoding unit 201 of FIG. 3;

FIG. 6 is a diagram of an example of a decoder for carrying out conventional scalable decoding;

FIG. 7 is a diagram of an illustrative structure of a lower layer decoding unit 232 of FIG. 6;

FIG. 8 is a diagram of an illustrative structure of an upper layer decoding unit 231 of FIG. 6;

FIG. 9 is a diagram to which reference will be made in explaining a conventional picture synthesis method;

FIG. 10 is a diagram to which reference will be made in explaining an encoding method which enables picture re-editing and re-synthesis;

FIG. 11 is a diagram to which reference will be made in explaining a decoding method which enables picture re-editing and re-synthesis;

FIG. 12 is a diagram of an encoder according to an embodiment of the present invention;

FIG. 13 is a diagram to which reference will be made in explaining how the VO position and size are changed with time;

FIG. 14 is a diagram of an illustrative structure of VOP encoding units 3.sub.1 to 3.sub.N of FIG. 12;

FIG. 15 is a diagram of another illustrative structure of VOP encoding units 3.sub.1 to 3.sub.N of FIG. 12;

FIGS. 16A and 16B are diagrams to which reference will be made in explaining spatial scalability;

FIGS. 17A and 17B are diagrams to which reference will be made in explaining spatial scalability;

FIGS. 18A and 18B are diagrams to which reference will be made in explaining spatial scalability;

FIGS. 19A and 19B are diagrams to which reference will be made in explaining spatial scalability;

FIGS. 20A and 20B are diagrams to which reference will be made in explaining a method for determining VOP size data and offset data;

FIGS. 21A and 21B are diagrams to which reference will be made in explaining a method for determining VOP size data and offset data;

FIG. 22 is a diagram of a lower layer encoding unit 25 of FIG. 15;

FIG. 23 is a diagram of an upper layer encoding unit 23 of FIG. 15;

FIGS. 24A and 24B are diagrams to which reference will be made in explaining spatial scalability;

FIGS. 25A and 25B are diagrams to which reference will be made in explaining spatial scalability;

FIGS. 26A and 26B illustrate referential select code (ref.sub.-- select.sub.-- code);

FIG. 27 is a diagram of a decoder according to an embodiment of the present invention;

FIG. 28 is a diagram of VOP decoding units 72.sub.1 to 72.sub.N;

FIG. 29 is a diagram of another illustrative structure of VOP decoding units 72.sub.1 to 72.sub.N;

FIG. 30 is a diagram of a lower layer decoding unit 95 of FIG. 29;

FIG. 31 is a diagram of an upper layer decoding unit 93 of FIG. 29;

FIG. 32 illustrates syntax of a bitstream obtained on scalable encoding;

FIG. 33 illustrates VS syntax;

FIG. 34 illustrates VO syntax;

FIG. 35 illustrates VOL syntax;

FIG. 36 illustrates VOP syntax;

FIG. 37 illustrates VOP syntax;

FIG. 38 shows variable length code of diff.sub.-- size.sub.-- horizontal and diff.sub.-- size.sub.-- vertical;

FIG. 39 shows variable length code of diff.sub.-- VOP.sub.-- horizontal.sub.-- ref and diff.sub.-- VOP.sub.-- vertical.sub.-- ref;

FIGS. 40A and 40B illustrate macro-block syntax;

FIGS. 41A and 41B illustrate MODV variable length code;

FIG. 42 illustrates a macro-block;

FIGS. 43A and 43B show variable length code of MBTYPE;

FIG. 44 illustrates predictive coding by a direct mode;

FIG. 45 illustrates predictive coding of a B-picture of an upper layer;

FIGS. 46A and 46B are diagrams to which reference will be made in explaining a quasi-direct mode;

FIG. 47 is a flowchart to which reference will be made in explaining a method for determining a variable length table used for a lower layer;

FIG. 48 is a flowchart to which reference will be made in explaining a method for determining a variable length table used for an upper layer;

FIG. 49 is a flowchart to which reference will be made in explaining processing for a skip macro-block of a lower layer;

FIG. 50 is a flowchart to which reference will be made in explaining processing for a skip macro-block of an upper layer;

FIGS. 51A to 51C illustrate processing for a skip macro-block; and

FIG. 52 is a flowchart to which reference will be made in explaining processing for the quantization step DQUANT.

DESCRIPTION OF THE PREFERRED EMBODIMENTS

FIG. 12 illustrates an encoder according to an embodiment of the present invention. In this encoder, picture data for encoding are input to a VO (video object) constructing unit 1, which extracts each object of the picture supplied thereto to construct a VO. The VO constructing unit 1 may generate a key signal for each VO and may output the generated key signal along with the associated VO signal to VOP (video object plane) constructing units 2.sub.1 to 2.sub.N. That is, if N VOs (VO1 to VON) are constructed in the VO constructing unit 1, the N VOs are outputted to the VOP constructing units 2.sub.1 to 2.sub.N along with the associated key signals. More specifically, assume that the picture data for encoding include a background F1, a foreground F2, and a key signal K1, and that a synthesized picture can be generated therefrom by use of a chroma key. In this situation, the VO constructing unit 1 may output the foreground F2 as VO1 and the key signal K1 as the key signal for VO1 to the VOP constructing unit 2.sub.1, and may output the background F1 as VO2 to the VOP constructing unit 2.sub.2. A key signal may not be required for the background and, as such, is not generated or outputted.

If the picture data for encoding contains no key signal, for example if the picture data for encoding is a previously synthesized picture, the picture is divided in accordance with a pre-set algorithm to extract one or more areas and to generate a key signal associated with each extracted area. The VO constructing unit 1 treats the sequence of each extracted area as a VO, and this sequence is outputted along with the generated key signal to the associated VOP constructing unit 2.sub.n, where n=1, 2, . . . , N.

The VOP constructing unit 2.sub.n constructs a VO plane (VOP) from the output of the VO constructing unit 1 such that the number of horizontal pixels and the number of vertical pixels are each a multiple of a predetermined value, such as 16. Once a VOP is constructed, the VOP constructing unit 2.sub.n outputs the VOP, along with a key signal for extracting picture data of an object portion contained in the VOP (such as luminance or chroma signals), to a VOP encoding unit 3.sub.n (where n=1, 2, . . . , N). This key signal is supplied from the VO constructing unit 1, as described above. The VOP constructing unit 2.sub.n also detects size data (VOP size), which represents the size (such as the longitudinal length and the transverse length) of a VOP, and offset data (VOP offset), which represents the position of the VOP in the frame (for example, coordinates with the upper left point of the frame as the point of origin), and supplies such data to the VOP encoding unit 3.sub.n.
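
By way of illustration only, the determination of the VOP size and VOP offset from a key signal may be sketched in Python as follows. The function name vop_rectangle, the representation of the key signal as rows of 0/1 samples, and the choice to grow the rectangle rightward and downward until each dimension is a multiple of 16 are assumptions made for this sketch; the embodiment does not prescribe how the rectangle is padded.

```python
def vop_rectangle(key, multiple=16):
    """Return ((x, y), (width, height)) for the object marked by `key`.

    `key` is a list of rows of 0/1 samples covering the whole frame.
    The offset is measured from the upper left point of the frame, and
    the width and height are rounded up to a multiple of `multiple`.
    """
    rows = [y for y, row in enumerate(key) if any(row)]
    cols = [x for x in range(len(key[0])) if any(row[x] for row in key)]
    if not rows:
        return (0, 0), (0, 0)  # no object present (assumed convention)
    x0, y0 = cols[0], rows[0]
    width = cols[-1] - x0 + 1
    height = rows[-1] - y0 + 1

    def pad(n):
        # Round n up to the next multiple of `multiple`.
        return -(-n // multiple) * multiple

    return (x0, y0), (pad(width), pad(height))
```

In this sketch an object occupying 20 by 10 pixels yields a VOP size of 32 by 16, and the padded rectangle may extend past the frame boundary, which a practical implementation would have to handle.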

The VOP encoding unit 3.sub.n encodes an output of the VOP constructing unit 2.sub.n in accordance with a predetermined standard, such as an MPEG or H.263 standard, and outputs the resulting bitstream to a multiplexing unit 4. The multiplexing unit 4 multiplexes the bitstreams from the VOP encoding units 3.sub.1 to 3.sub.N and transmits the resulting multiplexed data as a ground wave or via a satellite network, CATV network or similar transmission path 5, or records the multiplexed data in a recording medium 6 (such as a magnetic disc, magneto-optical disc, an optical disc, a magnetic tape or the like).

VO and VOP will now be further explained.

VO may be a sequence of respective objects making up a synthesized picture in case there is a sequence of pictures for synthesis, while VOP is a VO at a given time point. That is, if there is a synthesized picture F3 synthesized from pictures F1 and F2, the pictures F1 or F2 arrayed chronologically are each a VO, while the pictures F1 or F2 at a given time point are each a VOP. Therefore, a VO may be a set of VOPs of the same object at different time points.

If the picture F1 is the background and the picture F2 is the foreground, the synthesized picture F3 is obtained by synthesizing pictures F1 and F2 using a key signal for extracting the picture F2. In this situation, the VOP of the picture F2 includes not only picture data constituting the picture F2 (luminance and chroma signals) but also the associated key signals.
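
By way of illustration only, the synthesis of the picture F3 from the pictures F1 and F2 using a key signal may be sketched in Python as follows. The function name synthesize, the use of a binary (0/1) key, and the assumption that all three inputs have the same dimensions are simplifications made for this sketch.

```python
def synthesize(background, foreground, key):
    """Combine F1 (background) and F2 (foreground) into F3.

    Where the key sample is non-zero the foreground pixel is taken,
    otherwise the background pixel is kept.
    """
    return [
        [f if k else b for b, f, k in zip(brow, frow, krow)]
        for brow, frow, krow in zip(background, foreground, key)
    ]
```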

Although the sequence of picture frames (screen frame) may not be changed in size or position, the VO may be changed in size and/or position. That is, the VOPs making up the same VO may be changed with time in size and/or position. For example, FIG. 13 shows a synthesized picture made up of a picture F1 as the background and a picture F2 as the foreground. The picture F1 is a photographed landscape in which a sequence of the entire picture represents a VO (termed VO0), and the picture F2 is a photographed walking person in which a sequence of a minimum rectangle encircling the person represents a VO (termed VO1). In this example, VO0 (which is a landscape) basically does not change in position or size, as with a usual picture or screen frame. On the other hand, VO1 (which is a picture of a person) changes in size or position as he or she moves towards the front or back of the drawing. Therefore, although FIG. 13 shows VO0 and VO1 at the same time point, the position and size of the two may not necessarily be the same. As a result, the VOP encoding unit 3.sub.n (FIG. 12) provides in its output bitstream not only data of the encoded VOP but also information pertaining to the position (coordinates) and size of the VOP in a pre-set absolute coordinate system. FIG. 13 illustrates a vector OST0 which specifies the position of VO0 (VOP) at a given time point and a vector OST1 which specifies the position of VO1 (VOP) at the same time point.

FIG. 14 illustrates a basic structure of the VOP encoding unit 3.sub.n of FIG. 12. As shown in FIG. 14, the picture signal (picture data) from the VOP constructing unit 2.sub.n (luminance signals and chroma signals making up a VOP) is supplied to a picture signal encoding unit 11, which may be similarly constructed to the above encoder of FIG. 1, wherein the VOP is encoded in accordance with a system conforming to the MPEG or H.263 standard. Motion and texture information, obtained on encoding the VOP by the picture signal encoding unit 11, is supplied to a multiplexer 13. As further shown in FIG. 14, the key signal from the VOP constructing unit 2.sub.n is supplied to a key signal encoding unit 12 where it is encoded by, for example, differential pulse code modulation (DPCM). The key signal information obtained from the encoding by the key signal encoding unit 12 is also supplied to the multiplexer 13. In addition to the outputs of the picture signal encoding unit 11 and the key signal encoding unit 12, the multiplexer 13 also receives size data (VOP size) and offset data (VOP offset) from the VOP constructing unit 2.sub.n. The multiplexer 13 multiplexes the received data and outputs multiplexed data to a buffer 14 which transiently stores such output data and smooths the data volume so as to output smoothed data.
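
By way of illustration only, DPCM encoding of a key signal may be sketched in Python as follows. The sketch assumes a first-order predictor (the previous sample, starting from zero) and no quantization of the differences, so that decoding is lossless; the embodiment does not specify the predictor, and the function names dpcm_encode and dpcm_decode are hypothetical.

```python
def dpcm_encode(samples):
    """Replace each sample by its difference from the previous sample."""
    previous = 0  # assumed initial prediction
    residuals = []
    for sample in samples:
        residuals.append(sample - previous)
        previous = sample
    return residuals


def dpcm_decode(residuals):
    """Rebuild the samples by accumulating the differences."""
    previous = 0
    samples = []
    for residual in residuals:
        previous += residual
        samples.append(previous)
    return samples
```

Because a key signal is typically constant over long runs, most of the differences are zero, which suits subsequent variable length encoding.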

The key signal encoding unit 12 may perform not only DPCM but also motion compensation of the key signal in accordance with a motion vector detected by, for example, predictive coding carried out by the picture signal encoding unit 11 in order to calculate a difference from the key signal temporally before or after the motion compensation for encoding the key signal. Further, the data volume of the encoding result of the key signal in the key signal encoding unit 12 (buffer feedback) can be supplied to the picture signal encoding unit 11. A quantization step may be determined in the picture signal encoding unit 11 from such received data volume.

FIG. 15 illustrates a structure of the VOP encoding unit 3.sub.n of FIG. 12 which is configured for realization of scalability. As shown in FIG. 15, the VOP picture data from the VOP constructing unit 2.sub.n, its key signal, size data (VOP size) and offset data (VOP offset) are all supplied to a picture layering unit 21 which generates picture data of plural layers, that is, layers the VOPs. More specifically, in encoding the spatial scalability, the picture layering unit 21 may output the picture data and the key signal supplied thereto directly as picture data and key signals of an upper layer (upper order hierarchy) while thinning out pixels constituting the picture data and the key signals for lowering resolution in order to output the resulting picture data and the key signals of a lower layer (lower hierarchical order). The input VOP may also be lower layer data, while its resolution may be raised (its number of pixels may be increased) so as to be upper layer data.
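
By way of illustration only, the layering for spatial scalability may be sketched in Python as follows. The sketch assumes that thinning out keeps every second pixel in each direction with no pre-filtering; the function names thin_out and layer_spatial and the parameter step are hypothetical.

```python
def thin_out(plane, step=2):
    """Keep every `step`-th sample horizontally and vertically."""
    return [row[::step] for row in plane[::step]]


def layer_spatial(picture, key, step=2):
    """Return (upper, lower) layers, each as a (picture, key) pair.

    The upper layer is the input as supplied; the lower layer is the
    thinned-out, lower resolution version of the same data.
    """
    upper = (picture, key)
    lower = (thin_out(picture, step), thin_out(key, step))
    return upper, lower
```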

A further description of the above-mentioned scalability operation will be provided. In this description, only two layers are utilized and described, although the number of layers may be three or more.

In the case of encoding of temporal scalability, the picture layering unit 21 may output the picture signals and the key signals alternately as upper layer data or lower layer data depending on time points. If the VOPs making up a VO are entered in the sequence of VOP0, VOP1, VOP2, VOP3, . . . , to the picture layering unit 21, the latter outputs the VOPs VOP0, VOP2, VOP4, VOP6, . . . , as lower layer data, while outputting VOPs VOP1, VOP3, VOP5, VOP7, . . . , as upper layer data. In temporal scalability, the thinned-out VOPs may simply be the lower layer data and upper layer data, and the picture data are neither enlarged nor contracted, that is, resolution conversion is not performed, although such resolution conversion can be performed.
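
By way of illustration only, the alternate assignment of VOPs to the two layers may be sketched in Python as follows; the function name layer_temporal is hypothetical.

```python
def layer_temporal(vops):
    """Split a VOP sequence into (lower, upper) layers.

    Even-numbered VOPs (VOP0, VOP2, ...) form the lower layer and
    odd-numbered VOPs (VOP1, VOP3, ...) form the upper layer.
    """
    return vops[0::2], vops[1::2]
```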

In the case of encoding with SNR (signal to noise ratio) scalability, input picture signals and key signals are directly outputted as upper layer data or lower layer data. That is, in this case, the input picture signals and key signals of the upper and lower layers may be the same data.

The following three types of spatial scalability may occur in the case of encoding on a VOP basis.

If a synthesized picture made up of the pictures F1 and F2 shown in FIG. 13 is supplied as VOP, the first spatial scalability is to turn the input VOP in its entirety into an upper layer (enhancement layer) as shown in FIG. 16A, while turning the VOP contracted in its entirety to a lower layer (base layer) as shown in FIG. 16B.

The second spatial scalability is to extract an object constituting a portion of the input VOP corresponding to a picture F2 and to turn it into an upper layer as shown in FIG. 17A, while turning the VOP in its entirety into a lower layer (base layer) as shown in FIG. 17B. This extraction may be performed in the VOP constructing unit 2.sub.n so that an object extracted in this manner may be thought of as a VOP.

The third spatial scalability is to extract objects (VOPs) constituting the input VOP so as to generate an upper layer and a lower layer on a VOP basis, as shown in FIGS. 18A, 18B, 19A, and 19B. In FIGS. 18A and 18B, the upper and lower layers are generated from the background (picture F1) constituting the VOP of FIG. 13; while in FIGS. 19A and 19B, the upper and lower layers are generated from the foreground (picture F2) constituting the VOP of FIG. 13.

A desired type of spatial scalability may be selected or pre-determined from among the above-described three types, such that the picture layering unit 21 layers the VOPs for enabling the encoding by the pre-set scalability.

From the size data and offset data of the VOPs supplied to the picture layering unit 21 (sometimes referred to herein as initial size data and initial offset data, respectively), the picture layering unit 21 calculates (sets) offset data and size data specifying the position and size in a pre-set absolute coordinate system of the generated lower layer and upper layer VOPs, respectively.

The manner of setting the offset data (position information) and the size data of the upper and lower layers is explained with reference to the above-mentioned second scalability (FIGS. 17A and 17B). In this case, offset data FPOS.sub.-- B of the lower layer is set so that, if picture data of the lower layer is enlarged (interpolated) based on the resolution and difference in resolution from the upper layer, that is if the picture of the lower layer is enlarged with an enlarging ratio (multiplying factor FR), the offset data in the absolute coordinate system of the enlarged picture will be coincident with the initial offset data. The enlarging ratio is a reciprocal of the contraction ratio by which the upper layer picture is contracted to generate a picture of the lower layer. Similarly, size data FSZ.sub.-- B of the lower layer is set so that the size data of the enlarged picture obtained on enlarging the picture of the lower layer by the multiplying factor FR will be coincident with the initial size data. On the other hand, offset data FPOS.sub.-- E of the upper layer is set to a value of a coordinate such as, for example, that of the upper left apex of a 16-tupled minimum rectangle (VOP) surrounding an object extracted from the input VOP, as found based on the initial offset data, as shown in FIG. 20B. Additionally, size data FSZ.sub.-- E of the upper layer may be set to the transverse length and the longitudinal length of a 16-tupled minimum rectangle (VOP) surrounding an object extracted from the input VOP.

Therefore, if the offset data FPOS.sub.-- B and the size data FSZ.sub.-- B of the lower layer are converted in accordance with the multiplying factor FR, a picture frame of a size corresponding to the converted size data FSZ.sub.-- B may be thought of at a position corresponding to the converted offset data FPOS.sub.-- B in the absolute coordinate system, an enlarged picture obtained on multiplying the lower layer picture data by FR may be arranged as shown in FIG. 20A and the picture of the upper layer may be similarly arranged in accordance with the offset data FPOS.sub.-- E and size data FSZ.sub.-- E of the upper layer in the absolute coordinate system (FIG. 20B), in which associated pixels of the enlarged picture and of the upper layer picture are in a one-for-one relationship. That is, in this case, the person in the upper layer picture is at the same position as the person in the enlarged picture, as shown in FIGS. 20A and 20B.
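
By way of illustration only, the relationship between the lower layer data, the multiplying factor FR, and the upper layer data may be sketched in Python as follows. The names fpos_b, fsz_b, and fpos_e stand for FPOS.sub.-- B, FSZ.sub.-- B, and FPOS.sub.-- E; the function names are hypothetical, and the sketch assumes that conversion in accordance with FR is a plain multiplication with no rounding rule.

```python
def lower_layer_geometry(initial_offset, initial_size, fr):
    """Choose FPOS_B and FSZ_B so that enlarging by FR restores the
    initial offset data and initial size data."""
    fpos_b = (initial_offset[0] / fr, initial_offset[1] / fr)
    fsz_b = (initial_size[0] / fr, initial_size[1] / fr)
    return fpos_b, fsz_b


def convert_by_fr(fpos_b, fsz_b, fr):
    """Position and size of the enlarged lower layer picture in the
    absolute coordinate system (assumed: multiply by FR)."""
    return ((fpos_b[0] * fr, fpos_b[1] * fr),
            (fsz_b[0] * fr, fsz_b[1] * fr))


def enlarged_pixel(upper_pixel, fpos_e, fpos_b, fr):
    """Coordinate, local to the enlarged picture, of the pixel lying
    at the same absolute position as `upper_pixel` of the upper layer."""
    abs_x = fpos_e[0] + upper_pixel[0]
    abs_y = fpos_e[1] + upper_pixel[1]
    return (abs_x - fpos_b[0] * fr, abs_y - fpos_b[1] * fr)
```

With an initial offset of (32, 16), an initial size of (352, 288), and FR equal to 2, the lower layer receives the offset (16, 8) and the size (176, 144), and conversion by FR returns the initial values.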

In using the first and third types of scalability, the offset data FPOS.sub.-- B or FPOS.sub.-- E and size data FSZ.sub.-- B and FSZ.sub.-- E are determined so that associated pixels of the lower layer enlarged picture and the upper layer picture will be arranged at the same positions in the absolute coordinate system.

The offset data FPOS.sub.-- B, FPOS.sub.-- E and size data FSZ.sub.-- B, FSZ.sub.-- E may be determined as follows. That is, the offset data FPOS.sub.-- B of the lower layer may be determined so that the offset data of the enlarged picture of the lower layer will be coincident with a pre-set position in the absolute coordinate system such as the point of origin, as shown in FIG. 21A. On the other hand, the offset data FPOS.sub.-- E of the upper layer is set to a value of a coordinate, such as the upper left apex of a 16-tupled minimum rectangle (VOP) surrounding an object extracted from the input VOP as found based on the initial offset data, less the initial offset data, as shown for example in FIG. 21B. In FIGS. 21A and 21B, the size data FSZ.sub.-- B of the lower layer and the size data FSZ.sub.-- E of the upper layer may be set in a manner similar to that explained with reference to FIGS. 20A and 20B.

When the offset data FPOS.sub.-- B and FPOS.sub.-- E are set as described above, associated pixels making up the enlarged picture of the lower layer and the picture of the upper layer are arrayed at the associated positions in the absolute coordinate system.
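
By way of illustration only, the origin-based setting of the offset data described with reference to FIGS. 21A and 21B may be sketched in Python as follows. The function name origin_based_offsets and the parameter object_corner (the upper left apex of the rectangle surrounding the extracted object, expressed in the same coordinates as the initial offset data) are assumptions made for this sketch.

```python
def origin_based_offsets(initial_offset, object_corner):
    """Return (FPOS_B, FPOS_E) when the enlarged lower layer picture
    is placed at the point of origin of the absolute coordinate system."""
    # An enlarged picture at the origin implies a lower layer offset of
    # zero, whatever the multiplying factor FR is.
    fpos_b = (0, 0)
    # The upper layer offset is the object corner less the initial offset.
    fpos_e = (object_corner[0] - initial_offset[0],
              object_corner[1] - initial_offset[1])
    return fpos_b, fpos_e
```

Subtracting the initial offset data from both layers preserves their relative placement, so associated pixels still coincide in the absolute coordinate system.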

Returning to FIG. 15, picture data, key signals, offset data FPOS.sub.-- E, and size data FSZ.sub.-- E of the upper layer generated in the picture layering unit 21 are supplied to a delay circuit 22 so as to be delayed thereat by an amount corresponding to a processing time in a lower layer encoding unit 25 as later explained. Output signals from the delay circuit 22 are supplied to the upper layer encoding unit 23. The picture data, key signals, offset data FPOS.sub.-- B, and size data FSZ.sub.-- B of the lower layer are supplied to a lower layer encoding unit 25. The multiplying factor FR is supplied via the delay circuit 22 to the upper layer encoding unit 23 and to a resolution converter 24.

The lower layer encoding unit 25 encodes the picture data (second picture) and key signals of the lower layer. Offset data FPOS.sub.-- B and size data FSZ.sub.-- B are contained in the resulting encoded data (bitstream) which is supplied to a multiplexer 26. The lower layer encoding unit 25 locally decodes the encoded data and outputs the resulting locally decoded picture data of the lower layer to the resolution converter 24. The resolution converter 24 enlarges or contracts the picture data of the lower layer received from the lower layer encoding unit 25 in accordance with the multiplying factor FR so as to revert the same to the original size. The resulting picture, which may be an enlarged picture, is outputted to the upper layer encoding unit 23.
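
By way of illustration only, the enlargement performed by the resolution converter 24 may be sketched in Python as follows. The sketch assumes an integer multiplying factor and simple pixel replication; the embodiment speaks of interpolation without fixing a particular filter, and the function name enlarge is hypothetical.

```python
def enlarge(picture, fr):
    """Enlarge `picture` by the integer factor `fr` in both directions
    by repeating each pixel (nearest neighbour replication)."""
    enlarged = []
    for row in picture:
        wide_row = [pixel for pixel in row for _ in range(fr)]
        for _ in range(fr):
            enlarged.append(list(wide_row))
    return enlarged
```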

The upper layer encoding unit 23 encodes picture data (first picture) and key signals of the upper layer. Offset data FPOS.sub.-- E and size data FSZ.sub.-- E are contained in the resulting encoded data (bitstream) which is supplied to the multiplexer 26. The upper layer encoding unit 23 encodes the picture data using the enlarged picture supplied from the resolution converter 24.

The lower layer encoding unit 25 and the upper layer encoding unit 23 are supplied with size data FSZ.sub.-- B, offset data FPOS.sub.-- B, a motion vector MV, and a flag COD. The upper layer encoding unit 23 refers to or utilizes such data or information as appropriate or needed during processing, as will be more fully hereinafter described.

The multiplexer 26 multiplexes the outputs from the upper layer encoding unit 23 and the lower layer encoding unit 25 and supplies therefrom the resulting multiplexed signal.

FIG. 22 illustrates an example of the lower layer encoding unit 25. In FIG. 22, parts or components corresponding to those in FIG. 1 are depicted by the same reference numerals. That is, the lower layer encoding unit 25 is similarly constructed to the encoder of FIG. 1 except for a newly provided key signal encoding unit 43 and key signal decoding unit 44.

In the lower layer encoding unit 25 of FIG. 22, picture data from the picture layering unit 21 (FIG. 15), that is, VOPs of the lower layer, are supplied to and stored in a frame memory 31. A motion vector may then be detected on a macro-block basis in a motion vector detector 32. The motion vector detector 32 receives the offset data FPOS.sub.-- B and the size data FSZ.sub.-- B of the lower-layer VOP, and detects the motion vector of the macro-block based on such data. Since the size and the position of the VOP change with time (frame), in detecting the motion vector, a coordinate system should be set as a reference for detection and the motion should be detected in that coordinate system. To this end, the above-mentioned absolute coordinate system may be used as a reference coordinate system for the motion vector detector 32, and the VOP for encoding and the VOP serving as the reference picture may be arranged in the absolute coordinate system for detecting the motion vector.
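
By way of illustration only, motion vector detection with both VOPs arranged in the absolute coordinate system may be sketched in Python as follows. The sketch uses an exhaustive search minimizing the sum of absolute differences, treats pixels outside a VOP as zero, and ignores the key signal; these choices, and the names sample and find_motion_vector, are assumptions made for this sketch rather than the method of the embodiment.

```python
def sample(vop, offset, x, y, default=0):
    """Pixel of `vop` at absolute coordinate (x, y), or `default` if the
    coordinate falls outside the VOP placed at `offset`."""
    local_x, local_y = x - offset[0], y - offset[1]
    if 0 <= local_y < len(vop) and 0 <= local_x < len(vop[0]):
        return vop[local_y][local_x]
    return default


def find_motion_vector(cur, cur_off, ref, ref_off, bx, by, size=16, search=2):
    """Best (dx, dy) for the block whose upper left absolute coordinate
    is (bx, by), searching +/- `search` pixels in the reference VOP."""
    best, best_sad = (0, 0), None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            sad = sum(
                abs(sample(cur, cur_off, bx + i, by + j)
                    - sample(ref, ref_off, bx + i + dx, by + j + dy))
                for j in range(size) for i in range(size)
            )
            if best_sad is None or sad < best_sad:
                best, best_sad = (dx, dy), sad
    return best
```

Because every pixel is addressed through its offset data, a change in VOP position between frames appears as an ordinary displacement even when the pixel contents of the two VOPs are identical.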

The motion vector detector 32 receives a decoded key signal from the key signal decoding unit 44 obtained by encoding the key signal of the lower layer and decoding the result of encoding. The motion vector detector 32 extracts an object from the VOP by utilizing the decoded key signal so as to detect the motion vector. The decoded key signal is used in place of the original key signal (key signal before encoding) for extracting the object because a decoded key signal is used on the receiving side.

Meanwhile, the detected motion vector (MV) is supplied along with the prediction mode not only to the VLC unit 36 and the motion compensator 42 but also to the upper layer encoding unit 23 (FIG. 15).

For motion compensation, the motion should be detected in the reference coordinate system in a manner similar to that described above. Thus, size data FSZ.sub.-- B and offset data FPOS.sub.-- B are supplied to the motion compensator 42, which also receives a decoded key signal from the key signal decoding unit 44 for the same reason as set forth in connection with the motion vector detector 32.

The VOP, the motion vector of which has been detected, is quantized as in FIG. 1, and the resulting quantized data is supplied to the VLC unit 36. The VLC unit 36 receives not only the quantized data, quantization step, motion vector and the prediction mode, but also the offset data FPOS.sub.-- B and size data FSZ.sub.-- B from the picture layering unit 21 (FIG. 15), so that this data may also be variable length encoded. The VLC unit 36 also receives the encoded key signals (bitstream of the key signal) from the key signal encoding unit 43, so that the encoded key signals are also encoded with variable length encoding. That is, the key signal encoding unit 43 encodes the key signals from the picture layering unit 21 as explained with reference to FIG. 14. The encoded key signals are outputted to the VLC unit 36 and the key signal decoding unit 44. The key signal decoding unit 44 decodes the encoded key signals and outputs the decoded key signal to the motion vector detector 32, the motion compensator 42, and the resolution converter 24 (FIG. 15).

The key signal encoding unit 43 is supplied not only with the key signals of the lower layer but also with the size data FSZ.sub.-- B and offset data FPOS.sub.-- B, so that, similarly to the motion vector detector 32, the key signal encoding unit 43 recognizes the position and the range of the key signals in the absolute coordinate system based on such data.

The VOP, the motion vector of which has been detected, is encoded as described above and locally decoded as in FIG. 1 for storage in a frame memory 41. The decoded picture may be used as a reference picture in a manner as described above and outputted to the resolution converter 24.

Unlike MPEG1 and 2, MPEG4 may also use a B-picture as a reference picture, so that the B-picture is also locally decoded and stored in the frame memory 41. However, at the present time, the B-picture may be used as a reference picture only for the upper layer.

The VLC unit 36 checks the macro-blocks of the I-, P- and B-pictures as to whether or not these macro-blocks should be turned into skip macro-blocks, and sets flags COD and MODB in accordance with the results thereof. The flags COD and MODB are similarly variable length encoded for transmission. The flag COD is also supplied to the upper layer encoding unit 23.

FIG. 23 illustrates a structure of the upper layer encoding unit 23 of FIG. 15. In FIG. 23, parts or components corresponding to those shown in FIGS. 1 and 22 are depicted by the same reference numerals. That is, the upper layer encoding unit 23 is similarly constructed to the lower layer encoding unit 25 of FIG. 22 or to the encoder of FIG. 1 except for having a key signal encoding unit 51, a frame memory 52, and a key signal decoding unit 53 as new units.

In the upper layer encoding unit 23 of FIG. 23, picture data from the picture layering unit 21 (FIG. 15), that is, the VOP of the upper layer, are supplied to the frame memory 31, as in FIG. 1, for detecting the motion vector on a macro-block basis in the motion vector detector 32. In addition to the upper layer VOP, the motion vector detector 32 receives size data FSZ.sub.-- E and offset data FPOS.sub.-- E, in a manner similar to that in FIG. 22, and receives the decoded key signal from the key signal decoding unit 53. The motion vector detector 32 recognizes the arraying position of the VOP of the upper layer in the absolute coordinate system based on the size data FSZ.sub.-- E and the offset data FPOS.sub.-- E, as in the above case, and extracts the object contained in the VOP based on the decoded key signals so as to detect the motion vector on a macro-block basis.

The motion vector detector 32 in the upper layer encoding unit 23 and in the lower layer encoding unit 25 processes the VOP in a pre-set sequence as explained with reference to FIG. 1. This sequence may be set as follows.

In the case of spatial scalability, the upper or lower layer VOP may be processed in the sequence of P, B, B, B, . . . , or I, P, P, P, . . . , as shown in FIGS. 24A or 24B, respectively. In the upper layer, the P-picture as the first VOP of the upper layer is encoded in this case using the VOP of the lower layer at the same time point, herein an I-picture, as a reference picture. The B-pictures, which are the second and following VOPs of the upper layer, are encoded using the directly previous VOP of the upper layer and the VOP of the lower layer at the same time point as the reference pictures. Similarly to the P-pictures of the lower layer, the B-pictures of the upper layer are used as reference pictures in encoding the other VOPs. The lower layer is encoded as in the case of MPEG1 or 2 or in H.263.

The SNR scalability may be considered as being equivalent to the spatial scalability wherein the multiplying factor FR is equal to unity, whereupon it may be treated in a manner similar to that of the spatial scalability described above.

In the case of using temporal scalability, that is, if the VO is made up of VOP0, VOP1, VOP2, VOP3, . . . with VOP1, VOP3, VOP5, VOP7, . . . being upper layers (FIG. 25A) and VOP0, VOP2, VOP4, VOP6, . . . being lower layers (FIG. 25B), the VOPs of the upper and lower layers may be processed in the sequence of B, B, B, . . . or I, P, P, . . . , as shown in FIGS. 25A and 25B, respectively. In this case, the first VOP1 (B-picture) of the upper layer may be encoded using VOP0 (I-picture) and VOP2 (P-picture) of the lower layer as reference pictures. The second VOP3 (B-picture) of the upper layer may be encoded using the upper layer VOP1 just encoded as a B-picture and VOP4 (P-picture) of the lower layer which is the picture at the next timing (frame) to the VOP3 as reference pictures. Similarly to VOP3, the third VOP5 of the upper layer (B-picture) may be encoded using VOP3 of the upper layer just encoded as the B-picture and also VOP6 (P-picture) of the lower layer which is the picture (frame) next in timing to the VOP5.
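The reference structure of this temporal scalability example can be summarized in a short sketch. It merely restates the pattern given for VOP1, VOP3 and VOP5 and assumes, as an extrapolation, that the same pattern continues for later upper layer VOPs; the function names are hypothetical.

```python
def split_layers(num_vops):
    """Odd-numbered VOPs form the upper layer, even-numbered the lower layer."""
    upper = [i for i in range(num_vops) if i % 2 == 1]
    lower = [i for i in range(num_vops) if i % 2 == 0]
    return upper, lower


def upper_layer_references(vop_index):
    """Return the two reference pictures of an upper layer B-picture as
    (layer, VOP index) pairs, following the example in the text."""
    if vop_index < 1 or vop_index % 2 == 0:
        raise ValueError("upper layer VOPs have odd indices in this example")
    if vop_index == 1:
        # The first upper layer VOP has no earlier upper layer VOP, so both
        # references come from the lower layer (VOP0 and VOP2).
        return (("lower", 0), ("lower", 2))
    # Later VOPs: the upper layer VOP just encoded, and the lower layer VOP
    # at the next timing.
    return (("upper", vop_index - 2), ("lower", vop_index + 1))


print(split_layers(8))            # ([1, 3, 5, 7], [0, 2, 4, 6])
print(upper_layer_references(3))  # (('upper', 1), ('lower', 4))
```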

As described above, the VOP of the other layer, herein the lower layer (scalable layer), may be used as a reference picture for encoding. That is, if, for predictive coding of an upper layer VOP, a VOP of the other layer is used as a reference picture (that is, a VOP of the lower layer is used as a reference picture for predictive encoding of a VOP of the upper layer), the motion vector detector 32 of the upper layer encoding unit 23 (FIG. 23) sets and outputs a flag specifying such use. For example, the flag (ref.sub.-- layer.sub.-- id) may specify a layer to which the VOP used as a reference picture belongs if there are three or more layers. Additionally, the motion vector detector 32 of the upper layer encoding unit 23 is adapted for setting and outputting a flag ref.sub.-- select.sub.-- code (reference picture information) in accordance with the flag ref.sub.-- layer.sub.-- id for the VOP. The flag ref.sub.-- select.sub.-- code specifies which layer VOP can be used as a reference picture in executing forward predictive coding or backward predictive coding.

FIGS. 26A and 26B specify values for a flag ref.sub.-- select.sub.-- code for a P- and B-picture.

As shown in FIG. 26A, if, for example, a P-picture of an upper layer (enhancement layer) is encoded using as a reference picture a VOP decoded (locally decoded) directly previously and which belongs to the same layer as the P-picture of the upper layer, the flag ref.sub.-- select.sub.-- code is set to `00`. Also, if a P-picture is encoded using as a reference picture a VOP displayed directly previously and which belongs to a layer different from the layer of the P-picture, the flag ref.sub.-- select.sub.-- code is set to `01`. If the P-picture is encoded using as a reference picture a VOP displayed directly subsequently and which belongs to a different layer, the flag ref.sub.-- select.sub.-- code is set to `10`. If the P-picture is encoded using as a reference picture a concurrent or coincident VOP belonging to a different layer, the flag ref.sub.-- select.sub.-- code is set to `11`.
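The P-picture assignments just described can be written as a small lookup table. The sketch below only restates the four cases in the text; the dictionary and function names are hypothetical, and the strings are informal labels rather than syntax elements.

```python
# ref_select_code values for a P-picture of the upper layer, as described
# with reference to FIG. 26A: code -> (layer of reference VOP, its timing).
P_REF_SELECT_CODE = {
    "00": ("same layer", "decoded directly previously"),
    "01": ("different layer", "displayed directly previously"),
    "10": ("different layer", "displayed directly subsequently"),
    "11": ("different layer", "concurrent"),
}


def p_ref_select_code(layer, timing):
    """Reverse lookup: find the code for a given reference picture choice."""
    for code, choice in P_REF_SELECT_CODE.items():
        if choice == (layer, timing):
            return code
    raise ValueError("no ref_select_code is described for this combination")


print(p_ref_select_code("different layer", "concurrent"))  # 11
```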

As shown in FIG. 26B, on the other hand, if a B-picture of an upper layer, for example, is encoded using a concurrent VOP of a different layer as a reference picture for forward prediction or is encoded using a VOP decoded directly previously and which belongs to the same layer as a reference picture for backward prediction, the flag ref.sub.-- select.sub.-- code is set to `00`. Also, if a B-picture of an upper layer is encoded using a VOP belonging to the same layer as a reference picture for forward prediction or is encoded using a VOP displayed directly previously and which belongs to a different layer as a reference picture for backward prediction, the flag ref.sub.-- select.sub.-- code is set to `01`. In addition, if a B-picture of an upper layer is encoded using a VOP decoded directly previously and which belongs to the same layer as a reference picture for forward prediction or is encoded using a VOP displayed directly subsequently and which belongs to a different layer as a reference picture for backward prediction, the flag ref.sub.-- select.sub.-- code is set to `10`. Lastly, if a B-picture of an upper layer is encoded using a VOP displayed directly subsequently and which belongs to a different layer as a reference picture for forward prediction or is encoded using a VOP displayed directly subsequently and which belongs to a different layer as a reference picture for backward prediction, the flag ref.sub.-- select.sub.-- code is set to `11`.

The methods for predictive coding explained with reference to FIGS. 24A, 24B, 25A, and 25B are merely illustrative and, as is to be appreciated, it may be freely set within a range explained with reference to FIGS. 26A and 26B which VOP of which layer is to be used as a reference picture for forward predictive coding, backward predictive coding or bidirectional predictive coding.

In the above description, the terms `spatial scalability`, `temporal scalability` and `SNR scalability` were used for convenience. However, as explained with reference to FIGS. 26A and 26B, if a reference picture used for predictive encoding is set, that is if the syntax as shown in FIGS. 26A and 26B is used, it may be difficult to have a clear distinction of spatial scalability, temporal scalability and SNR scalability with the flag ref.sub.-- select.sub.-- code. Stated conversely, the above-mentioned scalability distinction need not be performed by using the flag ref.sub.-- select.sub.-- code. However, the scalability and the flag ref.sub.-- select.sub.-- code can, for example, be associated with each other as described below:

In the case of a P-picture, the flag ref.sub.-- select.sub.-- code of `11` is associated with the use as a reference picture (reference picture for forward prediction) of a concurrent VOP of a layer specified by the flag ref.sub.-- layer.sub.-- id, wherein the scalability is spatial scalability or SNR scalability. If the flag ref.sub.-- select.sub.-- code is other than `11`, the scalability is temporal scalability.

In the case of a B-picture, the flag ref.sub.-- select.sub.-- code of `00` is associated with the use as a reference picture for forward prediction of a concurrent VOP of a layer specified by the flag ref.sub.-- layer.sub.-- id, wherein the scalability is spatial scalability or SNR scalability. If the flag ref.sub.-- select.sub.-- code is other than `00`, the scalability is temporal scalability.
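The association between the flag and the kind of scalability may be sketched as a simple classification. The function name and the string labels are hypothetical. The distinction between spatial and SNR scalability relies on the earlier statement that SNR scalability corresponds to a multiplying factor FR of unity.

```python
def classify_scalability(picture_type, ref_select_code, fr=None):
    """Classify the scalability implied by ref_select_code.

    picture_type: "P" or "B"; ref_select_code: "00", "01", "10" or "11";
    fr: multiplying factor FR, or None if it is not known.
    """
    # The code that denotes a concurrent VOP of another layer as the
    # forward reference: `11` for a P-picture, `00` for a B-picture.
    concurrent_code = {"P": "11", "B": "00"}[picture_type]
    if ref_select_code != concurrent_code:
        return "temporal"
    if fr is None:
        return "spatial or SNR"
    # SNR scalability is treated as spatial scalability with FR equal to 1.
    return "SNR" if fr == 1 else "spatial"


print(classify_scalability("P", "11", fr=2))  # spatial
print(classify_scalability("B", "00", fr=1))  # SNR
print(classify_scalability("B", "10"))        # temporal
```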

If a concurrent VOP of a different layer, herein a lower layer, is used as a reference picture for predictive coding of the VOP of the upper layer, there is no motion between the two VOPs, so that the motion vector is 0, that is, (0,0), at all times.

Returning to FIG. 23, the above-mentioned flags ref.sub.-- layer.sub.-- id and ref.sub.-- select.sub.-- code may be set in the motion vector detector 32 of the upper layer encoding unit 23 and supplied to the motion compensator 42 and the VLC unit 36. The motion vector detector 32 detects a motion vector by use not only of the frame memory 31 but also, if needed, of a frame memory 52, in accordance with the flags ref.sub.-- layer.sub.-- id and ref.sub.-- select.sub.-- code. To the frame memory 52, a locally decoded enlarged picture of a lower layer may be supplied from the resolution converter 24 (FIG. 15). That is, the resolution converter 24 may enlarge the locally decoded VOP of the lower layer by, for example, an interpolation filter, so as to generate an enlarged picture corresponding to the VOP enlarged by a factor of FR, that is, an enlarged picture having the same size as the VOP of the upper layer associated with the VOP of the lower layer. The frame memory 52 stores therein the enlarged picture supplied from the resolution converter 24. However, if the multiplying factor is 1, the resolution converter 24 directly supplies the locally decoded VOP from the lower layer encoding unit 25 to the upper layer encoding unit 23 without performing any specified processing thereon.
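The behavior of the resolution converter 24 may be illustrated by the following sketch. The description does not specify the interpolation filter, so simple pixel replication is used here as a stand-in, and FR is assumed to be a positive integer applied equally in both directions. The function name `enlarge` and the representation of a picture as a list of rows are likewise assumptions.

```python
def enlarge(picture, fr):
    """Enlarge a picture (list of rows of pixel values) by a factor of fr.

    Pixel replication stands in for the unspecified interpolation filter.
    """
    if fr < 1:
        raise ValueError("this sketch only handles enlargement")
    if fr == 1:
        # With a multiplying factor of 1 the picture is passed through
        # without any processing.
        return picture
    enlarged = []
    for row in picture:
        wide_row = [pixel for pixel in row for _ in range(fr)]
        # Repeat each widened row fr times; copy so rows are independent.
        enlarged.extend(list(wide_row) for _ in range(fr))
    return enlarged


lower = [[1, 2],
         [3, 4]]
print(enlarge(lower, 2))
# [[1, 1, 2, 2], [1, 1, 2, 2], [3, 3, 4, 4], [3, 3, 4, 4]]
```

The enlarged picture has FR times the width and height of the lower layer VOP, which is the property the text relies on when it is stored in the frame memory 52 and used as a reference for the upper layer.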

The motion vector detector 32 receives size data FSZ.sub.-- B and offset data FPOS.sub.-- B from the lower layer encoding unit 25, and receives the multiplying factor FR from the delay circuit 22 (FIG. 15). Thus, if the enlarged picture stored in the frame memory 52 is used as a reference picture, that is, if a lower layer VOP concurrent with an upper layer VOP is used as a reference picture for predictive coding of the VOP of the upper layer, the motion vector detector 32 multiplies the size data FSZ.sub.-- B and the offset data FPOS.sub.-- B corresponding to the enlarged picture with the multiplying factor FR. In this case, the flag ref.sub.-- select.sub.-- code is set to `11` for the P-picture as explained with reference to FIG. 26A and to `00` for the B-picture as explained with reference to FIG. 26B. The motion vector detector 32 recognizes the position of the enlarged picture in the absolute coordinate system based on the results of multiplication for detecting the motion vector.
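The multiplication by FR can be shown with a short worked example. As before, the offset data is assumed to be an (x, y) pair and the size data a (width, height) pair, and FR is assumed to be a single scalar applied to both directions; the function name is hypothetical.

```python
def enlarged_geometry(fpos_b, fsz_b, fr):
    """Scale the lower layer offset and size data by the multiplying factor
    FR to obtain the position and size of the enlarged picture in the
    absolute coordinate system."""
    fpos_enlarged = (fpos_b[0] * fr, fpos_b[1] * fr)
    fsz_enlarged = (fsz_b[0] * fr, fsz_b[1] * fr)
    return fpos_enlarged, fsz_enlarged


# A lower layer VOP of 176 x 144 pixels at offset (8, 4), enlarged by FR = 2.
print(enlarged_geometry((8, 4), (176, 144), 2))  # ((16, 8), (352, 288))
```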

The motion vector detector 32 may also receive a prediction mode and a motion vector of the lower layer. These may be used as follows. Suppose that the flag ref.sub.-- select.sub.-- code for the B-picture of the upper layer is `00` and the multiplying factor FR is 1, that is, the scalability is SNR scalability. (In this case an upper layer VOP is used for predictive coding of the upper layer, so that the SNR scalability herein differs from that prescribed in MPEG2.) The upper layer and the lower layer are then of the same picture, so that the motion vector and the prediction mode of the concurrent lower layer picture can be used directly for predictive coding of the B-picture of the upper layer. In this case, neither a motion vector nor a prediction mode is outputted or transmitted from the motion vector detector 32 to the VLC unit 36, because the receiving side can recognize the prediction mode and the motion vector of the upper layer from the decoding results of the lower layer.
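The condition under which the motion vector and prediction mode are not transmitted can be expressed as a single predicate. This is a sketch of the rule stated above, with a hypothetical function name; it does not model any other case in which motion information might be omitted.

```python
def must_transmit_motion_info(picture_type, ref_select_code, fr):
    """Return False when the upper layer can reuse the motion vector and
    prediction mode of the concurrent lower layer picture."""
    reuse_lower_layer = (
        picture_type == "B"
        and ref_select_code == "00"  # concurrent lower layer VOP is the reference
        and fr == 1                  # FR of 1, i.e. SNR scalability
    )
    return not reuse_lower_layer


print(must_transmit_motion_info("B", "00", 1))  # False
print(must_transmit_motion_info("B", "00", 2))  # True
```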

As described above, the motion vector detector 32 may use not only the VOP of an upper layer but also an enlarged picture as reference pictures for detecting the motion vector. In addition, the motion vector detector 32 may set the prediction mode which minimizes the prediction error or variance as explained with reference to FIG. 1. Furthermore, the motion vector detector 32 may also set and output other information, such as flag ref.sub.-- select.sub.-- code and/or ref.sub.-- layer.sub.-- id.

As shown in FIGS. 15 and 23, a flag COD specifying whether or not a macro-block constituting an I- or P-picture in the lower layer is a skip macro-block is supplied from the lower layer encoding unit 25 to the motion vector detector 32, VLC unit 36, and the motion compensator 42, as will be explained subsequently.

A macro-block, the motion vector of which has been detected, may be encoded as described above, whereupon the VLC unit 36 outputs a variable length code as the encoding result. As in the lower layer encoding unit 25, the VLC unit 36 of the upper layer encoding unit 23 may set and output a flag COD specifying whether or not the I- or P-picture macro-block is a skip macro-block as described above and a flag MODB specifying whether the macro-block of the B-picture is a skip macro-block. The VLC unit 36 may also receive the multiplying factor FR, flags ref.sub.-- select.sub.-- code and ref.sub.-- layer.sub.-- id, size data FSZ.sub.-- E, offset data FPOS.sub.-- E, and an output of the key signal encoding unit 51, in addition to the quantization coefficients, quantization step, motion vector, and the prediction mode. The VLC unit 36 variable-length encodes and outputs all of such data.

Further, the macro-block, the motion vector of which has been detected, is encoded and locally decoded as described above and stored in the frame memory 41. In the motion compensator 42, motion compensation is carried out so as to generate a prediction picture using not only the locally decoded VOP of the upper layer stored in the frame memory 41 but also the locally decoded and enlarged VOP of the lower layer stored in the frame memory 52. That is, the motion compensator 42 receives not only the motion vector and the prediction