U.S. patent number RE39,080 [Application Number 10/218,232] was granted by the patent office on 2006-04-25 for rate loop processor for perceptual encoder/decoder.
This patent grant is currently assigned to Lucent Technologies Inc. The invention is credited to James David Johnston.
United States Patent RE39,080
Johnston
April 25, 2006
Rate loop processor for perceptual encoder/decoder
Abstract
A method and apparatus for quantizing audio signals is disclosed
which advantageously produces a quantized audio signal which can be
encoded within an acceptable range. Advantageously, the quantizer
uses a scale factor which is interpolated between a threshold based
on the calculated threshold of hearing at a given frequency and the
absolute threshold of hearing at the same frequency.
Inventors: Johnston; James David (Redmond, WA)
Assignee: Lucent Technologies Inc. (Murray Hill, NJ)
Family ID: 25293693
Appl. No.: 10/218,232
Filed: August 13, 2002
Related U.S. Patent Documents

Application Number | Filing Date | Patent Number | Issue Date
07/844,811 | Mar 2, 1992 | |
07/844,967 | Feb 28, 1992 | |
07/292,598 | Dec 30, 1988 | |

Reissue of:
08/310,898 | Sep 22, 1994 | 5,627,938 | May 6, 1997
Current U.S. Class: 704/200.1; 704/229
Current CPC Class: H04B 1/665 (20130101)
Current International Class: G10L 19/02 (20060101); G10L 21/04 (20060101)
Field of Search: 704/200.1,203,205,206,207,209,219,222,229,230
References Cited
U.S. Patent Documents
Foreign Patent Documents
0 193 143 | Feb 1986 | EP
0193143 | Sep 1986 | EP
0 376 553 | Dec 1989 | EP
0376553 | Jul 1990 | EP
0 559 383 | Feb 1993 | EP
0559383 | Sep 1993 | EP
0 446 037 | Oct 1997 | EP
63-7023 | Jan 1988 | JP
2796673 | Mar 1989 | JP
H01(1989)-500695 | Mar 1989 | JP
2792853 | Sep 1998 | JP
Other References
Zwicker, "Psychoakustik," 1982, pp. 31-53. cited by examiner.
K. Brandenburg, "OCF--A New Coding Algorithm For High Quality Sound
Signals", IEEE ICASSP, pp. 141-144, 1987. cited by other .
J. Princen et al., "Subband Transform Coding Using Filter Bank
Designs Based on Time Domain Aliasing Cancellation", IEEE ICASSP,
pp. 2161-2164, 1987. cited by other .
J. Princen, et al., "Analysis/Synthesis Filter Bank Design Based on
Time Domain Aliasing Cancellation", IEEE Trans. ASSP, vol. ASSP-34,
No. 5, pp. 1153-1161, 1986. cited by other.
D.E. Knuth, et al., "The Art of Computer Programming", 2nd Ed.,
vol. 2, Reading, Mass., pp. 274-275, 1981. cited by other.
Jetzt, "Critical Distance Measurements on Rooms From the Sound
Energy Spectrum Response", Journal of the Acoustical Society of
America, vol. 65, pp. 1204-1211, 1979. cited by other .
B. Scharf, Foundations of Modern Auditory Theory, edited by Jerry
V. Tobias, Chapter 5, Academic Press, N.Y., N.Y., pp. 157-202,
1970. cited by other .
R. P. Hellman, "Asymmetry of Masking Between Noise and Tone",
Perception and Psychophysics, vol. 11, pp. 241-246, 1972. cited by
other.
H. Fletcher, "Auditory Patterns", Reviews of Modern Physics, vol.
12, pp. 47-65, 1940. cited by other.
E. F. Schroeder, et al., "MSC: Stereo Audio Coding With CD-Quality
and 256 kBIT/SEC", IEEE Transactions on Consumer Electronics, vol.,
CE-33, No. 4, pp. 512-519, Nov. 1987. cited by other .
J.D. Johnston, "Transform Coding of Audio Signals Using Perceptual
Noise Criteria", IEEE Journal On Selected Areas In Communications,
vol. 6, No. 2, pp. 314-323, Feb. 1988. cited by other .
N.S. Jayant, et al., "Digital Coding of Waveforms--Principles and
Applications to Speech and Video", Chapter 12, Transform Coding,
1987. cited by other.
E. Tan, et al., "Digital Audio Tape for Data Storage", IEEE
Spectrum, pp. 34-38, Oct. 1989. cited by other .
M.R. Schroeder, et al., "Optimizing Digital Speech Coders by
Exploiting Masking Properties of the Human Ear", Journal of
Acoustical Society of America, vol. 66 (6), pp. 1647-1652, Dec.
1979. cited by other .
"FX/FORTRAN Programmer's Handbook", Alliant Computer Systems Corp.,
Jul. 1988. cited by other .
R.N.J. Veldhuis, et al., "Subband Coding of Digital Audio
Signals", Philips Journal of Research, vol. 44, Nos. 2, 3, pp.
329-343, Jul. 1989. cited by other.
H.G. Musmann, "The ISO Audio Coding Standard", Globecom '90, vol.
1(3), Dec. 1990, N.Y., pp. 511-517. cited by other.
"Aspec: Adaptive Spectral Entropy Coding of High Quality Music
Signals", AES 90th Convention, 1991. cited by other.
K. Brandenburg, "Second Generation Perceptual Audio Coding: The
Hybrid Coder", AES 89th Convention, 1990. cited by other.
J.D. Johnston, "Perceptual Transform Coding of Wideband Stereo
Signals", IEEE ICASSP, pp. 1993-1996, 1989. cited by other .
J.D. Johnston, "Estimation of Perceptual Entropy Using Noise Masking
Criteria", IEEE ICASSP, pp. 2524-2527, 1989. cited by other.
K. Brandenburg, "Evaluation Of Quality For Audio Encoding At Low
Bit Rates", AES 82nd Convention, pp. 1-11, Mar. 1987. cited by
other.
E. Zwicker, et al., Absolute and Masked Thresholds of Continuous
Sounds, pp. 65-81 of "The Ear As A Communication Receiver"
(Original German edition "Das Ohr als Nachrichtenempfänger", Second
Rev. ed. 1967). cited by other.
G. Theile, et al., "Low Bit-Rate Coding of High-Quality Audio
Signals", AES 82nd Convention, pp. 1-31, Mar. 1987. cited by
other.
G. Theile, et al., "Low Bit-Rate Coding of High-Quality Audio
Signals: An Introduction to the MASCAM System", AES 7th
International Conf., EBU Review--Technical, No. 230, pp. 158-209,
Aug. 1988. cited by other.
R. G. van der Waal, et al., "Subband Coding of Stereophonic Digital
Audio Signals", IEEE, pp. 3601-3604, Jul. 1991. cited by other.
K. Brandenburg, "Aspec Coding", AES 10th International Conf.,
pp. 81-90, Sep. 1991. cited by other.
Brandenburg et al., "A Digital Signal Processor for Real Time
Adaptive Transform Coding of Audio Signals Up To 20 kHz Bandwidth,"
IEEE Int'l Conf on Circuits and Computers, 1982, pp. 474-477. cited
by other .
Krahe, "Neues Quellencodierungsverfahren für qualitativ
hochwertige, digitale Audiosignale," University Duisburg, Nov. 1985.
cited by other.
Press et al., "Numerical Recipes," Cambridge University Press,
1986, pp. 77-92, 240-247 and 595. cited by other .
Tribolet et al., "Frequency Domain Coding of Speech," IEEE Trans.
on Acoust., Speech and Sig. Proc., Oct. 1979, pp. 512-530. cited by
other .
AT&T Bell Laboratories et al., "ASPEC," ISO-MPEG Audio Coding
Submission, submitted Oct. 18, 1989, amended Dec. 11, 1989, revised
Jun. 18, 1990. cited by other .
Johnston, "Transform Coding of Audio Signals Using Perceptual Noise
Criteria," IEEE J. Selected Areas in Comm., vol. 6, No. 2, Feb.
1988, pp. 314-323. cited by other.
Brandenburg, K.-H., Langenbucher, G.C., Shram, H., Seitzer, D.; A
Digital Signal Processor For Real Time Adaptive Transform Coding Of
Audio Signals Up To 20 kHz Bandwidth, IEEE Int'l Conf. On Circuits
and Computers, 1982, pp. 474-477. cited by other.
Crochiere, R.E., Tribolet, J.M.; Frequency Domain Techniques For
Speech Coding, J. Acoust. Soc. Of Amer., Dec. 1979, pp. 1642-1646.
cited by other .
Crochiere, R.E.; Sub-Band Coding, The Bell System Technical
Journal, vol. 60, No. 7, Sep. 1981, pp. 1633-1653. cited by other
.
Flanagan, J.L., Schroeder, M.R., Atal, B.S., Crochiere, R.E.,
Jayant, N.S., Tribolet, J.M.; Speech Coding, IEEE Transactions on
Communications, vol. COM-27, No. 4, Apr. 1979, pp. 710-736. cited
by other .
Grauel, C.; Sub-Band Coding With Adaptive Bit Allocation, Signal
Processing 2, North-Holland Publishing Co., 1980, pp. 23-30. cited
by other .
Heron, C.D., Crochiere, R.E., Cox, R.V.; A 32-Band
Sub-band/Transform Coder Incorporating Vector Quantization For
Dynamic Bit Allocation, Proceedings IEEE ICASSP, 1983, pp.
1276-1279. cited by other.
Jayant, N.S., Noll, P.; Digital Coding of Waveforms, Prentice-Hall,
1984, pp. 56-58. cited by other .
Krahe, D.; New Source Coding Method For High Quality Digital Audio
Signals, University Duisburg, Nov. 1985 (Original and English
Translation). cited by other .
Press, W.H., Flannery, B.P., Teukolsky, S.A., Vetterling, W.T.;
Numerical Recipes, Cambridge University Press, 1986, pp. 77-92,
240-247 and 595. cited by other.
Ramstad, T.A., Sub-band Coder With A Simple Adaptive Bit-Allocation
Algorithm: A Possible Candidate for Digital Mobile Telephony?
Proceedings IEEE ICASSP, 1982, vol. 1, pp. 203-207. cited by other
.
Tribolet, J.M., Crochiere, R. E., Frequency Domain Coding of
Speech, IEEE Transactions on Acoustics, Speech, and Signal
Processing, vol. ASSP-27, No. 5, Oct. 1979, pp. 512-530. cited by
other .
Zelinski, R., Noll, P.; Adaptive Transform Coding of Speech
Signals, IEEE Transactions on Acoustics, Speech, and Signal
Processing, vol. ASSP-25, No. 4, Aug. 1977, pp. 299-309. cited by
other .
Zwicker, E., Terhardt, E.; Facts And Models In Hearing,
Springer-Verlag 1974, pp. 251-257. cited by other .
Dolby's Preliminary Invalidity Contentions, Dolby Laboratories Inc.
and Dolby Laboratories Licensing Corporation v. Lucent Technologies
Inc. and Lucent Technologies Guardian I LLC, United States District
Court, Northern District of California, San Jose Division, Apr. 3,
2003. cited by other .
Brandenburg, K., et al., "Transmission of High Quality Audio
Signals with Bit Rates in the Range of 64-144 KBIT/SEC,"
ITG-Conference Proceedings, Information Technology Society, pp.
217-222, VDE-Verlag GmbH, Berlin, Nov., 1988. cited by other .
Brandenburg, K., "A Contribution on the Methods of and Evaluation
of the Quality of High-quality Music Coding," The Department of
Engineering at the University of Erlangen-Nuremberg, Doctor of
Engineering Thesis, pp. 1-199, Erlangen University Library,
Erlangen, Germany, Jan., 1989. cited by other .
Brandenburg, K., "High quality sound coding at 2.5 bit/sample,"
84th Convention, AES (Audio Engineering Society), Mar. 1-4,
1988, Paris, Preprint 2582 (D-2), 8 pages, Audio Engineering
Society, New York, 1988. cited by other.
Brandenburg, K., et al., "Low Bit Rate Coding of High-Quality
Digital Audio: Algorithms and Evaluation of Quality," AES
Conference, May 14-17, 1989, Toronto, Canada, pp. 1-25, Audio
Engineering Society, New York. cited by other .
Ernst Eberlein, et al., Psychoacoustically Based Measuring Device
For Optimizing Data Reduction Methods, Tonmeistertagung '88, pp.
552-564, Nov. 19, 1988. cited by other.
Brandenburg, K., et al., "OCF: Coding High Quality Audio with Data
Rates of 64kbit/sec," AES Convention, Nov. 3-6, 1988, Los Angeles,
California, 12 pages, Audio Engineering Society, New York. cited by
other .
Seitzer, et al., "Digital Coding of High Quality Audio,"
Proceedings Advanced Computer Technology, Reliable Systems and
Applications, 5th Annual European Computer Conference,
Bologna, May 13-16, 1991, pp. 148-154, IEEE Computer Society Press,
Los Alamitos, California, 1991. cited by other.
Dolby's Supplemental Invalidity Contentions, Dolby Laboratories
Inc. and Dolby Laboratories Licensing Corporation vs. Lucent
Technologies Inc. and Lucent Technologies Guardian I LLC, United
States District Court, Northern District of California, San Jose
Division, Jan. 7, 2004. cited by other.
Krasner, M.A., Digital Encoding of Speech and Audio Signals Based
on the Perceptual Requirements of the Auditory System, MIT Lincoln
Laboratory, Technical Report 535, Jun. 18, 1979. cited by other.
Krasner, M.A., Digital Encoding of Speech and Audio Signals Based on
the Perceptual Requirements of the Auditory System, Submitted in
Partial Fulfillment of the Requirements for the Degree of Doctor of
Philosophy, Massachusetts Institute of Technology, May 4, 1979.
cited by other.
Terhardt, E., Stoll, G., Seewann, M., Algorithm For Extraction Of
Pitch And Pitch Salience From Complex Tonal Signals, J. Acoust.
Soc. Am., 71(3), Mar. 1982, pp. 679-688. cited by other.
Primary Examiner: Azad; Abul K.
Parent Case Text
This is a reissue application of U.S. Pat. No. 5,627,938, filed
Sep. 22, 1994 as application Ser. No. 08/310,898, which is a
continuation of application Ser. No. 07/844,811, filed on Mar. 2,
1992, now abandoned, which is a continuation-in-part of application
Ser. No. 07/844,967, filed Feb. 28, 1992, now abandoned, which is a
continuation of Ser. No. 07/292,598, filed Dec. 30, 1988, now
abandoned.
Claims
I claim:
1. A method of coding an audio signal comprising: (a) converting a
time domain representation of the audio signal into a frequency
domain representation of the audio signal, the frequency domain
representation comprising a set of frequency coefficients; (b)
calculating a masking threshold based upon the set of frequency
coefficients; (c) using a rate loop processor in an iterative
fashion to determine a set of quantization step size coefficients
for use in encoding the set of frequency coefficients, said set of
quantization step size coefficients determined by using the masking
threshold and an absolute hearing threshold; and (d) coding the set
of frequency coefficients based upon the set of quantization step
size coefficients.
2. (Canceled in reissue) The method of claim 1 wherein the set of
frequency coefficients are MDCT coefficients.
3. The method of claim 1 wherein the using the rate loop processor
in the iterative fashion is discontinued when a cost, measured by
the number of bits necessary to code the set of frequency
coefficients, is within a predetermined range.
4. A decoder for decoding a set of frequency coefficients
representing an audio signal, the decoder comprising: (a) means for
receiving the set of coefficients, the set of frequency
coefficients having been encoded by: (1) converting a time domain
representation of the audio signal into a frequency domain
representation of the audio signal comprising the set of frequency
coefficients; (2) calculating a masking threshold based upon the
set of frequency coefficients; (3) using a rate loop processor in
an iterative fashion to determine a set of quantization step size
coefficients needed to encode the set of frequency coefficients,
said set of quantization step size coefficients determined by using
the masking threshold and an absolute hearing threshold; and (4)
coding the set of frequency coefficients based upon the set of
quantization step size coefficients; and (b) means for converting
the set of coefficients to a time domain signal.
Description
CROSS-REFERENCE TO RELATED APPLICATIONS AND MATERIALS
The following U.S. patent applications filed concurrently with the
present application and assigned to the assignee of the present
application are related to the present application and each is
hereby incorporated herein as if set forth in its entirety: "A
METHOD AND APPARATUS FOR THE PERCEPTUAL CODING OF AUDIO SIGNALS,"
by A. Ferreira and J. D. Johnston, application Ser. No. 07/844,819,
now abandoned, which in turn was parent of application Ser. No.
08/334,889, allowed Jul. 11, 1996; "A METHOD AND APPARATUS FOR
CODING AUDIO SIGNALS BASED ON PERCEPTUAL MODEL," by J.D. Johnston,
application Ser. No. 07/844,804, now U.S. Pat. No. 5,285,498,
issued Feb. 8, 1994; and "AN ENTROPY CODER," by J.D. Johnston and
J.A. Reeds, application Ser. No. 07/844,809, now U.S. Pat. No.
5,227,788, issued Jul. 13, 1993.
FIELD OF THE INVENTION
The present invention relates to processing of signals, and more
particularly, to the efficient encoding and decoding of monophonic
and stereophonic audio signals, including signals representative of
voice and music for storage or transmission.
BACKGROUND OF THE INVENTION
Consumer, industrial, studio and laboratory products for storing,
processing and communicating high quality audio signals are in
great demand. For example, so-called compact disc ("CD") and
digital audio tape ("DAT") recordings for music have largely
replaced the long-popular phonograph record and cassette tape.
Likewise, recently available digital audio tape ("DAT") recordings
promise to provide greater flexibility and high storage density for
high quality audio signals. See also Tan and Vermeulen, "Digital
audio tape for data storage", IEEE Spectrum, pp. 34-38 (October
1989). A demand is also arising for broadcast applications of
digital technology that offer CD-like quality.
While these emerging digital techniques are capable of producing
high quality signals, such performance is often achieved only at
the expense of considerable data storage capacity or transmission
bandwidth. Accordingly, much work has been done in an attempt to
compress high quality audio signals for storage and
transmission.
Most of the prior work directed to compressing signals for
transmission and storage has sought to reduce the redundancies that
the source of the signals places on the signal. Thus, such
techniques as ADPCM, sub-band coding and transform coding
described, e.g., in N. S. Jayant and P. Noll, "Digital Coding of
Waveforms," Prentice-Hall, Inc., 1984, have sought to eliminate
redundancies that otherwise would exist in the source signals.
In other approaches, the irrelevant information in source signals
is sought to be eliminated using techniques based on models of the
human perceptual system. Such techniques are described, e.g., in E.
F. Schroeder and J. J. Platte, "MSC: Stereo Audio Coding with
CD-Quality and 256 kBIT/SEC," IEEE Trans. on Consumer Electronics,
Vol. CE-33, No. 4, November 1987; and Johnston, "Transform Coding
of Audio Signals Using Noise Criteria," Vol. 6, No. 2, IEEE
J.S.C.A. (February 1988).
Perceptual coding, as described, e.g., in the Johnston paper,
relates to a technique for lowering required bit rates (or
reapportioning available bits) or the total number of bits used in
representing audio signals. In this form of coding, a masking
threshold for unwanted signals is identified as a function of
frequency of the desired signal. Then, inter alia, the coarseness
of quantizing used to represent a signal component of the desired
signal is selected such that the quantizing noise introduced by the
coding does not rise above the noise threshold, though it may be
quite near this threshold. The introduced noise is therefore masked
in the perception process. While traditional signal-to-noise ratios
for such perceptually coded signals may be relatively low, the
quality of these signals upon decoding, as perceived by a human
listener, is nevertheless high.
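The principle described above can be illustrated numerically (this sketch is not taken from the patent): a uniform quantizer with step size Δ injects noise of power approximately Δ²/12, so the step can be chosen from an assumed masking threshold T so that the injected noise sits at, but not above, that threshold.

```python
import numpy as np

rng = np.random.default_rng(0)
signal = rng.standard_normal(100_000)      # stand-in for one spectral band
T = 1e-4                                   # assumed masking threshold (noise power)
delta = np.sqrt(12.0 * T)                  # coarsest step whose noise stays near T
quantized = delta * np.round(signal / delta)
noise_power = np.mean((signal - quantized) ** 2)
assert noise_power <= 1.05 * T             # noise is at the threshold, not above it
```

A coarser step would save bits but push the quantization noise above the threshold, making it audible; this trade-off is what the rate loop described later manages.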
Brandenburg et al, U.S. Pat. No. 5,040,217, issued Aug. 13, 1991,
describes a system for efficiently coding and decoding high quality
audio signals using such perceptual considerations. In particular,
using a measure of the "noise-like" or "tone-like" quality of the
input signals, the embodiments described in the latter system
provide a very efficient coding for monophonic audio signals.
It is, of course, important that the coding techniques used to
compress audio signals do not themselves introduce offensive
components or artifacts. This is especially important when coding
stereophonic audio information where coded information
corresponding to one stereo channel, when decoded for reproduction,
can interfere or interact with coding information corresponding to
the other stereo channel. Implementation choices for coding two
stereo channels include so-called "dual-mono" coders using two
independent coders operating at fixed bit rates. By contrast,
"joint mono" coders use two monophonic coders but share one
combined bit rate, i.e., the bit rate for the two coders is
constrained to be less than or equal to a fixed rate, but
trade-offs can be made between the bit rates for individual coders.
"Joint stereo" coders are those that attempt to use interchannel
properties for the stereo pair for realizing additional coding
gain.
It has been found that the independent coding of the two channels
of a stereo pair, especially at low bit-rates, can lead to a number
of undesirable psychoacoustic artifacts. Among them are those
related to the localization of coding noise that does not match the
localization of the dynamically imaged signal. Thus the human
stereophonic perception process appears to add constraints to the
encoding process if such mismatched localization is to be avoided.
This finding is consistent with reports on binaural masking-level
differences that appear to exist, at least for low frequencies,
such that noise may be isolated spatially. Such binaural
masking-level differences are considered to unmask a noise
component that would be masked in a monophonic system. See, for
example, B.C.J. Moore, "An Introduction to the Psychology of
Hearing, Second Edition," especially chapter 5, Academic Press,
Orlando, Fla., 1982.
One technique for reducing psychoacoustic artifacts in the
stereophonic context employs the ISO-WG11-MPEG-Audio Psychoacoustic
II [ISO] Model. In this model, a second limit of signal-to-noise
ratio ("SNR") is applied to signal-to-noise ratios inside the
psychoacoustic model. However, such additional SNR constraints
typically require the expenditure of additional channel capacity or
(in storage applications) the use of additional storage capacity,
at low frequencies, while also degrading the monophonic performance
of the coding.
SUMMARY OF THE INVENTION
Limitations of the prior art are overcome and a technical advance
is made in a method and apparatus for coding a stereo pair of high
quality audio channels in accordance with aspects of the present
invention. Interchannel redundancy and irrelevancy are exploited to
achieve lower bit-rates while maintaining high quality reproduction
after decoding. While particularly appropriate to stereophonic
coding and decoding, the advantages of the present invention may
also be realized in conventional dual monophonic stereo coders.
An illustrative embodiment of the present invention employs a
filter bank architecture using a Modified Discrete Cosine Transform
(MDCT). In order to code the full range of signals that may be
presented to the system, the illustrative embodiment advantageously
uses both L/R (Left and Right) and M/S (Sum/Difference) coding,
switched in both frequency and time in a signal dependent fashion.
A new stereophonic noise masking model advantageously detects and
avoids binaural artifacts in the coded stereophonic signal.
Interchannel redundancy is exploited to provide enhanced
compression without degrading audio quality.
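As a small illustration of Sum/Difference (M/S) coding, Left/Right pairs can be converted to Mid/Side channels and back exactly; the 1/2 scaling below is an assumed convention, not one fixed by the patent. For highly correlated channels, most of the energy concentrates in the M channel, leaving a small S channel that is cheap to code.

```python
import numpy as np

def lr_to_ms(left, right):
    # Assumed convention: the 1/2 factor makes the transform exactly invertible.
    return (left + right) / 2.0, (left - right) / 2.0

def ms_to_lr(mid, side):
    return mid + side, mid - side

L = np.array([1.0, 0.5, -0.25])
R = np.array([0.9, 0.5, -0.2])
M, S = lr_to_ms(L, R)
assert np.sum(S ** 2) < np.sum(M ** 2)     # energy concentrates in the Mid channel
L2, R2 = ms_to_lr(M, S)
assert np.allclose(L2, L) and np.allclose(R2, R)   # round trip is lossless
```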
The time behavior of both Right and Left audio channels is
advantageously accurately monitored and the results used to control
the temporal resolution of the coding process. Thus, in one aspect,
an illustrative embodiment of the present invention provides
processing of input signals in terms of either a normal MDCT
window, or, when signal conditions indicate, shorter windows.
Further, dynamic switching between RIGHT/LEFT or SUM/DIFFERENCE
coding modes is provided both in time and frequency to control
unwanted binaural noise localization, to prevent the need for
overcoding of SUM/DIFFERENCE signals, and to maximize the global
coding gain.
A typical bitstream definition and rate control loop are described
which provide useful flexibility in forming the coder output.
Interchannel irrelevancies are advantageously eliminated and
stereophonic noise masking improved, thereby to achieve improved
reproduced audio quality in jointly coded stereophonic pairs. The
rate control method used in an illustrative embodiment uses an
interpolation between absolute threshold and masking threshold for
signals below the rate-limit of the coder, and a threshold
elevation strategy under rate-limited conditions.
In accordance with an overall coder/decoder system aspect of the
present invention, it proves advantageous to employ an improved
Huffman-like entropy coder/decoder to further reduce the channel
bit rate requirements, or storage capacity for storage
applications. The noiseless compression method illustratively used
employs Huffman coding along with a frequency-partitioning scheme
to efficiently code the frequency samples for L,R,M and S, as may
be dictated by the perceptual threshold.
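A minimal Huffman code construction conveys the idea of the noiseless compression stage; this is the generic textbook algorithm, not the patent's codebooks or its frequency-partitioning scheme.

```python
import heapq
from collections import Counter

def huffman_code(symbols):
    """Build a prefix code from observed symbol frequencies."""
    freq = Counter(symbols)
    if len(freq) == 1:                       # degenerate one-symbol alphabet
        return {next(iter(freq)): "0"}
    # heap entries: (subtree count, unique tie-breaker, partial codebook)
    heap = [(n, i, {s: ""}) for i, (s, n) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        n1, _, c1 = heapq.heappop(heap)      # merge the two least-frequent subtrees
        n2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + b for s, b in c1.items()}
        merged.update({s: "1" + b for s, b in c2.items()})
        heapq.heappush(heap, (n1 + n2, tie, merged))
        tie += 1
    return heap[0][2]

code = huffman_code("aaaabbc")
# frequent symbols get short codewords: 'a' is 1 bit, 'b' and 'c' 2 bits each
assert sorted(len(c) for c in code.values()) == [1, 2, 2]
```

In a coder of the kind described, separate codebooks per frequency partition let each region of the spectrum be coded close to its own entropy.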
The present invention provides a mechanism for determining the
scale factors to be used in quantizing the audio signal (i.e., the
MDCT coefficients output from the analysis filter bank) by using an
approach different from the prior art, while avoiding many of the
restrictions and costs of prior quantizer/rate-loops. The audio
signals quantized pursuant to the present invention introduce less
noise and encode into fewer bits than the prior art.
These results are obtained in an illustrative embodiment of the
present invention whereby the utilized scale factor is iteratively
derived by interpolating between a scale factor derived from a
calculated threshold of hearing at the frequency corresponding to
the frequency of the respective spectral coefficient to be
quantized and a scale factor derived from the absolute threshold of
hearing at said frequency until the quantized spectral coefficients
can be encoded within permissible limits.
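The iterative derivation just described can be sketched as follows. The geometric (dB-linear) interpolation between the two thresholds, the crude bit-cost model, and the doubling step used for threshold elevation are all simplifying assumptions for illustration, not the patent's exact procedure.

```python
import numpy as np

def bit_cost(q):
    # Assumed cost model: roughly log2(|q|) + 1 bits per quantized coefficient.
    return int(np.sum(np.ceil(np.log2(np.abs(q) + 1)) + 1))

def rate_loop(coeffs, masked_thr, abs_thr, bit_budget, n_steps=64):
    """Interpolate per-line thresholds between the calculated masking threshold
    (mu = 0) and the absolute threshold of hearing (mu = 1) until the quantized
    spectrum fits the bit budget; elevate the threshold if even mu = 1 fails."""
    for i in range(n_steps + 1):
        mu = i / n_steps
        thr = masked_thr ** (1.0 - mu) * abs_thr ** mu  # linear interpolation in dB
        step = np.sqrt(12.0 * thr)      # uniform-quantizer noise power = step^2 / 12
        q = np.round(coeffs / step)
        if bit_cost(q) <= bit_budget:
            return q, step
    while bit_cost(q) > bit_budget:     # rate-limited: threshold elevation
        step *= 2.0
        q = np.round(coeffs / step)
    return q, step

rng = np.random.default_rng(1)
coeffs = 100.0 * rng.standard_normal(128)
q, step = rate_loop(coeffs, np.full(128, 1e-2), np.full(128, 1.0), bit_budget=400)
assert bit_cost(q) <= 400
```

The loop always terminates: as the step size grows, every coefficient eventually quantizes to zero and the cost reaches its floor.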
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 presents an illustrative prior art audio
communication/storage system of a type in which aspects of the
present invention find application, and provides improvement and
extension.
FIG. 2 presents an illustrative perceptual audio coder (PAC) in
which the advances and teachings of the present invention find
application, and provide improvement and extension.
FIG. 3 shows a representation of a useful masking level difference
factor used in threshold calculations.
FIG. 4 presents an illustrative analysis filter bank according to
an aspect of the present invention.
FIGS. 5(a) through 5(e) illustrate the operation of various window
functions.
FIG. 6 is a flow chart illustrating window switching
functionality.
FIG. 7 is a block/flow diagram illustrating the overall processing
of input signals to derive the output bitstream.
FIG. 8 illustrates certain threshold variations.
FIG. 9 is a flow chart representation of certain bit allocation
functionality.
FIG. 10 shows bitstream organization.
FIGS. 11a through 11c illustrate certain Huffman coding
operations.
FIG. 12 shows operations at a decoder that are complementary to
those for an encoder.
FIG. 13 is a flowchart illustrating certain quantization operations
in accordance with an aspect of the present invention.
FIGS. 14(a) through 14(g) are illustrative windows for use with the
filter bank of FIG. 4.
DETAILED DESCRIPTION
1. Overview
To simplify the present disclosure, the following patents, patent
applications and publications are hereby incorporated by reference
in the present disclosure as if fully set forth herein: U.S. Pat.
No. 5,040,217, issued Aug. 13, 1991 by K. Brandenburg et al, U.S.
patent application Ser. No. 07/292,598, entitled Perceptual Coding
of Audio Signals, filed Dec. 30, 1988; J. D. Johnston, Transform
Coding of Audio Signals Using Perceptual Noise Criteria, IEEE
Journal on Selected Areas in Communications, Vol. 6, No. 2 (February
1988); International Patent Application (PCT) WO 88/01811, filed
Mar. 10, 1988; U.S. patent application Ser. No. 07/491,373,
entitled Hybrid Perceptual Coding, filed Mar. 9, 1990, Brandenburg
et al, Aspec: Adaptive Spectral Entropy Coding of High Quality
Music Signals, AES 90th Convention (1991); Johnston, J., Estimation
of Perceptual Entropy Using Noise Masking Criteria, ICASSP, (1988);
J. D. Johnston, Perceptual Transform Coding of Wideband Stereo
Signals, ICASSP (1989); E. F. Schroeder and J. J. Platte, "MSC":
Stereo Audio Coding with CD-Quality and 256 kBIT/SEC," IEEE Trans.
on Consumer Electronics, Vol. CE-33, No. 4, November 1987; and
Johnston, Transform Coding of Audio Signals Using Noise Criteria,
Vol. 6, No. 2, IEEE J.S.C.A. (February 1988).
For clarity of explanation, the illustrative embodiment of the
present invention is presented as comprising individual functional
blocks (including functional blocks labeled as "processors"). The
functions these blocks represent may be provided through the use of
either shared or dedicated hardware, including, but not limited to,
hardware capable of executing software. (Use of the term
"processor" should not be construed to refer exclusively to
hardware capable of executing software.) Illustrative embodiments
may comprise digital signal processor (DSP) hardware, such as the
AT&T DSP16 or DSP32C, and software performing the operations
discussed below. Very large scale integration (VLSI) hardware
embodiments of the present invention, as well as hybrid DSP/VLSI
embodiments, may also be provided.
FIG. 1 is an overall block diagram of a system useful for
incorporating an illustrative embodiment of the present invention.
At the level shown, the system of FIG. 1 illustrates systems known
in the prior art, but modifications, and extensions described
herein will make clear the contributions of the present invention.
In FIG. 1, an analog audio signal 101 is fed into a preprocessor
102 where it is sampled (typically at 48 kHz) and converted into a
digital pulse code modulation ("PCM") signal 103 (typically 16
bits) in standard fashion. The PCM signal 103 is fed into a
perceptual audio coder 104 ("PAC") which compresses the PCM signal
and outputs the compressed PAC signal to a communications
channel/storage medium 106. From the communications channel/storage
medium the compressed PAC signal (105) is fed into a perceptual
audio decoder 108 which decompresses the compressed PAC signal and
outputs a PCM signal 107 which is representative of the compressed
PAC signal 105. From the perceptual audio decoder, the PCM signal
107 is fed into a post-processor 110 which creates an analog
representation of the PCM signal 107.
An illustrative embodiment of the perceptual audio coder 104 is
shown in block diagram form in FIG. 2. As in the case of the system
illustrated in FIG. 1, the system of FIG. 2, without more, may
equally describe certain prior art systems, e.g., the system
disclosed in the Brandenburg, et al U.S. Pat. No. 5,040,217.
However, with the extensions and modifications described herein,
important new results are obtained. The perceptual audio coder of
FIG. 2 may advantageously be viewed as comprising an analysis
filter bank 202, a perceptual model processor 204, a
quantizer/rate-loop processor 206 and an entropy encoder 208.
The filter bank 202 in FIG. 2 advantageously transforms an input
audio signal in time/frequency in such manner as to provide both
some measure of signal processing gain (i.e. redundancy extraction)
and a mapping of the filter bank inputs in a way that is meaningful
in light of the human perceptual system. Advantageously the
well-known Modified Discrete Cosine Transform (MDCT) described,
e.g., in J. P. Princen and A. B. Bradley, "Analysis/Synthesis
Filter Bank Design Based on Time Domain Aliasing Cancellation,"
IEEE Trans. ASSP, Vol. 34, No. 5, October, 1986, may be adapted to
perform such transforming of the input signals.
Features of the MDCT that make it useful in the present context
include its critical sampling characteristic, i.e. for every n
samples into the filter bank, n samples are obtained from the
filter bank. Additionally, the MDCT typically provides
half-overlap, i.e. the transform length is exactly twice the number
of samples, n, shifted into the filter bank. The
half-overlap provides a good method of dealing with the control of
noise injected independently into each filter tap as well as
providing a good analysis window frequency response. In addition,
in the absence of quantization, the MDCT provides exact
reconstruction of the input samples, subject only to a delay of an
integral number of samples.
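These properties (critical sampling, half-overlap, and exact reconstruction in the absence of quantization) can be checked directly with a small NumPy sketch. This is an illustration of the generic Princen-Bradley MDCT, not the filter bank of the incorporated application: an MDCT of length 2n with hop n, a sine analysis/synthesis window satisfying the w[n]² + w[n+N]² = 1 condition, and overlap-add reconstruct all interior samples exactly.

```python
import numpy as np

def mdct(block):
    """Forward MDCT: 2N (windowed) samples -> N coefficients."""
    N = len(block) // 2
    n = np.arange(2 * N)
    k = np.arange(N)
    basis = np.cos(np.pi / N * (n[None, :] + 0.5 + N / 2) * (k[:, None] + 0.5))
    return basis @ block

def imdct(coeffs):
    """Inverse MDCT: N coefficients -> 2N time-aliased samples."""
    N = len(coeffs)
    n = np.arange(2 * N)
    k = np.arange(N)
    basis = np.cos(np.pi / N * (n[:, None] + 0.5 + N / 2) * (k[None, :] + 0.5))
    return (2.0 / N) * (basis @ coeffs)

N = 64                                       # hop size; each transform sees 2N samples
window = np.sin(np.pi / (2 * N) * (np.arange(2 * N) + 0.5))  # sine window

rng = np.random.default_rng(0)
x = rng.standard_normal(8 * N)
padded = np.concatenate([np.zeros(N), x, np.zeros(N)])
out = np.zeros_like(padded)
for s in range(0, len(padded) - 2 * N + 1, N):   # 50%-overlapped blocks
    out[s:s + 2 * N] += window * imdct(mdct(window * padded[s:s + 2 * N]))

# time-domain aliasing cancellation: interior samples are recovered exactly
assert np.allclose(out[N:-N], x)
```

Note the critical-sampling property in the sketch: each hop of n input samples produces exactly n coefficients, even though every transform observes 2n samples.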
One aspect in which the MDCT is advantageously modified for use in
connection with a highly efficient stereophonic audio coder is the
provision of the ability to switch the length of the analysis
window for signal sections which have strongly non-stationary
components in such a fashion that it retains the critically sampled
and exact reconstruction properties. The incorporated U.S. patent
application by Ferreira and Johnston, entitled, "A METHOD AND
APPARATUS FOR THE PERCEPTUAL CODING OF AUDIO SIGNALS," (referred to
hereinafter as the "filter bank application") filed of even date
with this application, describes a filter bank appropriate for
performing the functions of element 202 in FIG. 2.
The perceptual model processor 204 shown in FIG. 2 calculates an
estimate of the perceptual importance, noise masking properties, or
just noticeable noise floor of the various signal components in the
analysis bank. Signals representative of these quantities are then
provided to other system elements to provide improved control of
the filtering operations and organizing of the data to be sent to
the channel or storage medium. Rather than using the critical band
by critical band analysis described in J. D. Johnston, "Transform
Coding of Audio Signals Using Perceptual Noise Criteria," IEEE J.
on Selected Areas in Communications, February 1988, an illustrative
embodiment of the present invention advantageously uses finer
frequency resolution in the calculation of thresholds. Thus instead
of using an overall tonality metric as in the last-cited Johnston
paper, a tonality method based on that mentioned in K. Brandenburg
and J. D. Johnston, "Second Generation Perceptual Audio Coding: The
Hybrid Coder," AES 89th Convention, 1990 provides a tonality
estimate that varies over frequency, thus providing a better fit
for complex signals.
The psychoacoustic analysis performed in the perceptual model
processor 204 provides a noise threshold for the L (Left), R
(Right), M (Sum) and S (Difference) channels, as may be
appropriate, for both the normal MDCT window and the shorter
windows. Use of the shorter windows is advantageously controlled
entirely by the psychoacoustic model processor.
In operation, an illustrative embodiment of the perceptual model
processor 204 evaluates thresholds for the left and right channels,
denoted THR.sub.l and THR.sub.r. The two thresholds are then
compared in each of the illustrative 35 coder frequency partitions
(56 partitions in the case of an active window-switched block). In
each partition where the two thresholds vary between left and right
by less than some amount, typically 2 dB, the coder is switched
into M/S mode. That is, the left signal for that band of
frequencies is replaced by M=(L+R)/2, and the right signal is
replaced by S=(L-R)/2. The actual amount of difference that
triggers the last-mentioned substitution will vary with bitrate
constraints and other system parameters.
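The per-partition M/S decision described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name is ours, thresholds are assumed to be per-partition linear power values, and the 2 dB figure is the "typical" value the text mentions.

```python
import math

def ms_decision(thr_l, thr_r, left, right, max_diff_db=2.0):
    """Per-partition M/S switching sketch.

    thr_l, thr_r: per-partition noise thresholds (linear power).
    left, right:  per-partition spectral values for L and R.
    Returns (a, b, used_ms): the M/S pair where thresholds match
    within max_diff_db, otherwise the original L/R pair.
    """
    out_a, out_b, used_ms = [], [], []
    for tl, tr, l, r in zip(thr_l, thr_r, left, right):
        # Compare thresholds in dB; switch to M/S when they differ
        # by less than max_diff_db (illustratively 2 dB).
        diff_db = abs(10.0 * math.log10(tl / tr))
        if diff_db < max_diff_db:
            out_a.append((l + r) / 2.0)  # M = (L+R)/2
            out_b.append((l - r) / 2.0)  # S = (L-R)/2
            used_ms.append(True)
        else:
            out_a.append(l)
            out_b.append(r)
            used_ms.append(False)
    return out_a, out_b, used_ms
```

In a real coder the trigger level would vary with bitrate constraints, as the text notes; here it is simply a parameter.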
The same threshold calculation used for L and R thresholds is also
used for M and S thresholds, with the threshold calculated on the
actual M and S signals. First, the basic thresholds, denoted
BTHR.sub.m and BTHR.sub.s, are calculated. Then, the following steps
are used to calculate the stereo masking contribution of the M and
S signals.
1. An additional factor is calculated for each of the M and S
thresholds. This factor, called MLD.sub.m, and MLD.sub.s, is
calculated by multiplying the spread signal energy, (as derived,
e.g., in J. D. Johnston, "Transform Coding of Audio Signals Using
Perceptual Noise Criteria," IEEE J. on Selected Areas in
Communications, February 1988; K. Brandenburg and J. D. Johnston,
"Second Generation Perceptual Audio Coding: The Hybrid Coder," AES
89th Convention, 1990; and Brandenburg, et al U.S. Pat. No.
5,040,217) by a masking level difference factor shown
illustratively in FIG. 3. This calculates a second level of
detectability of noise across frequency in the M and S channels,
based on the masking level differences shown in various
sources.
2. The actual threshold for M (THR.sub.m) is calculated as
THR.sub.m=max(BTHR.sub.m, min(BTHR.sub.s,MLD.sub.s)) and the
threshold for S is calculated as
THR.sub.s=max(BTHR.sub.s,min(BTHR.sub.m, MLD.sub.m)).
In effect, the MLD signal substitutes for the BTHR signal in cases
where there is a chance of stereo unmasking. It is not necessary to
consider the issue of M and S threshold depression due to unequal L
and R thresholds, because of the fact that L and R thresholds are
known to be equal.
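The two max/min expressions for THR.sub.m and THR.sub.s translate directly into code. This sketch assumes the thresholds are supplied as per-partition lists; the function name is ours.

```python
def stereo_thresholds(bthr_m, bthr_s, mld_m, mld_s):
    """Per-partition stereo thresholds:
    THR_m = max(BTHR_m, min(BTHR_s, MLD_s))
    THR_s = max(BTHR_s, min(BTHR_m, MLD_m))
    The MLD term substitutes for the basic threshold where
    stereo unmasking is possible."""
    thr_m = [max(bm, min(bs, ms))
             for bm, bs, ms in zip(bthr_m, bthr_s, mld_s)]
    thr_s = [max(bs, min(bm, mm))
             for bs, bm, mm in zip(bthr_s, bthr_m, mld_m)]
    return thr_m, thr_s
```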
The quantizer/rate loop processor 206 used in the illustrative
coder of FIG. 2 takes the outputs from the analysis bank and the
perceptual model, allocates bits and noise, and controls other
system parameters so as to meet the required bit rate for the given
application. In some example coders this may consist of nothing
more than quantization so that the just noticeable difference of
the perceptual model is never exceeded, with no (explicit)
attention to bit rate; in some coders this may be a complex set of
iteration loops that adjusts distortion and bitrate in order to
achieve a balance between bit rate and coding noise. Also desirably
performed by the rate loop processor 206, and described in the rate
loop application, is the function of receiving information from the
quantized analyzed signal and any requisite side information, and
inserting synchronization and framing information. Again, these
same functions are broadly described in the incorporated
Brandenburg, et al, U.S. Pat. No. 5,040,217.
Entropy encoder 208 is used to achieve a further noiseless
compression in cooperation with the rate loop processor 206. In
particular, entropy encoder 208, in accordance with another aspect
of the present invention, advantageously receives inputs including
a quantized audio signal output from quantizer/rate loop 206,
performs a lossless encoding, on the quantized audio signal, and
outputs a compressed audio signal to the communications
channel/storage medium 106.
Illustrative entropy encoder 208 advantageously comprises a novel
variation of the minimum-redundancy Huffman coding technique to
encode each quantized audio signal. The Huffman codes are
described, e.g., in D. A. Huffman, "A Method for the Construction
of Minimum Redundancy Codes", Proc. IRE, 40: 1098-1101 (1952) and
T. M. Cover and J. A. Thomas, Elements of Information Theory,
pp. 92-101 (1991). The useful adaptations of the Huffman codes
advantageously used in the context of the coder of FIG. 2 are
described in more detail in the incorporated U.S. patent
application by J. D. Johnston and J. Reeds (hereinafter the
"entropy coder application") filed of even date with the present
application and assigned to the assignee of this application. Those
skilled in the data communications arts will readily perceive how
to implement alternative embodiments of entropy encoder 208 using
other noiseless data compression techniques, including the
well-known Lempel-Ziv compression methods.
The use of each of the elements shown in FIG. 2 will be described
in greater detail in the context of the overall system
functionality; details of operation will be provided for the
perceptual model processor 204.
2.1. The Analysis Filter Bank
The analysis filter bank 202 of the perceptual audio coder 104
receives as input pulse code modulated ("PCM") digital audio
signals (typically 16-bit signals sampled at 48 kHz), and outputs a
representation of the input signal which identifies the individual
frequency components of the input signal. Specifically, an output
of the analysis filter bank 202 comprises a Modified Discrete
Cosine Transform ("MDCT") of the input signal. See, J. Princen et
al, "Sub-band Transform Coding Using Filter Bank Designs Based on
Time Domain Aliasing Cancellation," IEEE ICASSP, pp. 2161-2164
(1987).
An illustrative analysis filter bank 202 according to one aspect of
the present invention is presented in FIG. 4. Analysis filter bank
202 comprises an input signal buffer 302, a window multiplier 304,
a window memory 306, an FFT processor 308, an MDCT processor 310, a
concatenator 311, a delay memory 312 and a data selector 314.
The analysis filter bank 202 operates on frames. A frame is
conveniently chosen as the 2N PCM input audio signal samples held
by input signal buffer 302. As stated above, each PCM input audio
signal sample is represented by M bits. Illustratively, N=512 and
M=16.
Input signal buffer 302 comprises two sections: a first section
comprising N samples in buffer locations 1 to N, and a second
section comprising N samples in buffer locations N+1 to 2N. Each
frame to be coded by the perceptual audio coder 104 is defined by
shifting N consecutive samples of the input audio signal into the
input signal buffer 302. Older samples are located at higher buffer
locations than newer samples.
Assuming that, at a given time, the input signal buffer 302
contains a frame of 2N audio signal samples, the succeeding frame
is obtained by (1) shifting the N audio signal samples in buffer
locations 1 to N into buffer locations N+1 to 2N, respectively,
(the previous audio signal samples in location N+1 to 2N may be
either overwritten or deleted), and (2) by shifting into the input
signal buffer 302, at buffer locations 1 to N, N new audio signal
samples from preprocessor 102. Therefore, it can be seen that
consecutive frames contain N samples in common: the first of the
consecutive frames having the common samples in buffer locations 1
to N, and the second of the consecutive frames having the common
samples in buffer locations N+1 to 2N. Analysis filter bank 202 is
a critically sampled system (i.e., for every N audio signal samples
received by the input signal buffer 302, the analysis filter bank
202 outputs a vector of N scalars to the quantizer/rate-loop
206).
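The frame-shifting convention described above (new samples in buffer locations 1 to N, the previous occupants of those locations moved to N+1 to 2N) can be sketched as follows. The helper name is ours, and Python's 0-based lists stand in for the patent's 1-based buffer locations.

```python
def shift_frame(buffer_2n, new_samples):
    """Form the next 2N-sample frame from the current one.

    The N samples formerly in locations 1..N move to locations
    N+1..2N (older samples sit at higher buffer locations), and the
    N new samples occupy locations 1..N.
    """
    n = len(new_samples)
    assert len(buffer_2n) == 2 * n
    return list(new_samples) + list(buffer_2n[:n])
```

Note the critical-sampling bookkeeping: N samples go in, and consecutive frames share N samples.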
Each frame of the input audio signal is provided to the window
multiplier 304 by the input signal buffer 302 so that the window
multiplier 304 may apply seven distinct data windows to the
frame.
Each data window is a vector of scalars called "coefficients".
While all seven of the data windows have 2N coefficients (i.e., the
same number as there are audio signal samples in the frame), four
of the seven have only N/2 non-zero coefficients (i.e., one-fourth
the number of audio signal samples in the frame). As is discussed
below, the data window coefficients may be advantageously chosen to
reduce the perceptual entropy of the output of the MDCT processor
310.
The information for the data window coefficients is stored in the
window memory 306. The window memory 306 may illustratively
comprise a random access memory ("RAM"), read only memory ("ROM"),
or other magnetic or optical media. Drawings of seven illustrative
data windows, as applied by window multiplier 304, are presented in
FIG. 14. Typical vectors of coefficients for each of the seven data
windows are also presented in FIG. 14. As may be seen in FIG. 14, some of
the data window coefficients may be equal to zero.
Keeping in mind that the data window is a vector of 2N scalars and
that the audio signal frame is also a vector of 2N scalars, the
data window coefficients are applied to the audio signal frame
scalars through point-to-point multiplication (i.e., the first
audio signal frame scalar is multiplied by the first data window
coefficient, the second audio signal frame scalar is multiplied by
the second data window coefficient, etc.). Window multiplier 304
may therefore comprise seven microprocessors operating in parallel,
each performing 2N multiplications in order to apply one of the
seven data windows to the audio signal frame held by the input
signal buffer 302. The output of the window multiplier 304 is seven
vectors of 2N scalars to be referred to as "windowed frame
vectors".
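The point-to-point multiplication is the whole of the windowing step; a minimal sketch, using plain Python lists and no assumption of parallel hardware:

```python
def apply_windows(frame, windows):
    """Multiply a 2N-sample frame point-to-point by each data
    window, yielding one windowed frame vector per window."""
    return [[w * x for w, x in zip(win, frame)] for win in windows]
```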
The seven windowed frame vectors are provided by window multiplier
304 to FFT processor 308. The FFT processor 308 performs an
odd-frequency FFT on each of the seven windowed frame vectors. The
odd-frequency FFT is a Discrete Fourier Transform evaluated at the
frequencies f=(k/2N)f.sub.N, where k=1,3,5, . . . ,2N-1, and
f.sub.N equals one half the sampling rate. The illustrative FFT
processor 308 may comprise seven conventional decimation-in-time
FFT processors operating in parallel, each operating on a different
windowed frame vector. An output of the FFT processor 308 is seven
vectors of 2N complex elements, to be referred to collectively as
"FFT vectors".
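The odd-frequency evaluation can be illustrated with a naive O(N.sup.2) DFT; a production coder would use a true FFT with a half-bin modulation instead. Indexing k'=0..2N-1 below corresponds to the odd multiples 2k'+1 of f.sub.N/2N in the text; the function name is ours.

```python
import cmath

def odd_frequency_dft(x):
    """Evaluate the DFT of a 2N-point windowed frame at half-bin
    offsets, i.e. at normalized frequencies (k + 0.5)/2N,
    k = 0..2N-1 (the "odd-frequency" grid)."""
    m = len(x)  # m = 2N
    return [
        sum(x[n] * cmath.exp(-2j * cmath.pi * (k + 0.5) * n / m)
            for n in range(m))
        for k in range(m)
    ]
```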
FFT processor 308 provides the seven FFT vectors to both the
perceptual model processor 204 and the MDCT processor 310. The
perceptual model processor 204 uses the FFT vectors to direct the
operation of the data selector 314 and the quantizer/rate-loop
processor 206. Details regarding the operation of data selector 314
and perceptual model processor 204 are presented below.
MDCT processor 310 performs an MDCT based on the real components of
each of the seven FFT vectors received from FFT processor 308. MDCT
processor 310 may comprise seven microprocessors operating in
parallel. Each such microprocessor determines one of the seven
"MDCT vectors" of N real scalars based on one of the seven
respective FFT vectors. For each FFT vector, F(k), the resulting
MDCT vector, X(k), is formed, for 1.ltoreq.k.ltoreq.N, from the
real component of F(k) modulated by a frequency-dependent cosine
phase term. The procedure need run k only to N, not 2N, because of
redundancy in the result. To wit, for N<k.ltoreq.2N:
X(k)=-X(2N-k). MDCT processor 310
provides the seven MDCT vectors to concatenator 311 and delay
memory 312.
As discussed above with reference to window multiplier 304, four of
the seven data windows have N/2 non-zero coefficients (see FIGS.
4c-f). This means that four of the windowed frame vectors contain
only N/2 non-zero values. Therefore, the non-zero values of these
four vectors may be concatenated into a single vector of length 2N
by concatenator 311 upon output from MDCT processor 310. The
resulting concatenation of these vectors is handled as a single
vector for subsequent purposes. Thus, delay memory 312 is presented
with four MDCT vectors, rather than seven.
Delay memory 312 receives the four MDCT vectors from MDCT processor
310 and concatenator 311 for the purpose of providing temporary
storage. Delay memory 312 provides a delay of one audio signal
frame (as defined by input signal buffer 302) on the flow of the
four MDCT vectors through the filter bank 202. The delay is
provided by (i) storing the two most recent consecutive sets of
MDCT vectors representing consecutive audio signal frames and (ii)
presenting as input to data selector 314 the older of the
consecutive sets of vectors. Delay memory 312 may comprise random
access memory (RAM) of size M x 2 x 4 x N, where 2 is
the number of consecutive sets of vectors, 4 is the number of
vectors in a set, N is the number of elements in an MDCT vector,
and M is the number of bits used to represent an MDCT vector
element.
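The sizing of delay memory 312 is simple arithmetic; a one-line helper makes the factors explicit (the function name is ours, and the result is in bits, since M is bits per element):

```python
def delay_memory_bits(m_bits, n_elems):
    """RAM needed by the delay memory: 2 consecutive sets of 4 MDCT
    vectors, each vector holding N elements of M bits."""
    sets, vectors_per_set = 2, 4
    return m_bits * sets * vectors_per_set * n_elems
```

For the illustrative M=16, N=512 this gives 65536 bits.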
Data selector 314 selects one of the four MDCT vectors provided by
delay memory 312 to be output from the filter bank 202 to
quantizer/rate-loop 206. As mentioned above, the perceptual model
processor 204 directs the operation of data selector 314 based on
the FFT vectors provided by the FFT processor 308. Due to the
operation of delay memory 312, the seven FFT vectors provided to
the perceptual model processor 204 and the four MDCT vectors
concurrently provided to data selector 314 are not based on the
same audio input frame, but rather on two consecutive input signal
frames--the MDCT vectors based on the earlier of the frames, and
the FFT vectors based on the later of the frames. Thus, the
selection of a specific MDCT vector is based on information
contained in the next successive audio signal frame. The criteria
according to which the perceptual model processor 204 directs the
selection of an MDCT vector is described in Section 2.2, below.
For purposes of an illustrative stereo embodiment, the above
analysis filterbank 202 is provided for each of the left and right
channels.
2.2. The Perceptual Model Processor
A perceptual coder achieves success in reducing the number of bits
required to accurately represent high quality audio signals, in
part, by introducing noise associated with quantization of
information bearing signals, such as the MDCT information from the
filter bank 202. The goal is, of course, to introduce this noise in
an imperceptible or benign way. This noise shaping is primarily a
frequency-domain operation, so it is convenient to convert a
signal into a spectral representation (e.g., the MDCT vectors
provided by filter bank 202), compute the shape and amount of the
noise that will be masked by these signals, and inject that noise
by quantizing the spectral values. These and other basic operations
are represented in the structure of the perceptual coder shown in
FIG. 2.
The perceptual model processor 204 of the perceptual audio coder
104 illustratively receives its input from the analysis filter bank
202 which operates on successive frames. The perceptual model
processor inputs then typically comprise seven Fast Fourier
Transform (FFT) vectors from the analysis filter bank 202. These
are the outputs of the FFT processor 308 in the form of seven
vectors of 2N complex elements, each corresponding to one of the
windowed frame vectors.
In order to mask the quantization noise by the signal, one must
consider the spectral contents of the signal and the duration of a
particular spectral pattern of the signal. These two aspects are
related to masking in the frequency domain, where signal and noise
are approximately steady state--given the integration period of the
hearing system--and also to masking in the time domain, where
signal and noise are subjected to different cochlear filters. The
shape and length of these filters are frequency dependent.
Masking in the frequency domain is described by the concept of
simultaneous masking. Masking in the time domain is characterized
by the concept of premasking and postmasking. These concepts are
extensively explained in the literature; see, for example, E.
Zwicker and H. Fastl, "Psychoacoustics, Facts, and Models,"
Springer-Verlag, 1990. To make these concepts useful to perceptual
coding, they are embodied in different ways.
Simultaneous masking is evaluated by using perceptual noise shaping
models. Given the spectral contents of the signal and its
description in terms of noise-like or tone-like behavior, these
models produce a hypothetical masking threshold that governs the
quantization level of each spectral component. This noise shaping
represents the maximum amount of noise that may be introduced in
the original signal without causing any perceptible difference. A
measure called the PERCEPTUAL ENTROPY (PE) uses this hypothetical
masking threshold to estimate the theoretical lower bound of the
bitrate for transparent encoding. J. D. Johnston, "Estimation of
Perceptual Entropy Using Noise Masking Criteria," ICASSP, 1989.
Premasking characterizes the (in)audibility of a noise that starts
some time before the masker signal which is louder than the noise.
The noise amplitude must be more attenuated as the delay increases.
This attenuation level is also frequency dependent. If the noise is
the quantization noise attenuated by the first half of the
synthesis window, experimental evidence indicates the maximum
acceptable delay to be about 1 millisecond.
This problem is very sensitive and can conflict directly with
achieving a good coding gain. Assuming stationary conditions--which
is a false premise--the coding gain is bigger for larger
transforms, but the quantization error spreads to the beginning
of the reconstructed time segment. So, if a transform length of
1024 points is used, with a digital signal sampled at a rate of
48000 Hz, the noise will appear at most 21 milliseconds before the
signal. This scenario is particularly critical when the signal
takes the form of a sharp transient in the time domain commonly
known as an "attack". In this case the quantization noise is
audible before the attack. The effect is known as pre-echo.
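The timing arithmetic above (a 1024-point transform at 48000 Hz spans about 21 ms, so pre-echo can precede an attack by up to that span) can be checked with a one-line helper; the function name is ours, not the patent's.

```python
def transform_span_ms(transform_length, sample_rate_hz):
    """Duration covered by one transform block, in milliseconds."""
    return 1000.0 * transform_length / sample_rate_hz
```

The same helper reproduces the short-window figure quoted later: 256 samples at 48 kHz is about 5.3 ms.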
Thus, a fixed length filter bank is neither a good perceptual
solution nor a good signal processing solution for non-stationary
regions of the signal. It will be shown later that a possible way
to circumvent this problem is to improve the temporal resolution of
the coder by reducing the analysis/synthesis window length. This is
implemented as a window switching mechanism when conditions of
attack are detected. In this way, the coding gain achieved by using
a long analysis/synthesis window will be affected only when such
detection occurs with a consequent need to switch to a shorter
analysis/synthesis window.
Postmasking characterizes the (in)audibility of a noise when it
remains after the cessation of a stronger masker signal. In this
case the acceptable delays are in the order of 20 milliseconds.
Given that the larger transform time segment lasts about 21
milliseconds (1024 samples), no special care is needed to handle
this situation.
WINDOW SWITCHING
The PERCEPTUAL ENTROPY (PE) measure of a particular transform
segment gives the theoretical lower bound of bits/sample to code
that segment transparently. Due to its memory properties, which are
related to premasking protection, this measure shows a significant
increase of the PE value relative to its previous value--related to
the previous segment--when situations of strong non-stationarity
of the signal (e.g. an attack) are present. This important
property is used to activate the window switching mechanism in
order to reduce pre-echo. This window switching mechanism is not a
new strategy, having been used, e.g., in the ASPEC coder, described
in the ISO/MPEG Audio Coding Report, 1990, but the decision
technique behind it is new: it uses the PE information to
accurately localize the non-stationarity and define the right
moment to operate the switch.
Two basic window lengths: 1024 samples and 256 samples are used.
The former corresponds to a segment duration of about 21
milliseconds and the latter to a segment duration of about 5
milliseconds. Short windows are grouped in sets of 4 to
represent as much spectral data as a large window (but they
represent a "different" number of temporal samples). In order to
make the transition from large to short windows and vice-versa it
proves convenient to use two more types of windows. A START window
makes the transition from large (regular) to short windows and a
STOP window makes the opposite transition, as shown in FIG. 5b. See
the above-cited Princen reference for useful information on this
subject. Both windows are 1024 samples wide. They are useful to
keep the system critically sampled and also to guarantee the time
aliasing cancellation process in the transition region.
In order to exploit interchannel redundancy and irrelevancy, the
same type of window is used for RIGHT and LEFT channels in each
segment.
The stationarity behavior of the signal is monitored at two levels.
First by large regular windows, then if necessary, by short
windows. Accordingly, the PE of large (regular) window is
calculated for every segment while the PE of short windows are
calculated only when needed. However, the tonality information for
both types is updated for every segment in order to follow the
continuous variation of the signal.
Unless stated otherwise, a segment involves 1024 samples, which is
the length of a large regular window.
The diagram of FIG. 5a represents all the monitoring possibilities
when the segment from the point N/2 till the point 3N/2 is being
analysed. Related to the diagram of FIG. 5 is the flowchart of FIG.
6 which describes the monitoring sequence and decision technique.
We need to keep in buffer three halves of a segment in order to be
able to insert a START window prior to a sequence of short windows
when necessary. FIGS. 5a-e explicitly consider the 50% overlap
between successive segments.
The process begins by analysing a "new" segment with 512 new
temporal samples (the remaining 512 samples belong to the previous
segment). As shown in FIG. 6, the PE of this new segment and the
differential PE to the previous segment are calculated (601). If
the latter value reaches a predefined threshold (602), then the
existence of a non-stationarity inside the current segment is
declared and details are obtained by processing four short windows
with positions as represented in FIG. 5a. The PE value of each
short window is calculated (603) resulting in the ordered sequence:
PE1, PE2, PE3 and PE4. From these values, the exact beginning of
the strong non-stationarity of the signal is deduced. Only five
locations are possible, identified in FIG. 5a as L1, L2, L3, L4 and
L5. As it will become evident, if the non-stationarity had occurred
somewhere from the point N/2 till the point 15N/16, that situation
would have been detected in the previous segment. It follows that
the PE1 value does not contain relevant information about the
stationarity of the current segment. The average PE of the short
windows is compared with the PE of the large window of the same
segment (605). A smaller PE reveals a more efficient coding
situation. Thus if the former value is not smaller than the latter,
then we assume that we are facing a degenerate situation and the
window switching process is aborted.
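The decision steps of blocks 601-605 can be sketched as follows. The function name and the threshold parameter are illustrative assumptions; PE values are treated simply as numbers.

```python
def should_switch(pe_large, pe_large_prev, short_pes, dpe_threshold):
    """Window-switch decision sketch.

    pe_large:       PE of the large window for the current segment.
    pe_large_prev:  PE of the previous segment's large window.
    short_pes:      PE1..PE4 of the four short windows.
    Returns True when the differential PE signals a non-stationarity
    and short windows are the more efficient choice.
    """
    # Block 602: no strong non-stationarity detected.
    if pe_large - pe_large_prev < dpe_threshold:
        return False
    # Block 605: a smaller PE reveals a more efficient coding
    # situation; if the short-window average is not smaller than the
    # large-window PE, the switch is aborted as degenerate.
    avg_short = sum(short_pes) / len(short_pes)
    return avg_short < pe_large
```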
It has been observed that for short windows the information about
stationarity lies more in the PE value itself than in its
differential relative to the PE value of the preceding window.
Accordingly, the first window
that has a PE value larger than a predefined threshold is detected.
PE2 is identified with location L1, PE3 with L2 and PE4 with
location L3. In each case, a START window (608) is placed before
the current segment that will be coded with short windows. A STOP
window is needed to complete the process (616). There are, however,
two possibilities. If the identified location where the strong
non-stationarity of the signal begins is L1 or L2, then it lies
well inside the short window sequence, no coding artifacts result,
and the coding sequence is depicted in FIG. 5b. If the location is
L3 (612), then, in the worst situation, the non-stationarity may
begin very close to the right edge of the last short window.
Previous results have consistently shown that placing a STOP
window--in coding conditions--in these circumstances degrades
significantly the reconstruction of the signal in this switching
point. For this reason, another set of four short windows is placed
before a STOP window (614). The resulting coding sequence is
represented in FIG. 5e.
If none of the short PEs is above the threshold, the remaining
possibilities are L4 or L5. In this case, the problem lies beyond
the scope of the short window sequence and the first segment in the
buffer may be immediately coded using a regular large window.
To identify the correct location, another short window must be
processed. It is represented in FIG. 5a by a dotted curve and its
PE value, PE1.sub.n+1, is also computed. As it is easily
recognized, this short window already belongs to the next segment.
If PE1.sub.n+1 is above the threshold (611), then, the location is
L4 and, as depicted in FIG. 5c, a START window (613) may be
followed by a STOP window (615). In this case the spread of the
quantization noise will be limited to the length of a short window,
and a better coding gain is achieved. In the rare situation of the
location being L5, then the coding is done according to the
sequence of FIG. 5d. The way to prove that this is the right
solution is to confirm that PE2.sub.n+1 will be above the
threshold. PE2.sub.n+1 is the PE of the short window (not
represented in FIG. 5) immediately following the window identified
with PE1.sub.n+1.
As mentioned before for each segment, RIGHT and LEFT channels use
the same type of analysis/synthesis window. This means that a
switch is done for both channels when at least one channel requires
it.
It has been observed that for low bitrate applications the solution
of FIG. 5c, although representing a good local psychoacoustic
solution, demands an unreasonably large number of bits that may
adversely affect the coding quality of subsequent segments. For
this reason, that coding solution may eventually be inhibited.
It is also evident that the details of the reconstructed signal
when short windows are used are closer to the original signal than
when only regular large windows are used. This is so because the
attack is basically a wide bandwidth signal and may only be
considered stationary for very short periods of time. Since short
windows have a greater temporal resolution than large windows, they
are able to follow and reproduce with more fidelity the varying
pattern of the spectrum. In other words, this is the difference
between a more precise local (in time) quantization of the signal
and a global (in frequency) quantization of the signal.
The final masking threshold of the stereophonic coder is calculated
using a combination of monophonic and stereophonic thresholds.
While the monophonic threshold is computed independently for each
channel, the stereophonic one considers both channels.
The independent masking threshold for the RIGHT or the LEFT channel
is computed using a psychoacoustic model that includes an
expression for tone masking noise and noise masking tone. The
latter is used as a conservative approximation for a noise masking
noise expression. The monophonic threshold is calculated using the
same procedure as previous work. In particular, a tonality measure
considers the evolution of the power and the phase of each
frequency coefficient across the last three segments to identify
the signal as being more tone-like or noise-like. Accordingly,
each psychoacoustic expression is more or less weighted than the
other. These expressions, found in the literature, were updated for
better performance. Both are defined as functions of the frequency
expressed in the Bark scale.
The scale is related to what we may call the cochlear filters or
critical bands which, in turn, are identified with constant length
segments of the basilar membrane. The final threshold is adjusted
to consider absolute thresholds of masking and also to consider a
partial premasking protection.
A brief description of the complete monophonic threshold
calculation follows. Some terminology must be introduced in order
to simplify the description of the operations involved.
The spectrum of each segment is organized in three different ways,
each one following a different purpose.
1. First, it may be organized in partitions. Each partition has
a single Bark value associated with it. These partitions provide a
resolution of approximately either one MDCT line or 1/3 of a
critical band, whichever is wider. At low frequencies a single line
of the MDCT will constitute a coder partition. At high frequencies,
many lines will be combined into one coder partition. In this case
the associated Bark value is the median Bark point of the
partition. This partitioning of the spectrum is necessary to ensure
an acceptable resolution for the spreading function. As will be
shown later, this function represents the masking influence among
neighboring critical bands.
2. Secondly, the spectrum may be organized in bands. Bands are
defined by a parameter file. Each band groups a number of spectral
lines that are associated with a single scale factor that results
from the final masking threshold vector.
3. Finally, the spectrum may also be organized in sections. It will
be shown later that sections involve an integer number of bands and
represent a region of the spectrum coded with the same Huffman code
book.
Three indices for data values are used. These are:
.omega.: indicates that the calculation is indexed by frequency in
the MDCT line domain.
b: indicates that the calculation is indexed in the threshold
calculation partition domain. In the case where we do a convolution
or sum in that domain, bb will be used as the summation variable.
n: indicates that the calculation is indexed in the coder band
domain.
Additionally some symbols are also used:
1. The index of the calculation partition, b.
2. The lowest frequency line in the partition, ωlow_b.
3. The highest frequency line in the partition, ωhigh_b.
4. The median Bark value of the partition, bval_b.
5. The value for tone masking noise (in dB) for the partition, TMN_b.
6. The value for noise masking tone (in dB) for the partition, NMT_b.
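For concreteness, these per-partition symbols can be bundled in a
small record. The following Python sketch is illustrative only: the
field names and the example values are stand-ins, not taken from the
patent's parameter tables.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Partition:
    """Per-partition constants used by the threshold calculation (items 1-6)."""
    b: int          # 1. index of the calculation partition
    w_low: int      # 2. lowest frequency line in the partition
    w_high: int     # 3. highest frequency line in the partition
    bval: float     # 4. median Bark value of the partition
    tmn_db: float   # 5. tone-masking-noise value for the partition, in dB
    nmt_db: float   # 6. noise-masking-tone value for the partition, in dB

    @property
    def n_lines(self):
        # number of MDCT lines the partition covers
        return self.w_high - self.w_low + 1

# Illustrative low-frequency partition: a single MDCT line.
p0 = Partition(b=0, w_low=0, w_high=0, bval=0.02, tmn_db=24.5, nmt_db=5.5)
```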
Several points in the following description refer to the "spreading
function". It is calculated by the following method:
tmpx = 1.05(j−i), where i is the Bark value of the signal being
spread, j the Bark value of the band being spread into, and tmpx is
a temporary variable.
x = 8·minimum((tmpx−0.5)^2 − 2(tmpx−0.5), 0)
where x is a temporary variable, and minimum(a,b) is a function
returning the more negative of a or b.
tmpy = 15.811389 + 7.5(tmpx+0.474) − 17.5(1.0+(tmpx+0.474)^2)^0.5
where tmpy is another temporary variable. The value of the
spreading function is then:
sprdngf(i,j) = 0 if tmpy < −100; otherwise
sprdngf(i,j) = 10^((x+tmpy)/10)
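The spreading function can be transcribed directly; the following
Python sketch uses the conventional name sprdngf, with i and j the
Bark values of the masker and the masked band.

```python
import math

def sprdngf(i, j):
    """Spreading function value from a masker at Bark value i into the
    band at Bark value j, following the formulas in the text."""
    tmpx = 1.05 * (j - i)
    x = 8.0 * min((tmpx - 0.5) ** 2 - 2.0 * (tmpx - 0.5), 0.0)
    tmpy = (15.811389 + 7.5 * (tmpx + 0.474)
            - 17.5 * math.sqrt(1.0 + (tmpx + 0.474) ** 2))
    # below -100 dB the contribution is treated as zero
    return 0.0 if tmpy < -100.0 else 10.0 ** ((x + tmpy) / 10.0)
```

On the masker's own partition (i = j) the function is approximately
1 (0 dB) by construction, and it decays to zero for partitions far
away on the Bark scale.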
Steps in Threshold Calculation
The following steps are the necessary steps for calculating the
SMR_n used in the coder.
1. Concatenate 512 new samples of the input signal to form another
1024-sample segment. Please refer to FIG. 5a.
2. Calculate the complex spectrum of the input signal using the
O-FFT as described in 2.0 and using a sine window.
3. Calculate a predicted r and φ.
The polar representation of the transform is calculated.
r_ω and φ_ω represent the magnitude and phase components of a
spectral line of the transformed segment.
A predicted magnitude, r̂_ω, and phase, φ̂_ω, are calculated from
the preceding two threshold calculation blocks' r and φ:
r̂_ω = 2r_ω(t−1) − r_ω(t−2)
φ̂_ω = 2φ_ω(t−1) − φ_ω(t−2)
where t represents the current block number, t−1 indexes the
previous block's data, and t−2 indexes the data from the threshold
calculation block before that.
4. Calculate the unpredictability
measure, c_ω.
c_ω, the unpredictability measure, is:
c_ω = ((r_ω·cos φ_ω − r̂_ω·cos φ̂_ω)^2
      + (r_ω·sin φ_ω − r̂_ω·sin φ̂_ω)^2)^0.5 / (r_ω + |r̂_ω|)
5.
Calculate the energy and unpredictability in the threshold
calculation partitions.
The energy in each partition, e_b, is:
e_b = Σ_{ω=ωlow_b}^{ωhigh_b} r_ω^2
and the weighted unpredictability, c_b, is:
c_b = Σ_{ω=ωlow_b}^{ωhigh_b} r_ω^2·c_ω
6. Convolve the partitioned energy and unpredictability with the
spreading function:
ecb_b = Σ_{bb} e_bb·sprdngf(bval_bb, bval_b)
ct_b = Σ_{bb} c_bb·sprdngf(bval_bb, bval_b)
Because ct_b is weighted by the signal energy, it must be
renormalized to cb_b:
cb_b = ct_b / ecb_b
At the same time, due to the non-normalized nature of the spreading
function, ecb_b should be renormalized and the normalized energy,
en_b, calculated:
en_b = ecb_b·rnorm_b
The normalization coefficient, rnorm_b, is:
rnorm_b = 1 / Σ_{bb} sprdngf(bval_bb, bval_b)
7. Convert cb_b to tb_b:
tb_b = −0.299 − 0.43·log_e(cb_b)
Each tb_b is limited to the range 0 ≤ tb_b ≤ 1.
8. Calculate the required SNR in each partition, where TMN_b is the
tone masking noise in dB and NMT_b is the noise masking tone value
in dB, as introduced in the symbol list above.
The required signal to noise ratio, SNR_b, is:
SNR_b = tb_b·TMN_b + (1 − tb_b)·NMT_b
9. Calculate the power ratio.
The power ratio, bc_b, is:
bc_b = 10^(−SNR_b/10)
10. Calculation of the actual energy threshold, nb_b:
nb_b = en_b·bc_b
11. Spread the threshold energy over MDCT lines, yielding nb_ω:
nb_ω = nb_b / (ωhigh_b − ωlow_b + 1), for ωlow_b ≤ ω ≤ ωhigh_b
12. Include absolute thresholds, yielding the final energy
threshold of audibility, thr_ω:
thr_ω = max(nb_ω, absthr_ω)
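Steps 5 through 12 can be strung together in a single pass. The
following Python sketch is a reconstruction, not the patent's
reference code: it assumes per-partition tables (w_low, w_high,
bval, TMN, NMT) as introduced above and repeats the spreading
function so the fragment is self-contained.

```python
import math

def sprdngf(i, j):
    """Spreading function from Bark value i into Bark value j."""
    tmpx = 1.05 * (j - i)
    x = 8.0 * min((tmpx - 0.5) ** 2 - 2.0 * (tmpx - 0.5), 0.0)
    tmpy = (15.811389 + 7.5 * (tmpx + 0.474)
            - 17.5 * math.sqrt(1.0 + (tmpx + 0.474) ** 2))
    return 0.0 if tmpy < -100.0 else 10.0 ** ((x + tmpy) / 10.0)

def thresholds(r, c, w_low, w_high, bval, tmn, nmt, absthr):
    """Steps 5-12: per-line magnitudes r and unpredictability c in,
    final per-line energy thresholds of audibility thr out."""
    nparts = len(bval)
    # Step 5: partition energy e_b and weighted unpredictability c_b.
    e = [sum(r[w] ** 2 for w in range(w_low[b], w_high[b] + 1))
         for b in range(nparts)]
    cw = [sum(r[w] ** 2 * c[w] for w in range(w_low[b], w_high[b] + 1))
          for b in range(nparts)]
    thr = [0.0] * len(r)
    for b in range(nparts):
        # Step 6: convolve energy and unpredictability with sprdngf.
        ecb = sum(e[bb] * sprdngf(bval[bb], bval[b]) for bb in range(nparts))
        ct = sum(cw[bb] * sprdngf(bval[bb], bval[b]) for bb in range(nparts))
        rnorm = 1.0 / sum(sprdngf(bval[bb], bval[b]) for bb in range(nparts))
        cb = ct / ecb if ecb > 0.0 else 0.0   # renormalized unpredictability
        en = ecb * rnorm                      # renormalized energy
        # Step 7: tonality, limited to the range [0, 1].
        tb = min(1.0, max(0.0, -0.299 - 0.43 * math.log(cb))) if cb > 0 else 1.0
        # Step 8: required SNR; step 9: power ratio; step 10: threshold.
        snr = tb * tmn[b] + (1.0 - tb) * nmt[b]
        nb = en * 10.0 ** (-snr / 10.0)
        # Step 11: spread over MDCT lines; step 12: absolute threshold.
        nlines = w_high[b] - w_low[b] + 1
        for w in range(w_low[b], w_high[b] + 1):
            thr[w] = max(nb / nlines, absthr[w])
    return thr
```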
The dB values of absthr shown in the "Absolute Threshold Tables"
are relative to the level that a sine wave of ±1/2 LSB has in the
MDCT used for threshold calculation. The dB values must be
converted into the energy domain after considering the MDCT
normalization actually used.
13. Pre-echo control.
14. Calculate the signal to mask ratios, SMR_n.
The table of "Bands of the Coder" shows 1. The index, n, of the
band. 2. The upper index, .omega.high.sub.n of the band n. The
lower index, .omega.low.sub.n, is computed from the previous band
as .omega.high.sub.n-1+1.
To further classify each band, another variable is created. The
width index, width_n, will assume a value width_n = 1 if n is a
perceptually narrow band, and width_n = 0 if n is a perceptually
wide band. The former case occurs if
bval_{ωhigh_n} − bval_{ωlow_n} < bandlength
where bandlength is a parameter set in the initialization routine.
Otherwise the latter case is assumed.
Then, if width_n = 1, the noise level in the coder band, nband_n,
is calculated as:
nband_n = (Σ_{ω=ωlow_n}^{ωhigh_n} thr_ω) / (ωhigh_n − ωlow_n + 1)
else,
nband_n = minimum(thr_{ωlow_n}, . . . , thr_{ωhigh_n})
where, in this case, minimum(a, . . . , z) is a function returning
the most negative or smallest positive argument of the arguments
a . . . z.
The ratios to be sent to the decoder, SMR_n, are calculated as:
SMR_n = 10·log10(e_n / nband_n)
where e_n is the energy of the spectral lines in band n.
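The per-band noise level and SMR computation can be sketched as
follows. Note the assumptions: the band energy e_band used in the
SMR is reconstructed from context, bval here is a per-line Bark
value, and bandlength is the initialization parameter named above.

```python
import math

def nband_and_smr(thr, e_band, w_low, w_high, bval, bandlength):
    """Per coder band n: noise level nband_n and signal-to-mask
    ratio SMR_n. thr: per-line thresholds; e_band[n]: spectral
    energy in band n (an assumption, not from the patent text)."""
    nband, smr = [], []
    for n in range(len(w_low)):
        lines = range(w_low[n], w_high[n] + 1)
        # width_n = 1 (perceptually narrow) if the band spans fewer
        # than bandlength Barks.
        narrow = (bval[w_high[n]] - bval[w_low[n]]) < bandlength
        if narrow:
            # narrow band: average the per-line thresholds
            nb = sum(thr[w] for w in lines) / (w_high[n] - w_low[n] + 1)
        else:
            # wide band: take the minimum per-line threshold
            nb = min(thr[w] for w in lines)
        nband.append(nb)
        smr.append(10.0 * math.log10(e_band[n] / nb))
    return nband, smr
```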
It is important to emphasize that since the tonality measure is the
output of a spectrum analysis process, the analysis window has a
sine form for all the cases of large or short segments. In
particular, when a segment is chosen to be coded as a START or STOP
window, its tonality information is obtained considering a sine
window; the remaining operations, e.g. the threshold calculation
and the quantization of the coefficients, consider the spectrum
obtained with the appropriate window.
STEREOPHONIC THRESHOLD
The stereophonic threshold has several goals. It is known that most
of the time the two channels sound "alike". Thus, some correlation
exists that may be converted into coding gain. Looking into the
temporal representation of the two channels, this correlation is
not obvious. However, the spectral representation has a number of
interesting features that may advantageously be exploited. In fact,
a very practical and useful possibility is to create a new basis to
represent the two channels. This basis involves two orthogonal
vectors, the vector SUM and the vector DIFFERENCE, defined by the
following linear combination:
SUM_ω = (RIGHT_ω + LEFT_ω)/2
DIFFERENCE_ω = (RIGHT_ω − LEFT_ω)/2
These vectors, which have the length of the window being used, are
generated in the frequency domain since the transform process is by
definition a linear operation. This has the advantage of
simplifying the computational load.
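The SUM/DIFFERENCE basis change amounts to a per-line butterfly; a
minimal Python sketch follows, in which the 1/2 normalization is an
assumption reconstructed from the text rather than a value confirmed
by the patent.

```python
def ms_transform(right, left):
    """Per-line SUM/DIFFERENCE basis; the 1/2 normalization is assumed."""
    s = [(r + l) / 2.0 for r, l in zip(right, left)]
    d = [(r - l) / 2.0 for r, l in zip(right, left)]
    return s, d

def ms_inverse(s, d):
    """Recover the channels: RIGHT = SUM + DIFF, LEFT = SUM - DIFF."""
    right = [si + di for si, di in zip(s, d)]
    left = [si - di for si, di in zip(s, d)]
    return right, left
```

Since the MDCT is linear, applying this butterfly to the spectra is
equivalent to transforming the time-domain sum and difference
signals, which is why the vectors can be generated directly in the
frequency domain.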
The first goal is to have a more decorrelated representation of the
two signals. The concentration of most of the energy in one of
these new channels is a consequence of the redundancy that exists
between the RIGHT and LEFT channels and, on average, always leads
to a coding gain.
A second goal is to correlate the quantization noise of the RIGHT
and LEFT channels and control the localization of the noise or the
unmasking effect. This problem arises if RIGHT and LEFT channels
are quantized and coded independently. This concept is exemplified
by the following context: supposing that the threshold by masking
for a particular signal has been calculated, two situations may be
created. First, we add to the signal an amount of noise that
corresponds to the threshold. If we present this same signal with
this same noise to the two ears then the noise is masked. However,
if we add an amount of noise that corresponds to the threshold to
the signal and present this combination to one ear; do the same
operation for the other ear but with noise uncorrelated with the
previous one, then the noise is not masked. In order to achieve
masking again, the noise at both ears must be reduced by a level
given by the masking level differences (MLD).
The unmasking problem may be generalized to the following form: the
quantization noise is not masked if it does not follow the
localization of the masking signal. Hence, in particular, we may
have two limit cases: center localization of the signal with
unmasking more noticeable on the sides of the listener and side
localization of the signal with unmasking more noticeable on the
center line.
The new vectors SUM and DIFFERENCE are very convenient because they
express the signal localized on the center and also on both sides
of the listener. They also make it possible to control the
quantization noise with center and side image. Thus, the unmasking
problem is solved by controlling the protection level for the MLD
through these vectors. Based on psychoacoustic information and
other experiments and results, the MLD protection is particularly
critical from very low frequencies up to about 3 kHz. It appears to
depend only on the signal power and not on its tonality properties.
The following expression for the MLD proved to give good results:
MLD(i) = 10^(1.25(1 − cos(π·b(i)/16)) − 2.5)
where i is the partition index of the spectrum (see [7]), and b(i)
is the Bark frequency of the center of the partition i. This
expression is only valid for b(i) ≤ 16.0, i.e. for frequencies
below 3 kHz. The expression for the MLD threshold is given by:
THR_MLD(i) = MLD(i)·C(i)
where C(i) is the spread signal energy on the basilar membrane,
corresponding only to the partition i.
A third and last goal is to take advantage of a particular
stereophonic signal image to extract irrelevance from directions of
the signal that are masked by that image. In principle, this is
done only when the stereo image is strongly defined in one
direction, in order to not compromise the richness of the stereo
signal. Based on the vectors SUM and DIFFERENCE, this goal is
implemented by the following two dual principles:
1. If there is a strong depression of the signal (and hence of the
noise) on both sides of the listener, then an increase of the noise
on the middle line (center image) is perceptually tolerated. The
upper bound is the side noise.
2. If there is a strong localization of the signal (and hence of
the noise) on the middle line, then an increase of the (correlated)
noise on both sides is perceptually tolerated. The upper bound is
the center noise.
However, any increase of the noise level must be corrected by the
MLD threshold.
According to these goals, the final stereophonic threshold is
computed as follows. First, the thresholds for channels SUM and
DIFFERENCE are calculated using the monophonic models for
noise-masking-tone and tone-masking-noise. The procedure is exactly
the one presented in pages 25 and 26. At this point we have the
actual energy threshold per band, nb_b, for both channels. For
convenience, we call them THRn_SUM and THRn_DIF, respectively for
the channel SUM and the channel DIFFERENCE.
Secondly, the MLD thresholds for both channels, i.e. THRn_MLD,SUM
and THRn_MLD,DIF, are also calculated by:
THRn_MLD,SUM = MLD(i)·C_SUM(i)
THRn_MLD,DIF = MLD(i)·C_DIF(i)
The MLD protection and the stereo irrelevance are considered by
computing:
nthr_SUM = MAX[THRn_SUM, MIN(THRn_DIF, THRn_MLD,DIF)]
nthr_DIF = MAX[THRn_DIF, MIN(THRn_SUM, THRn_MLD,SUM)]
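The MLD protection and stereo irrelevance rules above can be
sketched per partition as follows. Both the MLD constants and the
argument shapes are reconstructions from the text, so treat this as
an illustrative sketch rather than the patent's reference code.

```python
import math

def mld(bval):
    """MLD weighting for a partition at Bark value bval (valid for
    bval <= 16); the constants follow the reconstruction in the text."""
    return 10.0 ** (1.25 * (1.0 - math.cos(math.pi * bval / 16.0)) - 2.5)

def stereo_thresholds(thr_sum, thr_dif, c_sum, c_dif, bval):
    """Apply MLD protection and stereo irrelevance per partition.
    c_sum/c_dif: spread signal energy C(i) for each channel."""
    nthr_sum, nthr_dif = [], []
    for i in range(len(bval)):
        thr_mld_sum = mld(bval[i]) * c_sum[i]
        thr_mld_dif = mld(bval[i]) * c_dif[i]
        # each channel may rise to the other channel's threshold,
        # but never above the MLD-protected level
        nthr_sum.append(max(thr_sum[i], min(thr_dif[i], thr_mld_dif)))
        nthr_dif.append(max(thr_dif[i], min(thr_sum[i], thr_mld_sum)))
    return nthr_sum, nthr_dif
```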
After these operations, the remaining steps after the 11th, as
presented in 3.2, are also taken for both channels. In essence,
these last thresholds are further adjusted to consider the absolute
threshold and also a partial premasking protection. It should be
noted that this premasking protection was simply adopted from the
monophonic case. It considers a monaural time resolution of about 2
milliseconds. However, the binaural time resolution is as accurate
as 6 microseconds! How to conveniently code stereo signals whose
relevant stereo image is based on interchannel time differences is
a subject that needs further investigation.
STEREOPHONIC CODER
The simplified structure of the stereophonic coder allows for the
encoding of the stereo signals, which are subsequently decoded by
the stereophonic decoder presented in FIG. 12. For each
segment of data being analysed, detailed information about the
independent and relative behavior of both signal channels may be
available through the information given by large and short
transforms. This information is used according to the necessary
number of steps needed to code a particular segment. These steps
involve essentially the selection of the analysis window, the
definition on a band basis of the coding mode (R/L or S/D), the
quantization (704) and Huffman coding (705) of the coefficients
(708) and scale factors (707) and finally, the bitstream composing
(706) with a bit stream organization as depicted in FIG. 10.
Coding Mode Selection
When a new segment is read, the tonality updating for large and
short analysis windows is done. Monophonic thresholds and the PE
values are calculated according to the technique described
previously. This gives the first decision about the type of window
to be used for both channels.
Once the window sequence is chosen, an orthogonal coding decision
is then considered. It involves the choice between independent
coding of the channels, mode RIGHT/LEFT (R/L) or joint coding using
the SUM and DIFFERENCE channels (S/D). This decision is taken on a
band basis of the coder. This is based on the assumption that the
binaural perception is a function of the output of the same
critical bands at the two ears. If the threshold at the two
channels is very different, then there is no need for MLD
protection and the signals will not be more decorrelated if the
channels SUM and DIFFERENCE are considered. If the signals are such
that they generate a stereo image, then a MLD protection must be
activated and additional gains may be exploited by choosing the S/D
coding mode. A convenient way to detect this latter situation is by
comparing the monophonic threshold between RIGHT and LEFT channels.
If the thresholds in a particular band do not differ by more than a
predefined value, e.g. 2 dB, then the S/D coding mode is chosen.
Otherwise the independent mode R/L is assumed. Associated with
each band is a one-bit flag that specifies the coding mode of that
band and that must be transmitted to the decoder as side
information. From now on it is called a coding mode flag.
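The per-band mode decision described above can be sketched in a few
lines of Python. The 2 dB margin is the example value given in the
text; the per-band threshold inputs are assumed to be energies.

```python
import math

def coding_mode_flags(thr_right, thr_left, margin_db=2.0):
    """One flag per coder band: True -> S/D (joint) coding,
    False -> independent R/L coding. S/D is chosen when the
    monophonic thresholds of RIGHT and LEFT differ by no more
    than margin_db dB in that band."""
    flags = []
    for tr, tl in zip(thr_right, thr_left):
        diff_db = abs(10.0 * math.log10(tr / tl))
        flags.append(diff_db <= margin_db)
    return flags
```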
The coding mode decision is adaptive in time since for the same
band it may differ for subsequent segments, and is also adaptive in
frequency since for the same segment, the coding mode for
subsequent bands may be different. An illustration of a coding
decision is given in FIG. 13. This illustration is valid for long
and also short segments.
At this point it is clear that since the window switching mechanism
involves only monophonic measures, the maximum number of PE
measures per segment is 10 (2 channels * [1 large window + 4 short
windows]). However, the maximum number of thresholds that we may
need to compute per segment is 20, and therefore 20 tonality
measures must always be updated per segment (4 channels * [1 large
window + 4 short windows]).
Bitrate Adjustment
It was previously said that the decisions for window switching and
for coding mode selection are orthogonal in the sense that they do
not depend on each other. Independent of these decisions is also
the final step of the coding process, which involves quantization,
Huffman coding and bitstream composing: i.e., there is no feedback
path. This fact has the advantage of reducing the whole coding
delay to a minimum value (1024/48000 = 21.3 milliseconds) and also
of avoiding instabilities due to unorthodox coding situations.
The quantization process affects both spectral coefficients and
scale factors. Spectral coefficients are clustered in bands, each
band having the same step size, or scale factor. Each step size is
directly computed from the masking threshold corresponding to its
band. The quantized values, which are integer numbers, are then
converted to variable-word-length Huffman codes. The total number
of bits to code the segment, considering additional fields of the
bitstream, is computed. Since the bitrate must be kept constant,
the quantization process must be done iteratively until that number
of bits is within predefined limits. After the number of bits
needed to code the whole segment under the basic masking threshold
is known, the degree of adjustment is dictated by a buffer control
unit. This control unit shares the deficit or credit of additional
bits among several segments, according to the needs of each one.
The technique of the bitrate adjustment routine is represented by
the flowchart of FIG. 9. It may be seen that after the total number
of available bits to be used by the current segment is computed, an
iterative procedure tries to find a factor α such that, if all the
initial thresholds are multiplied by this factor, the final total
number of bits is smaller than, and within an error δ of, the
available number of bits. Even if the approximation curve is so
hostile that α is not found within the maximum number of
iterations, one acceptable solution is always available.
The main steps of this routine are depicted in FIG. 7 and FIG. 9 as
follows. First, an interval including the solution is found. Then,
a loop seeks to rapidly converge to the best solution. At each
iteration, the best solution is updated. Thus, the total number of
bits to represent the present whole segment (710) using the basic
masking threshold is evaluated. Next, the total number of bits
available to be used by the current segment is computed based on
the current buffer status from the buffer control (703). A
comparison (903) is made between the total number of bits available
in the buffer and the calculated total number of bits to represent
the current whole segment. If the required number of bits is less
than the available number of bits in the buffer, a further
comparison is made to determine if the final total number of bits
required is within an error factor of the available number of bits
(904). If within the error factor, the total number of bits
required to represent the current whole segment are transmitted
(916) to the entropy encoder (208). If not within the error factor,
an evaluation is done based upon the number of bits required to
represent the whole segment at the absolute threshold values (905).
If the required number of bits to represent the whole segment at
the absolute threshold values are less than the total number of
bits available (906) they are transmitted (916) to the entropy
encoder (208).
If at this point, neither the basic masking threshold nor absolute
thresholds have provided an acceptable bit representation of the
whole segment, an iterative procedure (as shown in 907 through 915)
is employed to establish the interpolation factor used as a
multiplier and discussed previously. If successful, the iterative
procedure will establish a bit representation of the whole segment
which is within the buffer limit and associated error factor.
Otherwise, after reaching a maximum number of iterations (908) the
iterative process will return the last best approximation (915) of
the whole segment as output (916).
In order to use the same procedure for segments coded with large
and short windows, in this latter case the coefficients of the 4
short windows are clustered by concatenating homologous bands.
Scale factors are clustered in the same way.
The bitrate adjustment routine (704) calls another routine that
computes the total number of bits to represent all the Huffman
coded words (705) (coefficients and scale factors). This latter
routine does a spectrum partitioning according to the amplitude
distribution of the coefficients. The goal is to assign predefined
Huffman code books to sections of the spectrum. Each section groups
a variable number of bands and its coefficients are Huffman coded
with a convenient book. The limits of the section and the reference
of the code book must be sent to the decoder as side information.
The spectrum partitioning is done using a minimum cost strategy.
The main steps are as follows. First, all possible sections are
defined (the limit is one section per band), each one having the
code book that best matches the amplitude distribution of the
coefficients within that section. As the beginning and the end of
the whole spectrum are known, if K is the number of sections, there
are K−1 separators between sections. The price to eliminate each
separator is computed. The separator that has the lowest price is
eliminated (initial prices may be negative). Prices are compared
again before the next iteration. This process is repeated until a
maximum allowable number of sections is reached and the smallest
price to eliminate another separator is higher than a predefined
value.
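The minimum-cost sectioning reads as a greedy separator
elimination; a hedged Python sketch follows, with a stand-in
section_bits cost function supplied by the caller (a real coder
would count Huffman bits for the best code book over a section).

```python
def merge_sections(band_values, section_bits, max_sections, min_price=0.0):
    """Greedily remove the separator with the lowest price (bits of the
    merged section minus bits of its two halves) until the section
    count is within max_sections and no removal is cheaper than
    min_price."""
    # start with one section per band (the stated limit)
    sections = [[v] for v in band_values]
    while len(sections) > 1:
        prices = [section_bits(sections[k] + sections[k + 1])
                  - section_bits(sections[k]) - section_bits(sections[k + 1])
                  for k in range(len(sections) - 1)]
        k = min(range(len(prices)), key=prices.__getitem__)
        if len(sections) <= max_sections and prices[k] > min_price:
            break  # few enough sections and no cheap merge remains
        sections[k:k + 2] = [sections[k] + sections[k + 1]]
    return sections
```

With a cost of the form "header plus payload", merging always saves
the per-section header, so sections coalesce until the header saving
no longer outweighs the mismatch of a shared code book.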
Aspects of the processing accomplished by quantizer/rate-loop 206
in FIG. 2 will now be presented. In the prior art, rate-loop
mechanisms have contained assumptions related to the monophonic
case. With the shift from monophonic to stereophonic perceptual
coders, the demands placed upon the rate-loop are increased.
The inputs to quantizer/rate-loop 206 in FIG. 2 comprise spectral
coefficients (i.e., the MDCT coefficients) derived by analysis
filter bank 202, and outputs of perceptual model 204, including
calculated thresholds corresponding to the spectral
coefficients.
Quantizer/rate-loop 206 quantizes the spectral information based,
in part, on the calculated thresholds and the absolute thresholds
of hearing and in doing so provides a bitstream to entropy encoder
208. The bitstream includes signals divided into three parts: (1) a
first part containing the standardized side information; (2) a
second part containing the scaling factors for the 35 or 56 bands
and additional side information used for so-called adaptive window
switching, when used (the length of this part can vary depending on
information in the first part); and (3) a third part comprising the
quantized spectral coefficients.
A "utilized scale factor", .DELTA., is iteratively derived by
interpolating between a calculated scale factor and a scale factor
derived from the absolute threshold of hearing at the frequency
corresponding to the frequency of the respective spectral
coefficient to be quantized until the quantized spectral
coefficients can be encoded within permissible limits.
An illustrative embodiment of the present invention can be seen in
FIG. 13. As shown at 1301, quantizer/rate-loop receives a spectral
coefficient, C_y, and an energy threshold, E, corresponding to that
spectral coefficient. A "threshold scale factor", Δ_o, is
calculated by
Δ_o = (12E)^0.5
so that the quantization noise power of a uniform quantizer with
step size Δ_o, namely Δ_o^2/12, equals the energy threshold E. An
"absolute scale factor", Δ_A, is also calculated based upon the
absolute threshold of hearing (i.e., the quietest sound that can be
heard at the frequency corresponding to the scale factor).
Advantageously, an interpolation constant, α, and interpolation
bounds α_high and α_low are initialized to aid in the adjustment of
the utilized scale factor:
α_high = 1, α_low = 0, α = α_high
Next, as shown in 1305, the utilized scale factor is determined
from:
Δ = Δ_o^α · Δ_A^(1−α)
Next, as shown in 1307, the utilized scale factor is itself
quantized, because the utilized scale factor as computed above is
not discrete but is advantageously discrete when transmitted and
used:
Δ = Q^−1(Q(Δ))
Next, as shown in 1309, the spectral coefficient is quantized using
the utilized scale factor to create a "quantized spectral
coefficient", Q(C_y, Δ):
Q(C_y, Δ) = NINT(C_y / Δ)
where "NINT" is the nearest integer function. Because
quantizer/rate-loop 206 must transmit both the quantized spectral
coefficient and the utilized scale factor, a cost, C, is calculated
which is associated with how many bits it will take to transmit
them both. As shown at 1311, C = FOO(Q(C_y, Δ), Q(Δ)), where FOO is
a function which, depending on the specific embodiment, can be
easily determined by persons having ordinary skill in the art of
data communications. As shown in 1313, the cost, C, is tested to
determine whether it is in a permissible range PR. When the cost is
within the permissible range, Q(C_y, Δ) and Q(Δ) are transmitted to
entropy coder 208.
Advantageously, and depending on the relationship of the cost C to
the permissible range PR, the interpolation constant and bounds are
adjusted until the utilized scale factor yields a quantized
spectral coefficient which has a cost within the permissible range.
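The rate loop just described can be sketched as a binary search on
the interpolation constant α. The bit-count function here is a toy
stand-in for FOO, and the assumption that Δ_A ≥ Δ_o (so decreasing
α coarsens the quantizer) is mine, made so the sketch converges.

```python
def rate_loop(coeffs, delta_o, delta_a, bit_budget, tol, max_iter=32):
    """Binary search on alpha between the threshold scale factor
    delta_o and the absolute scale factor delta_a (FIG. 13 sketch)."""
    def quantize(delta):
        return [round(c / delta) for c in coeffs]  # NINT(C_y / delta)

    def cost(q):
        # toy stand-in for FOO: a sign bit plus magnitude bits per value
        return sum(1 + abs(v).bit_length() for v in q)

    a_high, a_low = 1.0, 0.0
    alpha = a_high
    best = None
    for _ in range(max_iter):
        delta = delta_o ** alpha * delta_a ** (1.0 - alpha)
        q = quantize(delta)
        c = cost(q)
        if best is None or abs(c - bit_budget) < abs(best[2] - bit_budget):
            best = (q, delta, c)      # keep the best approximation so far
        if bit_budget - tol <= c <= bit_budget:
            return q, delta, c        # within the permissible range
        if c > bit_budget:
            a_high = alpha            # too many bits: move alpha down
        else:
            a_low = alpha             # spare bits: move alpha up
        alpha = (a_high + a_low) / 2.0
    return best                       # last best approximation (cf. 915)
```

If the permissible range is never reached within max_iter
iterations, the loop still returns its best approximation, matching
the behavior described for the flowchart.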
Illustratively, as shown in FIG. 13 at 1313, the interpolation
bounds are manipulated to produce a binary search. Specifically,
when C > PR, α_high = α; alternately, when C < PR, α_low = α. In
either case, a new interpolation constant is calculated by:
α = (α_high + α_low) / 2
The process then continues at 1305 iteratively until the cost C
comes within the permissible range PR.
STEREOPHONIC DECODER
The stereophonic decoder has a very simple structure, as shown in
FIG. 12. Its main functions are reading the incoming bitstream
(1202), decoding all the data (1203), and inverse quantization and
reconstruction of the RIGHT and LEFT channels (1204). Thus, the
decoder performs operations complementary to those of the encoder
depicted in FIG. 7, such as operations that are complementary to
quantization (704) and Huffman coding (705).
Illustrative embodiments may comprise digital signal processor
(DSP) hardware, such as the AT&T DSP16 or DSP32C, and software
performing the operations discussed below of the present invention.
Very large scale integration (VLSI) hardware embodiments of the
present invention, as well as hybrid DSP/VLSI embodiments, may also
be provided. For example, an AT&T DSP16 may be employed to
perform the operations of the rate loop processor depicted in FIG.
13. The DSP could receive the spectral coefficients and energy
thresholds (1301) and perform the calculation of blocks 1303 and
1305 as described on page 31. Further, the DSP could calculate the
utilized scale factor according to the equation given on page 32
and depicted in block 1305. The quantization blocks 1307 and 1308
can be carried out as described on page 32. Finally, the DSP may
perform the cost calculation (1311) and comparison (1313)
associated with quantization. The cost calculation is described on
page 32 and illustrated further in FIG. 9. In this way, the
interpolation factor may be adjusted (1315) according to the
analysis carried out within the DSP or similar type hardware
embodiments. It is to be understood that the above-described
embodiments are merely illustrative of the principles of this
invention. Other arrangements may be derived by those skilled in
the art without departing from the spirit and scope of the
invention.
* * * * *