Home
Patent Search
IMT Blog
REGISTER
|
SIGN IN
United States Patent
5974503
Venkatesh , ; et al.
October 26, 1999
Title
Storage and access of continuous media files indexed as lists of raid stripe sets associated with file names
Abstract
A continuous media file is comprised of stripe sets over disk drives in one or more RAID sets. In a preferred embodiment, the RAID set includes n disk drives. The data storage of each disk drive in the RAID set is partitioned into an integer number m of hyper-volumes, and the parity is stored in one hyper-volume of each of m disk drives in the RAID set. The stripe set includes a series of transfer units of data in respective ones of the disk drives. Each transfer unit includes an integer number j of data blocks, and each hyper-volume includes an integer number k of transfer units. Each stripe set includes (m)(n-1) transfer units of data. The transfer units of the RAID set are allocated for the storage of continuous media data in a right-to-left and then top-to-bottom order in which the transfer units appear in an m row by n column matrix in which the rows of the matrix represent parity groups of hyper-volumes in the disk drives and the columns of the matrix represent storage in the respective disk drives. At most one write access to each parity hyper-volume need be performed during write access to a stripe set. Parity changes for the data being written are accumulated in non-volatile memory, and written to the RAID set after completion of the writing of the data.
Inventors:
Venkatesh; Dinesh
(Brookline,
MA
)
, Forecast; John
(Newton,
MA
)
, Duso; Wayne W.
(Shrewsbury,
MA
)
Assignee:
EMC Corporation
(Hopkinton,
MA
)
Appl. No.:
851509
Filed:
May 5, 1997
Current U.S. Class:
711/114
714/6
714/7
Field of Search:
711/114 371/51.1,40.4 395/182.04,182.05
U.S. Patent Documents
4608688
August 1986
Hansen et al.
4755928
July 1988
Johnson et al.
4761785
August 1988
Clark et al.
5072378
December 1991
Manka
5148432
September 1992
Gordon et al.
5175837
December 1992
Arnold et al.
5206939
April 1993
Yani et al.
5208665
May 1993
McCalley et al.
5208813
May 1993
Stallmo et al.
5218695
June 1993
Noveck et al.
5235601
August 1993
Stallmo et al.
5255270
October 1993
Yanai et al.
5263145
November 1993
Brady et al.
5269011
December 1993
Yanai et al.
5274799
December 1993
Brant et al.
5276860
January 1994
Fortier et al.
5276867
January 1994
Kenley et al.
5335352
August 1994
Yanai et al.
5341493
August 1994
Yanai et al.
5367698
November 1994
Webber et al.
5371532
December 1994
Gelman et al.
5381539
January 1995
Yanai et al.
5408465
April 1995
Gusella et al.
5410343
April 1995
Coddington et al.
5414455
May 1995
Hooper et al.
5442390
August 1995
Hooper et al.
5442749
August 1995
Northcutt et al.
5442771
August 1995
Filepp et al.
5477263
December 1995
O'Callaghan et al.
5508733
April 1996
Kassatly
5519435
May 1996
Anderson
5528282
June 1996
Voeten et al.
5528513
June 1996
Vaitzblit et al.
5530557
June 1996
Asit et al.
5533021
July 1996
Branstad et al.
5534912
July 1996
Kostreski
5537408
July 1996
Branstad et al.
5537534
July 1996
Voigt et al.
5539660
July 1996
Blair et al.
5544313
August 1996
Shachnai et al.
5544327
August 1996
Dan et al.
5544345
August 1996
Carpenter et al.
5544347
August 1996
Yanai et al.
5550577
August 1996
Verbiest et al.
5550982
August 1996
Long et al.
5553005
September 1996
Voeten et al.
5557317
September 1996
Nishio et al.
5559949
September 1996
Reimer et al.
5572660
November 1996
Jones
5574662
November 1996
Windrem et al.
5579475
November 1996
Blaum et al.
5583561
December 1996
Baker et al.
5586264
December 1996
Belknap et al.
5594910
January 1997
Filepp et al.
5603058
February 1997
Belknap et al.
5606359
February 1997
Youden et al.
5625405
April 1997
DuLac et al.
5633810
May 1997
Mandal et al.
5657439
August 1997
Jones et al.
5812753
September 1998
Chiariotti
Foreign Patent Documents
0 061 570 A3
Oct., 1982
EP
0 683 464 A2
Nov., 1995
EP
0 697 660 A1
Feb., 1996
EP
633694 A1
Jan., 1995
EP
WO 93/16557
Aug., 1993
WO
WO 94/00816
Jan., 1994
WO
WO 95/10918
Apr., 1995
WO
WO 97/16023
May., 1997
WO
Other References
Dishon et al., "A Highly Available Storage System Using the Checksum Method," Seventeenth International Symposium on Fault-Tolerant Computing, Jul. 6-8, 1987, Pittsburg, Penn., IEEE Computer Society, IEEE, New York, N.Y., pp. 178-181. .
Martin E. Schulze, "Considerations in the Design of a RAID Prototype," Computer Science Division, EECS, University of California at Berkeley, CA, Aug. 25, 1988. .
Katz et al., "Disk System Architectures for High Performance Computing," Computer Science Division, EECS, University of California at Berkeley, CA, Mar. 1989. .
Gray et al., "Parity Striping of Disc Arrays: Low-Cost Reliable Storage with Acceptable Throughput," Technical Report 90.2, Tandem Computers, Cupertino, CA., Jan. 1990. .
Edward K. Lee, "Software and Performance Issues in the Implementation of a RAID Prototype," Computer Science Division, EECS, University of California at Berkeley, CA, Mar. 1989. .
Sincoskie, WD, "System architecture for a large scale video on demand service," Computer Networks and ISDN Systems, vol. 22, No. 2, Sep. 1991, pp. 155-162. .
Tobagi FA, Pang J, "StarWorks (Trademark)--A video applications server," Proceedings, IEEE COMPCON 93, San Francisco, Calif., 1993, pp. 4-11. .
Vaitzblit L, "The design and implementation of a high bandwidth file service for continuous media," Master's Thesis, Massachusetts Institute of Technology, Cambridge, Mass., Nov. 4, 1991. .
Patterson et al., "A Case for Redundant Arrays Of Inexpensive Disks (RAID)," Report No. UCB/CSD 87/391, Computer Science Division (EECS), University of California, Berkeley, California, Dec. 1987, pp. 1-24. .
Patterson et al., "Introduction to Redundant Arrays of Inexpensive Disks (RAID)," COMPCON 89 Proceedings, Feb. 27-Mar. 3, 1989, IEEE Computer Society, pp. 112-117. .
Ousterhout et al., "Beating the I/O Bottleneck: A Case for Log-Structured File Systems," Operating Systems Review, vol. 23, No. 1, ACM Press, Jan., 1989, pp. 11-28. .
Douglis et al., "Log Structured File Systems," COMPCON 89 Proceedings, Feb. 27-Mar. 3, 1989, IEEE Computer Society, pp. 124-129. .
Rosemblum et al., "The Design and Implementation of a Log-Structured File System," ACM Transactions on Computer Systems, vol. 1, Feb. 1992, pp. 26-52. .
M(aurice) William Collins, "A Network File Storage System," IEEE Seventh Symposium on Mass Storage Systems, Nov. 4-7, 1985, Tuscon, AZ, pp. 1-11, Los Alamos Nat. Lab. No. LA-UR-85-3183. .
John H. Howard, "An Overview of the Andrew File System," USENIX Winter Conference, Feb. 9-12, 1988, Dallas, TX, p. 23-26. .
John H. Howard et al., "Scale and Performance in a Distributed File System," ACM Transactions on Computer Systems, vol. 6, No. 1, Feb. 1988, p. 51-81. .
David C. Steere et al., "Efficient User-Level File Cache Management on the Sun Vnode Interface," USENIX Summer Conference, Jun. 11-15, 1990, Anaheim, CA, p. 325-331. .
Matt Blaze et al., "Long-Term Caching Strategies for Very Large Distributed File Systems," USENIX, Summer '91, Nashville, TN, p. 3-15. .
Thomas W. Page, Jr., et al., "Management of Replicated Volume Location Data in the Ficus Replicated File System," USENIX, Summer '91, Nashville, TN, p. 17-29. .
Storage Computer Corporation, "High Performance, Fault Tolerant Disk Array Platform For File Servers And Computer Systems," 1991, Nashua, NH 12 pages. .
France Telecom, "Telesauvegarde," 26 page paper dated Nov. 2, 1994 about work based on France Telecom patent application "Dispositif et Proceed de Sauvegaude a Distance de Donnees Numeriques" [Method and Apparatus for Safeguarding at a Distance Numeric Data], Institute National de la Propiete Industrielle, France, Appln. No. 93.12771, Oct. 28, 1994 (filed Oct. 21, 1993), and partial English translation (15 pages). .
Krishnan Natarajan, "Video Servers Take Root," IEEE Spectrum, Apr. 1995, IEEE, New York, NY, pp. 66-69. .
K. K. Ramakrishnan et al., "Operating system support for a video-on-demand file service," Multimedia Systems (1995) 3:53-65. .
Ralf Steinmetz, "Analyzing the Multimedia Operating System," IEEE MultiMedia, Spring 1995, pp. 68-84. .
Audrey Chou, "EMC, Computer-Storage Leader, Still Hears Footsteps," The Wall Street Journal, Aug. 9, 1995, Dow Jones & Co., Princeton, N.J. .
Pardhu Vadlamudi, "EMC Hunts for Solution to Bottlenecks," InfoWorld, Apr. 15, 1996, #1590, San Mateo, CA 94402. .
Michael Goldberg, "EMC to Pump Data Over Networks," Computerworld, Apr. 15, 1996. .
"EMC Moves Into Servers," Broadcasting Cable, Apr. 15, 1996. .
"Symmetrix Model 55XX Product Manual, P/N 200-810-550 Rev D," EMC Corporation, Hopkinton, Mass., May 1994, pp. 1-236. .
"NFS: Network File System Protocol Specification," RFC 1094, Sun Microsystems, Inc., Mar. 1989, pp. 1-27. .
J. Case, M. Fedor, M. Schoffstall, J. Davin, "A Simple Network Management Protocol (SNMP)," May 1990, MIT Laboratory for Computer Science, Cambridge Mass., pp. 1-35. .
Rangen PV, Vin HM, "Designing file systems for digital video and audio," Proceedings of the 13th ACM Symposium on Operating Systems Principles, Monterey, Calif., 1991, pp. 81-94. .
Vin HM, Rangan PV, (1993) "Designing a multiuser HDTV storage server," IEEE Journal on Selected Areas in Communications, vol. 11, No. 1, Jan. 1993, pp. 153-164. .
Anderson DP, Osawa Y, Govindan R, "A file system for continuous media," ACM Transactions on Computer Systems, vol. 10., No. 4, Nov. 1992, pp. 311-337. .
Federighi C, "A Distributed Hierarchical Storage Manager for a Video-on-Demand System," Department of Electrical Engr. and Computer Science, University of California, Berkeley, California, Dec. 1993. .
Haskins R, "The shark continuous-media file server," Proceedings IEEE COMPCOM 93, San Francisco, Calif., 1993, pp. 12-15. .
Little TD, Rhanger G, Folz RJ, Gibbon JF, Reeve FW, Schelleng DH, Venkatesh D, "A digital on-demand video service supporting content based queries," Proceedings of ACM Multimedia 93, Anaheim, Calif., Aug. 1-6, 1993, pp. 427-436. .
Lougher, P. Sheperd, D. "The design of a storage server for continuous media," The Computer Journal, vol. 36, No. 1, 1993, pp. 32-42. .
Rangan PV, Vin HM, Ramanathan S, "Designing an on-demand multimedia service," IEEE Communications Magazine, vol. 30, No. 7, Jul. 1992, pp. 56-64..~
Primary Examiner:
Cabeca; John W.
Assistant Examiner:
Moazzami; Nasser
Attorney, Agent or Firm:
Arnold White & Durkee
Parent Case Text
RELATED APPLICATIONS
The present application is a continuation of provisional application Serial No. 60/044,948 filed Apr. 25, 1997 by Dinesh Venkatesh, Wayne W. Duso, John Forecast, Uday Gupta, Uresh K. Vahalia, and Dennis P. J. Ting, entitled "Raid Striping, Client-Server Protocols, and Failover Services for a Video File Server."
Claims
What is claimed is:
1. A method of striping a sequence of continuous media data across a plurality of n disk drives in a RAID set storing data and associated parity across the disk drives, wherein the data storage of each disk drive in the RAID set is partitioned into an integer number m of hyper-volumes, parity is distributed among the disk drives in the RAID set and is stored in at least one hyper-volume of each of m disk drives in the RAID set, and transfer units of data storage in the n disk drives are associated with the sequence of continuous media data in a right-to-left and then top-to-bottom order in which the transfer units appear in an m row by n column matrix in which the rows of the matrix represent parity groups of hyper-volumes in the disk drives and the columns of the matrix represent the data storage in the respective n disk drives in the RAID set, which further includes allocating data storage in the RAID set for the stream of continuous media data in fixed size stripe sets, each stripe set comprising a segment of the sequence of continuous media data distributed across each of the disk drives and each of the parity groups of hyper-volumes in the disk drives,
wherein the transfer unit has a predetermined size, and each stripe set includes (m)(n-1) transfer units of data.
2. A method of striping a sequence of continuous media data across a plurality of n disk drives in a RAID set storing data and associated parity across the disk drives, wherein the data storage of each disk drive in the RAID set is partitioned into an integer number m of hyper-volumes, parity is distributed among the disk drives in the RAID set and is stored in at least one hyper-volume of each of m disk drives in the RAID set, and transfer units of data storage in the n disk drives are associated with the sequence of continuous media data in a right-to-left and then top-to-bottom order in which the transfer units appear in an m row by n column matrix in which the rows of the matrix represent parity groups of hyper-volumes in the disk drives and the columns of the matrix represent the data storage in the respective n disk drives in the RAID set, which further includes allocating data storage in the RAID set for the stream of continuous media data in fixed size stripe sets, each stripe set comprising a segment of the sequence of continuous media data distributed across each of the disk drives and each of the parity groups of hyper-volumes in the disk drives, and which includes associating a name of a continuous media data file with a list of the stripe sets.
3. The method as claimed in claim 1, which includes computing parity for each of the parity groups by computing parity changes for different ones of the disk drives due to storing of the sequence of continuous media data in the disk drives, reading the parity once from each of the parity groups, applying the parity changes for the different ones of the disk drives to the parity read once for each of the parity groups to produce modified parity, and writing the modified parity back to said each of the parity groups.
4. A method of accessing a continuous media data file including continuous media data stored in stripe sets in a RAID set of disk drives, said method comprising the steps of:
a) accessing a directory of file names of continuous media data files to locate a list of the stripe sets associated with a name of the continuous media data file;
b) accessing the list of stripe sets to determine a sequence of the stripe sets storing the continuous media data; and
c) accessing the RAID set of disk drives to retrieve the sequence of stripe sets.
5. The method as claimed in claim 4, wherein the list of stripe sets includes at least one entry specifying a starting stripe set number and an ending stripe set number, and said method includes accessing the RAID set of disk drives to retrieve a stripe set identified by the starting stripe set number, to retrieve stripe sets identified by stripe set numbers between the starting stripe set number and the ending stripe set number, and to retrieve a stripe set identified by the ending stripe set number.
6. The method as claimed in claim 4, wherein the stripe set is comprised of a series of transfer units in disk drives of the RAID set, and said method includes the step of sequentially advancing an index and accessing a look-up table with the index in order to determine a disk drive in the RAID set containing a next transfer unit to be accessed.
7. A data storage system comprising:
a plurality of n disk drives arranged in a RAID set; and
a controller coupled to the disk drives for accessing a sequence of continuous media data stored in the disk drives;
wherein the sequence of continuous media data and associated parity is striped across the disk drives in the RAID set, the data storage of each disk drive in the RAID set is partitioned into an integer number m of hyper-volumes, parity is distributed among the disk drives in the RAID set and is contained in at least one hyper-volume of each of the m disk drives in the RAID set, and transfer units of data storage in the n disk drives are associated with the sequence of continuous media data in a right-to-left and then top-to-bottom order in which the transfer units appear in an m row by n column matrix in which the rows of the matrix represent parity groups of hyper-volumes in the disk drives and the columns of the matrix represent the data storage in the respective n disk drives in the RAID set,
wherein data storage in the RAID set for the stream of continuous media data is allocated in fixed size stripe sets, each stripe set comprising a segment of the sequence of continuous media data distributed across each of the disk drives and each of the parity groups of hyper-volumes in the disk drives, and wherein the transfer unit has a predetermined size, and each stripe set includes (m)(n-1) transfer units of data.
8. A data storage system comprising:
a plurality of n disk drives arranged in a RAID set; and
a controller coupled to the disk drives for accessing a sequence of continuous media data stored in the disk drives;
wherein the sequence of continuous media data and associated parity is striped across the disk drives in the RAID set, the data storage of each disk drive in the RAID set is partitioned into an integer number m of hyper-volumes, parity is distributed among the disk drives in the RAID set and is contained in at least one hyper-volume of each of the m disk drives in the RAID set, and transfer units of data storage in the n disk drives are associated with the sequence of continuous media data in a right-to-left and then top-to-bottom order in which the transfer units appear in an m row by n column matrix in which the rows of the matrix represent parity groups of hyper-volumes in the disk drives and the columns of the matrix represent the data storage in the respective n disk drives in the RAID set, wherein data storage in the RAID set for the stream of continuous media data is allocated in fixed size stripe sets, each stripe set comprising a segment of the sequence of continuous media data distributed across each of the disk drives and each of the parity groups of hyper-volumes in the disk drives, and wherein the controller has a directory associating a name of a continuous media data file with a list of the stripe sets.
9. The data storage system as claimed in claim 7, wherein the controller is programmed to compute parity for each of the parity groups by computing parity changes for different ones of the disk drives due to storing of the sequence of continuous media data in the disk drives, to read the parity once from each of the parity groups, to apply the parity changes for the different ones of the disk drives to the parity read once for each of the parity groups to produce modified parity, and to write the modified parity back to said each of the parity groups.
10. A data storage system comprising:
a plurality of disk drives arranged in a RAID set; and
a controller coupled to the disk drives for accessing a sequence of continuous media data stored in stripe sets in a RAID set of disk drives, the controller having a directory of continuous media data files and lists of the stripe sets associated with each of the continuous media data files, the controller being programmed to respond to a request to access a specified continuous media data file by accessing the directory to find a list of stripe sets associated with the specified continuous media data file, to access the list of stripe sets to determine a sequence of the stripe sets storing the continuous media data, and to access the RAID set of disk drives to retrieve the sequence of stripe sets.
11. A data storage system comprising:
a plurality of disk drives arranged in a RAID set; and
a controller coupled to the disk drives for accessing a sequence of continuous media data stored in stripe sets in a RAID set of disk drives, the controller having a directory of continuous media data files and lists of the stripe sets associated with each of the continuous media data files, the controller being programmed to respond to a request to access a specified continuous media data file by accessing the directory to find a list of stripe sets associated with the specified continuous media data file, to access the list of stripe sets to determine a sequence of the stripe sets storing the continuous media data, and to access the RAID set of disk drives to retrieve the sequence of stripe sets, wherein the list of stripe sets includes at least one entry specifying a starting stripe set number and an ending stripe set number, and the controller is programmed to access the RAID set of disk drives to retrieve a stripe set identified by the starting stripe set number, to retrieve stripe sets identified by stripe set numbers between the starting stripe set number and the ending stripe set number, and to retrieve a stripe set identified by the ending stripe set number.
12. A data storage system comprising:
a plurality of disk drives arranged in a RAID set; and
a controller coupled to the disk drives for accessing a sequence of continuous media data stored in stripe sets in a RAID set of disk drives, the controller having a directory of continuous media data files and lists of the stripe sets associated with each of the continuous media data files, the controller being programmed to respond to a request to access a specified continuous media data file by accessing the directory to find a list of stripe sets associated with the specified continuous media data file, to access the list of stripe sets to determine a sequence of the stripe sets storing the continuous media data, and to access the RAID set of disk drives to retrieve the sequence of stripe sets, and wherein the stripe set is comprised of a series of transfer units in disk drives of the RAID set, and the controller is programmed to sequentially advance an index and access a look-up table with the index in order to determine a disk drive in the RAID set containing a next transfer unit to be accessed.
Description
AUTHORIZATION PURSUANT TO 37 C.F.R .sctn. 1.17(E)
A portion of the disclosure of this patent document contains command formats and other computer language listings all of which are subject to copyright protection. The copyright owner, EMC Corporation, has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure, as it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever.
BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention relates generally to a data storage system employing a redundant array of disk drives (RAID), and more particularly to a continuous media or video file server employing RAID techniques.
2. Background Art
In a data storage system employing RAID techniques, a set of disk drives are configured into a RAID set storing data and parity computed across the disk drives in the RAID set. Therefore, if one of the disk drives in the RAID set fails, data in the failed disk drive can be reconstructed from the data and parity information in the other disk drives in the RAID set.
In a file server employing RAID techniques, it is conventional to employ a file directory for mapping a file name to a list of data blocks stored in respective ones of the disk drives in the RAID set. The data blocks are typically of a fixed size, such as 512 bytes, that are conveniently handled by the buffers and control circuitry provided by commodity 5 and 1/4 inch or 3 and 1/2 inch disk drives. A write access to a data block involves reading old data from an addressed disk drive, reading old parity corresponding to the old data from another disk drive in the raid set, computing a change in parity by an exclusive-OR of the old data with the new data to be written, computing the new parity by an exclusive-OR of the change in parity with the old parity, and then, in a single transaction, writing the new data and the new parity back to the respective disk drives.
It is desirable to use a data storage system employing RAID techniques for storing continuous media such as digital video. This could be done by using the conventional RAID techniques described above for a file server. However, this has been found to cause computational inefficiencies and load balancing and data availability problems under normal operation and under a failure mode of operation when there has been a failure of one of the disk drives in the RAID set.
SUMMARY OF THE INVENTION
The computational inefficiencies, data availability problems, and load balancing problems associated with storing continuous media in a conventional RAID storage system result from the relatively large size, relatively high access rate, and non-uniform access frequency of continuous media files. Because continuous media files are typically accessed in a sequential or isochronous fashion, it is desirable to access a relatively large transfer unit of continuous media data during each disk access due to the fixed overhead of managing each disk access. However, it is very likely that one continuous media file, such as a file for a popular movie, could exceed the capacity of a single disk drive, and could have an access frequency exceeding the combined access frequency of all other files stored in the RAID set. Moreover, in video-on-demand applications, multiple users or clients could be simultaneously accessing different portions of the same continuous media file. Therefore, for high data availability, it is desirable to distribute the continuous media data for each file over all of the disk drives in each RAID set, and for load balancing, it is also desirable to distribute the parity information for each continuous media file over a multiplicity of the disk drives.
It is also desirable to minimize the overhead for indexing the data blocks in each continuous media file but at the same time provide an efficient mechanism for editing of the continuous media file. A continuous media file, such as a movie, could be edited to satisfy local law or custom regarding indecent or profane subject matter, for time compression, or to employ more advanced digital encoding techniques. Such editing is likely to either reduce or enlarge the amount of data in the file. It is desirable to permit immediate playback after such editing yet provide a mechanism for compaction and recovery of storage when the amount of data in the file has been reduced.
To solve these problems, the storage system uses a relatively large transfer unit including multiple data blocks when accessing a continuous media file, yet distributes the data of each continuous media file over all of the disk drives in the RAID set. Moreover, to provide load balancing, the parity associated with the data of each continuous media file is distributed across a multiplicity of disk drives in the RAID set, and preferably is distributed across all of the disk drives in the RAID set which store parity information.
In a preferred embodiment, storage in the RAID set is subdivided into a multiplicity of predefined stripe sets, and a respective number of the stripe sets are allocated to each continuous media file. Contiguous data in each stripe set is grouped into transfer blocks of multiple data blocks stored in respective individual disk drives, and each transfer block is associated with parity stored in a single one of the disk drives in the RAID set. For each disk drive in the RAID set that stores parity, the stripe set includes a transfer unit of data that is from each other disk drive in the RAID set and that is associated with that parity. Moreover, to simplify parity computation, the data in each stripe set is contiguous over the set of transfer blocks associated with the parity in each disk drive of the RAID set that includes parity.
In a specific embodiment, the RAID set includes n disk drives. The data storage of each disk drive in the RAID set is partitioned into an integer number m of hyper-volumes, and the parity is stored in one hyper-volume of each of m disk drives in the RAID set. The transfer unit includes an integer number j of data blocks, and each hyper-volume includes an integer number k of transfer units. Each stripe set includes (m)(n-1) transfer units of data. The transfer units of the RAID set are allocated for the storage of continuous media data in a right-to-left and then top-to-bottom order in which the transfer units appear in an m row by n column matrix in which the rows of the matrix represent parity groups of hyper-volumes in the disk drives and the columns of the matrix represent the storage in the respective disk drives. For example, the RAID set includes eight disk drives, each having a capacity of 4 gigabytes. Each disk drive is partitioned into 4 hyper-volumes. The block size is 512 bytes, and the transfer unit is 256 blocks; i.e., 128 K bytes.
Because parity groups are formed of contiguous transfer units in each stripe set, at most one write access to each parity hyper-volume need be performed during write access to a stripe set. For example, for each parity group, old data from a first transfer unit is read from disk storage and exclusive-ORed with new data for the first transfer unit to produce a transfer unit of parity changes. The parity changes are stored in non-volatile memory, and then the new data for the first transfer unit is written to disk storage. Old data from a next transfer unit in the parity group is read from disk storage, exclusive-ORed with new data from the next transfer unit to produce a transfer unit of parity changes, the parity changes are exclusive-ORed with the parity changes in non-volatile memory and the result is stored back in non-volatile memory, and then the new data from the next transfer unit is stored to disk memory. The transfer unit of parity for the parity group is read from disk storage. Once all of the new data for the parity group has been written to disk storage, the parity read from the parity group is exclusive-ORed with the parity changes in the non-volatile storage to compute the new parity, the new parity is written to non-volatile storage, then the new parity is written to disk storage, and finally the non-volatile memory is deallocated.
In the preferred Embodiment, each stripe set is assigned an identification number from which can be computed in sequence the starting block addresses of the transfer units in the stripe set. The continuous media data is logically organized as individually named files called "clips". A clip directory associates a clip identification number with a list of allocated stripe sets.
The stripe set list associated with each clip, for example, includes a doubly-linked list of entries, and each entry includes a starting stripe set number, an ending stripe set number, and a value indicating the number of data blocks included in the terminal stripe set. Therefore, each entry in the list represents in sequence data blocks beginning in the initial stripe set, continuing in any intermediate stripe set, and ending in the terminal stripe set, and including in the terminal stripe set only the indicated number of data blocks. The stripe set list for each clip can therefore easily be edited by linking and unlinking entries. When editing of the clip results in a number of stripe sets that are partially empty, compaction can be performed as a background operation by copying data to newly allocated stripe sets, unlinking the entries pointing to the old stripe sets, linking to new entries pointing to the newly allocated stripe sets, and deallocating the old stripe sets. Stripe sets are allocated by removing them from a stripe set free list, and de-allocated by returning them to the stripe set free list. Each entry in the stripe set free list, for example, includes a starting stripe set number and an ending stripe set number, to indicate a range of unallocated stripe sets.
BRIEF DESCRIPTION OF THE DRAWINGS
Other objects and advantages of the invention will become apparent upon reading the following detailed description with reference to the accompanying drawings wherein:
FIG. 1 is a perspective view of a video file server that incorporates the present invention;
FIG. 2 is a block diagram of the video file server of FIG. 1 and its connections to a network;
FIG. 3 is a block diagram of a cached disk array storage system used in the video file server of FIG. 1;
FIG. 4 is a block diagram showing software structure in the video file server of FIG. 1;
FIG. 5 is a more detailed block diagram showing various modules of the software structure of FIG. 4;
FIG. 6 is a specific example of software modules of FIG. 4 that can be used in an interactive video application to provide transaction processing and continuous media file access;
FIG. 7 is a schematic diagram illustrating scheduling operations by a kernel software module of the video file server of FIG. 1;
FIG. 8 is a timing diagram showing the accommodation of non real-time requests by the kernel software module of the video file server of FIG. 1;
FIG. 9 is a schematic diagram illustrating data flow in the video file server of FIG. 1 from the disk array to a network client;
FIG. 10 is a flowchart of a prefetch task of a stream server in the video file server of FIG. 1;
FIG. 11 is a flowchart of a video prefetch procedure of the cached disk array in the video file server of FIG. 1;
FIG. 12 is a flowchart of a video fetch procedure of the cached disk array in the video file server of FIG. 1;
FIG. 13 is a schematic diagram similar to FIG. 9 but showing how a second stream server in the video file server can access data having been prefetched from the disk array for a first stream server of the video file server;
FIG. 14 is a first part of a flowchart of a subroutine for determining whether sufficient cache or disk resources are presently available in the cache disk array for supporting a requested video stream, and if so, determining whether more than a minimum amount of cache memory should be allocated to support the requested video stream;
FIG. 15 is a second part of the flowchart begun in FIG. 14;
FIG. 16 is a schematic diagram showing "movie-on-demand" service to numerous network clients simultaneously viewing different portions of a movie;
FIG. 17 is a flowchart of a routine for servicing requests from network clients for "movie-on-demand" service in accordance with the schematic diagram in FIG. 16;
FIG. 18 is a flowchart of steps that could be added to the routine of FIG. 17 to dynamically allocate RAM windows of the stream servers of FIG. 2 in anticipation of client requests for "movie-on-demand" service;
FIG. 19 is a schematic diagram illustrating data flow in the video file server of FIG. 1 during "on-line" tape backup operations;
FIG. 20 is a block diagram showing a distribution of software used in the video file server of FIG. 1 for the "on-line" tape backup operations of FIG. 19;
FIG. 21 is a memory map of a RAID set including eight disk drives;
FIG. 22 is a portion of a transfer unit mapping table that gives the disk drive number and hyper-volume number of each transfer unit in a stripe set;
FIG. 23 is a first sheet of a flowchart of a procedure for providing write access to a stripe set;
FIG. 24 is a second sheet of the flowchart begun in FIG. 23;
FIG. 25 is a block diagram of a clip directory and information associated with each clip including a list of stripe sets comprising the clip;
FIG. 26 is a diagram showing a free stripe set list;
FIG. 27 is a block diagram of a client directory and information associated with each active client including a play list of clips;
FIG. 28 is a diagram showing locations of the data structures of FIG. 21 in the video file server of FIG. 2;
FIG. 29 is a state diagram of controller server states when processing recording commands from a client;
FIG. 30 is a state diagram of controller server states when processing play commands from a client;
FIG. 31 is a flow chart of a program executed by an active controller server in response to an "open play" command from a client;
FIG. 32 is a flow chart of a program executed by an active controller server in response to a "pause" command from a client;
FIG. 33 is a flow chart of a program executed by an active controller server when reaching the end of a play-list during the playing of continuous media data for a client;
FIG. 34 is a first portion of a flow diagram of a client-server protocol for open network connectivity and broadcast automation;
FIG. 35 is a second portion of a flow diagram begun in FIG. 34;
FIG. 36 is a state diagram of controller server states during the client-server protocol introduced in FIGS. 34 and 35;
FIG. 37 is a flow chart of a program routine in the controller server for processing an edit command and checking whether or not an edit is too close to broadcast time to avoid interruption of a broadcast transmission;
FIG. 38 is a flow chart of a top-level program routine loaded into each of the controller servers;
FIG. 39 is a flow chart of a program routine executed by the controller server having slave status;
FIG. 40 is a flow chart of a program routine executed by the controller server having master status;
FIG. 41 is a flow chart of a program routine executed by the controller server having master status for recovering from a stream server failure;
FIG. 42 is a flow chart of a program routine executed by the controller server having master status for performing transparent failover of a stream; and
FIG. 43 is a flow chart of a program routine executed by the controller server having master status for performing, under client control, failover of a stream.
While the invention is susceptible to various modifications and alternative forms, a specific embodiment thereof has been shown in the drawings and will be described in detail. It should be understood, however, that it is not intended to limit the invention to the particular form shown, but on the contrary, the intention is to cover all modifications, equivalents, and alternatives falling within the scope of the invention as defined by the appended claims.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT
I. The Architecture of the Video File Server
Turning now to FIG. 1 of the drawings, there is shown a video file server generally designated 20 incorporating the present invention. The video file server 20 includes an array of stream servers 21, at least one control server 28, 29, a cached disk array storage subsystem 23, and an optional tape silo 24. The video file server 20 is a high performance, high capacity, and high-availability network-attached data server. It provides the ability for multiple file systems to exist concurrently over multiple communication stacks, with shared data access. It also allows multiple physical file systems to co-exist, each optimized to the needs of a particular data service.
The video file server 20 is managed as a dedicated network appliance, integrated with popular network operating systems in a way, which, other than its superior performance, is transparent to the end user. It provides specialized support for isochronous data streams used in live, as well as store-and forward, audio-visual applications. Therefore, the video file server 20 is suitable for a wide variety of applications such as image repositories, video on demand, and networked video applications, in addition to high-end file server applications such as the Network File System (NFS, version 2 and version 3) (and/or other access protocols), network or on-line backup, fast download, etc. NFS is a well-known IETF file access protocol standard (RFC 1094, Sun Microsystems, Inc., "NFS: Network File System Protocol Specification," Mar. 1, 1989). NFS acts as a network server for network communications by providing basic file access operations for network clients. Such basic file access operations include opening a file, reading a file, writing to a file, and closing a file.
The clustering of the stream servers 21 as a front end to the cached disk array 23 provides parallelism and scalability. The clustering of random-access memory in the stream servers 21 provides a large capacity cache memory for video applications.
Each of the stream servers 21 is a high-end commodity computer, providing the highest performance appropriate for a stream server at the lowest cost. The stream servers 21 are mounted in a standard 19" wide rack. Each of the stream servers 21, for example, includes and Intel processor connected to a EISA or PCI bus and at least 64 MB of random-access memory. The number of the stream servers 21, their processor class (i486, Pentium, etc.) and the amount of random-access memory in each of the stream servers, are selected for desired performance and capacity characteristics, such as the number of concurrent users to be serviced, the number of independent multi-media programs to be accessed concurrently, and the desired latency of access to the multi-media programs, as will be further described below.
Each of the stream servers 21 contains one or more high-performance FWD (fast, wide, differential) SCSI connections to the back-end storage array. Each of the stream servers 21 may also contain one or more SCSI connections to the optional tape silo 24. Each of the stream servers 21 also contains one or more outbound network attachments configured on the stream server's EISA or PCI bus. The outbound network attachments, for example, are Ethernet, FDDI, ATM, DS1, DS3, or channelized T3
attachments to data links to a network (25 in FIG. 2). Each of the stream servers 21 also includes an additional Ethernet connection to a dual redundant internal Ethernet link (26 in FIG. 2) for coordination of the stream servers with each other and with one or more controller servers 28, 29.
The controller servers 28, 29 shown in FIG. 2 are dual redundant computers 28, 29, each of which is similar to each of the stream servers 21. Each of the dual redundant controller servers 28, 29 has a network attachment to a bidirectional link (30 in FIG. 2) in the network (25 in FIG. 2), through which each of the controller servers 28, 29 can conduct service protocols. The service protocols include one or more standard management and control protocols such as SNMP (RFC 1157, M. Schoffstall, M. Fedor, J. Davin, J. Case, "A Simple Network Management Protocol (SNMP)," May 10, 1990), and at least one Continuous Media File Access Protocol supporting isochronous real-time multi-media data transmission from the stream servers 21 to the network (25
in FIG. 2).
Each of the dual redundant controller servers 28, 29 has an Ethernet connection to the local Ethernet link 26. Each of the controller servers 28, 29 also has a connection to a serial link 31 to a media server display and keyboard 32. The controller servers 28, 29 run a conventional operating system (such as Windows NT or UNIX) to provide a hot-failover redundant configuration. An active one of the dual redundant controller servers 28, 29 functions as a media server controller for the video file server 20. The active one of the controller servers 28, 29 also allows management and control of the server resources from the network using standard protocols, such as the Simple Network Management Protocol (SNMP). SNMP is an internet protocol that permits inspection and modification of system variables such as the network address (IP) and the number of buffers for network communication. The active one of the controller servers 28, 29 may also provide lock management if lock management is not provided by the cached disk array 23.
For multi-media data transfer, the active one of the controller servers 28, 29 assigns one of the stream servers 21 to the network client 54 requesting multi-media service. The network 25, for example, has conventional switching mechanisms, such as an ATM switch 53 or arrays of cross-bar switches, that permit any one of the clients 54 to communicate with any one of the stream servers 21. The active one of the controller servers 28, 29 could assign a stream server to a network client by a protocol sending to the client the network address of the stream server assigned to send or receive data to or from the client. Alternatively, the active one of the controller servers 28, 29 could communicate with a switching mechanism such as the ATM switch 53 to establish a data link between the client and the stream server assigned to the client.
The cached disk array 23 is configured for an open systems network environment. Preferably the cached disk array 23 is a Symmetrix 5500 (Trademark) cached disk array manufactured by EMC Corporation, 171 South Street, Hopkinton, Mass.,
01748-9103.
Turning now to FIG. 2, there is shown a block diagram of the video file server 20 including the SCSI connections 40 among the cached disk array 23, the optional tape silo 24, the controller servers 28, 29, and the stream servers 21. The cached disk array 23 includes a large capacity semiconductor cache memory 41 and SCSI adapters 45 providing one or more FWD SCSI links to each of the stream servers 21 and to each of the dual redundant controller servers 28, 29.
The tape silo 24 includes an array of SCSI adapters 50 and an array of read/write stations 51. Each of the read/write stations 51 is connected via a respective one of the SCSI adapters 50 and a FWD SCSI link to a respective one of the stream servers 21 or each of the redundant controller servers 28, 29. The read/write stations 51 are controlled robotically in response to commands from the active one of the controller servers 28, 29 for tape transport functions, and preferably also for mounting and unmounting of tape cartridges into the read/write stations from storage bins.
In a preferred mode of operation, to archive data from a file from the network to tape, one of the stream servers 21 receives the file from the network 25 and prestages the file to the cached disk array 23 at a high rate limited by the network transmission rate (about 150 GB/hour). Then one of the stream servers 21 destages the file from the cached disk array 23 to an associated one of the read/write stations 51 at a tape device speed (about 7 GB/hour). For most applications, prestaging to disk can be done immediately, and staging from disk to tape including sorting of files onto respective tape cassettes can be done as a background operation or at night, when the load on the video server is at a minimum. In this fashion, the cached disk array 23 can absorb a high data inflow aggregation from tens or hundreds of network links streaming from multiple sites, and balance this load on the read/write stations 41. Prestaging to the cached disk array allows better use of the read/write stations 51, matching of server flow to tape streaming flow, and reduction of tape and read/write station wear. Prestaging to the back-end also allows multiple classes of backup and restore services, including instant backup for files maintained on disk in the cached disk array, and temporary batch backup pending a success or failure acknowledgment. Prestaging to the cached disk array 23 also makes economical an on-line archive service performing the staging from the cached disk array to tape as a background process.
Turning now to FIG. 3, there is shown a more detailed block diagram of the cached disk array 23. The cache memory 41 is composed of dynamic RAM cards mating with a dual redundant back-plane system bus 42. The cached disk array 23 also includes micro-processor cards that mate with the back-plane system bus 42 and are programmed to function as channel directors 43 or disk directors 44. Each of the channel directors 43 is interfaced through one of a number of SCSI adapters 45 to the SCSI interface of one of the stream servers 21. Each of the disk directors 44 is interfaced through at least one of a number of disk adapters 46 connected to a string of commodity FBA (fixed-block architecture) disk drives 47. The channel directors 43
access data in the cache memory 41 in response to a request from its associated stream server. If data to be read by a channel director are not found in cache memory, one of the disk directors 44 and disk adapters 46 transfers or "stages" the data from the disk array 47 to the cache memory 41. In a background process, the disk directors 44 and disk adapters 45 also write-back data from the cache memory 41 to the disk array 47, after the channel directors write data to the cache memory 41. In addition to providing intermediate storage for the data transferred between the channel directors 43 and the disk directors 44, the cache memory 41 also provides intermediate storage for control information transferred among the channel directors and disk directors.
The bus 42 is preferably the back-plane of a printed-circuit card-cage or main-frame in the cached disk array 23, and each of the channel directors 43 and disk directors 44 is constructed on a printed circuit board that is mounted in the card-cage or main-frame. The channel director and disk director boards are further described in Yanai et al. U.S. Pat. No. 5,335,352, issued Aug. 2, 1994, and entitled Reconfigurable, Multi-Function Disc Controller, incorporated herein by reference. The cache memory 13 is constructed on a number of additional printed circuit boards that are mounted in the card-cage or main-frame. Further details regarding the construction and operation of the cached disk array 23 are disclosed in Yanai et al., U.S. Pat. No. 5,206,939, issued Apr. 27, 1993; and Yanai et al. U.S. Pat. No. 5,381,539, issued Jan. 10, 1995; all incorporated herein by reference.
II. The Video File Server Software
Turning now to FIG. 4, there is shown a block diagram of software 60 providing a real-time processing environment in the video file server (20 of FIGS. 1 and 2). The software 60 is executed by the processors of the stream servers 21. The software 60 also provides an environment for managing files services and multiple high-performance data streams as well as a standard set of service-level application program interfaces (APIS) for developing and porting file service protocols (such as NFS).
In the processors of controller servers 28, 29, a software application is run by a general purpose operating system such as Microsoft NT, and a network client communicates service requests to the video file server only through the software application executing on an active one of the controller servers 28, 29. This software application executes as a central control to prevent the video file server from performing conflicting operations in response to concurrent requests from various network clients. For example, the video file server should not erase a file for one client while data from the file is being streamed to another client.
The software 60 includes a file system 61 for controlling transfer of data between the network 25 and the disk array (47 in FIG. 2) or tape silo (24 in FIGS. 1 and 2). A buffer cache 62 composed of part of the random-access memory of the stream servers 21 is used as a buffer for this data transfer.
The software 60 also includes a kernel program 63 for providing a real-time scheduler and an access control program for arbitrating among conflicting service requests. The kernel program 63 separates control information (file access and synchronization protocols) from the underlying data stream. The application software running on an active one of the controller servers 28, 29 includes an admission control program. The kernel program 63 includes a real-time scheduler. The admission control program running on the active one of the controller servers 28, 29 applies an admission control policy to determine whether a service request can be satisfied, and if so, sends the stream servers 21 appropriate control messages that invoke their real-time schedulers to schedule operations to satisfy the service request. The admission control policy considers the global resources available to satisfy the request, including the current loading of the stream servers 21, the cached disk array 23, and the optional tape silo 24. If the request requires an operation of a stream server 21, one of the stream servers is selected to perform the required operation, and the active one of the controller servers 28, 29 transmits an associated operational command over the local Ethernet (26 in FIG. 2) to the selected stream server. Each of the stream servers 26 includes a real-time scheduler to schedule the local operations required to satisfy an operational command from the active one of the controller servers 28, 29. Preferably, one or more of the stream servers 21 are kept in a standby mode, to be used as "hot spares" or replacements for any one of the other stream servers that fails to acknowledge commands from the active one of the controller servers 28, 29 or is otherwise found to experience a failure.
The software 60 further includes an SNMP management agent 64 supporting a Simple Network Management Protocol. SNMP is a standard internet protocol for inspecting and changing system variables. For example, the SNMP management agent is used when an operator at the media server display and keyboard (32 in FIG. 1) sets the network IP address of the video server (20 in FIG. 1).
Turning now to FIG. 5, there is shown a more detailed block diagram of the software structure 60. The file system 61 in FIG. 4 has been expanded into its components. These components are a common file system 71, a group of software modules providing communication between the common file system and the network, and a group of software modules providing communication between the common file system and the cached disk array 23 or tape silo 24. The common file system 71 uses the Virtual File System (VFS), which is an industry-standard back-end file system switch, to interface with the physical file systems 79. VFS translates NFS Common File System requests, and permits NFS access to CMFS movie files for editing. (The NFS Common File System Requests in themselves are translations of NFS requests to the intended physical file storage devices. NFS is one of the file access protocols 75.) The common file system 71 accesses the buffer cache 62 during data transfers between the network (25) and disk or tape storage (23, 24).
The group of software modules providing communication between the common file system and the network includes file access protocols 75 and a network server interface 73 using communication stacks 74 and network link drivers 72. The file access protocols 75 include a set of industry standard network server protocols such as NFS, as well as protocols for audio/video services, such as CMFAP. CMFAP is a continuous media file access protocol which provides functions such as opening a movie, playing a movie, stop play of a movie, and "fast forward" and "fast reverse" functions. Other file access protocols compatible with the network 25 could also be used, such as Novell NCP, LanManager, SMB, etc.
The file access protocols 75 are layered between the communication stacks 74 and the common file system 71. The communication stacks 74 provide the network access and connectivity for the data transmitted to the file access protocol layer 75
from the network link drivers 72. The communication stacks include TCP/IP, IPX/SPX, NETbeui, or others. The network server framework 73 allows porting of the network software and file access protocols 72, 74, 75. This framework 73 is System V Streams. There could be multiple concurrent instances of the file access protocols 75, communication stacks 74, and drivers 73.
The group of software modules providing communication between the common file system and the cached disk array 23 or tape silo 24 includes physical file systems 79 and SCSI CAM 76 which provides a standard framework (SCSI Common Access Method) to the SCSI bus drivers 77. The physical file systems 79 include a continuous media file system (CMFS) and at least one conventional industry standard-based file system such as the Unix ufs file system. Other industry standards-based file systems could also be used, such as VxFS, ISO9660, etc. The buffer cache 62 buffers data passed between the SCSI drivers 77 and the physical file system 79. There could be multiple concurrent instances of the network drivers 72, communication stacks 74, file access protocols 75, SCSI drivers 77, and physical file systems 79.
FIG. 6 is a specific example of software modules of FIG. 5. Two physical file systems are exported onto the network: a conventional UNIX File System (UFS) and a Continuous Media File System (CMFS). CMFS is a component of a software package available from EMC Corporation, 171 South Street, Hopkinton, Mass., 01748-9103. CMFS may be mounted on a directory within the UFS hierarchy, or it may be mounted on the root directory `/` as a stand-alone root file system. Both UFS and CMFS are exported onto the network using NFS. The file system switch that directs client NFS requests to the intended physical file system is implemented using a standard virtual file-system (Vnode/VFS) interface.
In addition to NFS, the file server supports a real-time Continuous Media File Access Protocol (CMFAP) for accessing CMFS. CMFAP provides a VCR-like functionality that includes commands to Play, Record, Pause, Restart, and Rewind. CMFAP also supports a set of management commands for opening and closing streams, listing all active streams, and redirecting an active playback stream to an alternate display destination. CMFAP may not be used for accessing UFS, but only for accessing CMFS.
The design of CMFS is guided by the following assumptions: (1) the majority of files in a video-on-demand system are large, on the order of a few hundred megabytes to a few tens of gigabytes; (2) access patterns are predominantly read-only; that is most files are accessed for real-time playback; and (3) most files are complete in that they contain interleaved audio and video, as opposed to having related audio and video data stored in two separate files. These assumptions suggested an extent-based approach to the design of CMFS on-disk structures. An extent-based file system allocates file space in large contiguous disk chunks called extents; the size of an extent is a system parameter. Extents of an appropriately chosen size promote file contiguity, simplify disk space management, and are well suited for large files. File contiguity benefits performance in the environment where most files are accessed for read-only, which is a design assumption. Assuming that most files contain interleaved audio and video, there is no need to leave gaps between blocks in anticipation of filling the gaps with frames of a related stream.
CMFS may span several disks. All disks that comprise CMFS are collectively called the CMFS volume set 80. When a new CMFS file is created, it is written to the disk that contains more free blocks than any other disk within the volume set. The reason for multi-disk volume sets is to increase capacity rather than provide load balancing. Load balancing may be accomplished by exporting multiple file systems.
Each disk in the CMFS volume set is divided into two areas: the data area and the inode area. The data area is used to store file data, while the inode area is used to store inodes that hold file metadata. In addition to the standard file metadata information, the inode contains an array of extent descriptors that locate each extent comprising the corresponding file. An extent descriptor may also point to an inode located on another disk. Such a descriptor is used to point to a continuation inode when a CMFS file spans multiple disks.
The file server software runs as an embedded system that includes a real-time kernel (63 in FIGS. 4 and 5). The main components of the kernel are a task scheduler, frameworks for writing device drivers, and a number of system services that are commonly found in similar real-time kernels. The system services include kernel interfaces to memory management, timers, synchronization, and task creation.
All kernel tasks run in a single unprotected address space. As a result of this, no copy operations are required to move data from disk to the network. Copying is eliminated by passing references to common buffers across all subsystems. Considerable efficiency is obtained for the video-on-demand service because of the elimination of copy operations by the processor. The only "incremental" work involved in transmitting a frame is due to cycle stealing by the DMA devices for moving data to and from memory. As a result, the predominant component of the service time for transmission of a frame is fixed, even though the size of the frame may vary, depending on the compression algorithm. The kernel exploits the fixed service time per frame in the scheduling and admissions control policy that is described below.
Even a simple video file server that provides playback only needs to receive data from the network and store it on disk. This happens when loading movies from the network. When data are received from the network, a single copy operation is used to move data from the network to the disk. Although the service time for receiving a frame varies according to the frame size, the service time for a network fragment of the frame is fixed (because of a fixed MTU packet size). The fixed per packet service time is used in the scheduling and admissions control policy for real-time tasks that receive network data.
III. The Kernel Scheduler
The kernel 63 uses the scheduler and admission control policy described in K. K. Ramakrishnan et al., "Operating System Support for a Video-On-Demand File Service," Multimedia Systems, Vol. 3, Springer-Verlag, 1995, pp. 53-65.
Three classes of schedulable tasks are supported: general-purpose, real-time, and isochronous tasks. These classes correspond to different kinds of requests that are likely to exist in a video-on-demand system. Real-time and isochronous tasks are known in the real-time literature as aperiodic and periodic tasks, respectively.
The design of the CPU scheduler is based on a combination of weighted round-robin and rate monotonic scheduling procedures. Tasks within the isochronous class are scheduled using a rate-monotonic procedure, while the real-time and general-purpose tasks are scheduled using the weighted round-robin scheme. The isochronous class is given the highest priority; that is, any task within the isochronous class always pre-empts a real-time or a general-purpose task.
Turning now to FIG. 7, there is shown a high level view of the three classes of schedulable tasks; namely, the general-purpose tasks 81, the real-time tasks 82, and the isochronous tasks 83.
The general-purpose class supports pre-emptible tasks that are suitable for low-priority background processing. In order to ensure that general-purpose tasks can always make progress, this class is granted a minimum CPU processing quantum.
The general-purpose class is implemented as a standard threads package, with a thread corresponding directly to a general-purpose task as described herein. A suitable threads package is described in A. D. Birrell, "An Introduction to Programming with Threads," Systems Research Center Technical Report, No. 35, Digital Equipment Corporation, Maynard, Mass., (1989).
The real-time class is suitable for tasks that require guaranteed throughput and bounded delay. Real-time tasks are not pre-emptible; however, a software provision is made to allow for the existence of safe "preemption windows" in which all isochronous tasks can be executed. A weight and a scheduling flag is assigned to every real-time task. The weight is used as the means to limit the amount of processing time taken by the real-time task at each invocation. The scheduling flag is used to indicate that the task has pending work and to signal the scheduler that the task needs to be invoked. The scheduling flag may be set by an interrupt service routine or a task of any class.
In the video file server, real-time tasks are used to implement "polling" device drivers and communication stacks. The method of polling for pending work, as opposed to interrupt-driven processing, contributes to system stability and alleviates most of the problems that arise during overloads. It also provides isolation between multiple real-time tasks that have differing performance requirements. Polling regulates the flow of traffic into the video file server. Just as flow control mechanisms, such as a leaky bucket scheme, protect network resources from large bursts, polling protects the end-system resources by regulating the frequency at which work queues are scanned and limiting the amount of work that may be performed during each scan of the round-robin schedule.
The real-time tasks are implemented as callable routines. Invoking a real-time task amounts simply to a procedure call.
The isochronous class supports real-time periodic tasks that require performance guarantees for throughout, bounded latency, and lower jitter. Low jitter reduces the amount of buffering needed at the client, which in turn improves the response time of interactive video applications. The isochronous tasks that support streams of different periods are assigned priorities (w1, w2, w3, etc.) on a rate-monotonic basis (i.e., a task with a higher frequency has a higher priority). Isochronous tasks also allow for a safe "preemption window" in which all higher priority isochronous tasks can be executed. Isochronous tasks are used to schedule periodic network transmission of audio and video frames. An isochronous task executes exactly once per period. In the preferred implementation, a single isochronous task services all client streams that have the same frame rate.
The scheduler executes isochronous tasks from a "Ready" queue 84 in which all isochronous tasks that are ready to run are arranged in order of decreasing priority (a task with the lowest period has the highest priority and resides at the head of the queue). An isochronous task is inserted in its appropriate place on the "Ready" queue 84 upon arrival. The arrival of isochronous tasks is generated by period timers. A unique periodic timer exists in the system for each distinct period among all the admitted isochronous tasks.
Whenever an isochronous task arrives, the scheduler determines whether a currently running task needs to be pre-empted. If the currently running task is a general-purpose task, it is pre-empted by the newly arrived isochronous task. If the currently running task is a real-time task, it will be pre-empted by the newly arrived isochronous task in the next "preemption window". If the currently running task is of the isochronous class, the scheduler compares its priority to that of the task currently at the head of the "Ready" queue 84. If the priority of the current task is lower, it is pre-empted at the next "preemption window" by the isochronous task from the head of the queue. The scheduler continues to execute isochronous tasks until the isochronous "Ready" queue 84 becomes empty. Whenever the queue is empty, the scheduler alternates between the real-time and general-purpose classes using a weighted round-robin scheme.
Selecting a real-time task involves scanning the set of scheduling flags 85; for each flag that is set, the scheduler invokes the corresponding task with the assigned weight as a parameter. The real-time task is expected to process at most the number of work units equal to the task's weight that was passed to it as a parameter. At the completion of each unit of work, the real-time task opens up the "preemption window" which is used by the scheduler to run all the isochronous tasks that may have arrived in the time it took the real-time task to process one unit of work. Upon exhausting the allowed number of work units (the weight) or less, the task voluntarily returns to the scheduler. After having completed one round of scanning the flags, the scheduler switches to the general purpose class.
General purpose tasks that are ready for execution are placed on a "GP ready" queue 86, which in our current implementation is served in a round-robin fashion. If the "GP ready" queue 86 is empty, the scheduler initiates a new round of servicing the real-time tasks. Otherwise, the scheduler starts the general-purpose quantum timer, and activates the first task from the "GP ready" queue 86. The task runs until it blocks or the quantum timer expires. If the task blocks, its context is saved on a wait queue 87 and the next task from the "GP ready" queue 86 is restored for execution. If the quantum timer expires, the scheduler saves the context of the currently running task at the end of the "GP ready" queue 86 and switches to a new round of servicing the real-time tasks. The execution of the general-purpose tasks may be preempted one or more times by the isochronous tasks. The execution of the general-purpose class continues after each preemption until the total time spent in processing general-purpose tasks reaches the guaranteed quantum.
In the absence of isochronous tasks, the scheduler can provide guarantees on throughput and delay bounds for real-time tasks (this assumes that all requests destined for a real-time task generate a constant amount of work). A maximum service delay is the time it takes to complete one round of real-time tasks scheduling plus the general purpose time quantum. Let R denote this maximum service delay in steady state. Weights may be assigned to real-time tasks to allocate and guarantee bandwidth averaged over the maximum service delay, R. If W denotes the weight given to a real-time task (the number of units of this task, or requests, processed in one round), then the task's steady state throughput is (W/R) requests per unit time.
An admission control policy is employed in order to ensure that a feasible schedule exists for all the admitted tasks; that is, all the admitted tasks can be scheduled using the combination of rate monotonic and weighted round-robin scheduling procedure described above without violating any performance guarantees. The admission control policy for access to processor resources balances the needs of the three classes of tasks: throughput and maximum delay requirements of the real-time tasks, a minimum guaranteed CPU quantum for the general-purpose tasks, and the periodic deadline-sensitive nature of the isochronous tasks. The admission control policy uses a time-based admission test for rate monotonic (isochronous) tasks with an adjustment to account for tolerable delay constraints imposed by the real-time tasks, with an adjustment to account for tolerable delay constraints imposed by the real-time tasks. Let L.sub.r denote the maximum delay that can be tolerated by any of the real-time tasks. Then a feasible schedule exists for a set of n isochronous tasks and m real-time tasks if the following two conditions hold true: ##EQU1## where C.sub.i run-time requirement of isochronous task i
T.sub.i the period of isochronous task i
W.sub.j weight assigned to real-time task j
r.sub.j run-time required by the real-time task j to process one request
Q time quantum assigned to the general-purpose class, i.e., GP class runs Q units of time every time interval of length L.sub.r
As noted above, C.sub.i is a fixed time per execution of isochronous task i. In the second step a test must be applied to each isochronous task i to ensure that its execution requirements can be fulfilled in the presence of all higher priority isochronous tasks. The test is as follows
It is convenient to describe the disk scheduling and admission control for access to storage devices by viewing the video file server operating in steady state. The steady state operation the video file server consists of servicing n streams at the rate of R.sub.i bytes/second for each stream (i.e., R.sub.l is the ith stream's playback rate). For each stream the video file server maintains two buffers: a disk buffer and a network buffer. In steady state, a network task empties the network buffer and a disk task fills up the disk buffer. The two operations are performed in parallel. The rate at which the network buffer is emptied needs to be equal to the rate at which the disk buffer is filled up; the goal is that both rates are the same as the stream's playback rate. When the network buffer is empty, the disk buffer is full. At that moment the buffers interchange their roles. The disk buffers are filled up for all the streams in a round-robin fashion. One round of filling up the disk buffers of all streams is known as the disk round-robin service time. We assume that disk transfers are not pre-emptible.
The admission control policy needs to ensure that the steady state operation of the video file server, as described above, is feasible. A new stream can be admitted if the following three conditions are satisfied. First, the rate at which the disk buffers are filled is greater or equal to the rate at which the network buffers are emptied. Second, sufficient buffer space exists for allocating disk and network buffers to all admitted streams, including the newly admitted stream. And third, the disk service time for all the streams does not exceed the minimum tolerable request latency. Request latency is the amount of time that elapses from the moment the server receives a request for the first frame of a stream until the moment the first frame is placed on the network. This is required in order to support interactive video applications, such as games.
The first condition is expressed by the following constraint: ##EQU3## where R.sub.i bytes/second is the playback rate of stream i and D.sub.min bytes/second is the minimal disk rate, including seek times, at which n disk buffers can be filled. It may be computed as follows ##EQU4## where R.sub.d bytes is the amount of contiguous data that the disk can transfer in 1 second, (without any seeks involved), and S.sub.max is the maximum disk seek time. It is assumed that in between servicing each stream, the disk has to perform a maximum seek.
The second condition is expressed by the following constraint: ##EQU5## where B.sub.i is the size of the disk buffer allocated to stream i, and M is the total amount of system memory from which the disk buffers are allocated. An equivalent amount of memory is available from which network buffers are allocated. B.sub.i bytes is the amount of data transferred from disk for session i during one round of the round-robin service for the admitted streams. Strategies for choosing an appropriate size for disk buffers are discussed below.
The third condition is expressed as follows: ##EQU6## where T denotes the maximum time taken by one round of filling up the disk buffers of all the streams (i.e., T is the sum of the disk service times for all streams in one round), B.sub.i and D.sub.min are given by equations (2) and (3), and L is the smallest among the maximum request latencies tolerated by any of the streams.
While describing conditions 2 and 3 for the admission control, we referred to B.sub.i, the size of a disk buffer allocated to stream i, without specifying how this size is chosen. In this section we discuss two strategies for choosing the disk buffer sizes, which is equivalent to determining the amount of data that should be transferred from the disk for each session during one round.
The "optimal strategy" is the one in which the amount of data transferred from disk for each stream is proportional to the stream's playback rate. The constant of proportionality is the disk service time for one round. The strategy is described as follows. Let M bytes denote the total amount of system memory from which the disk buffers are allocated for all streams. Then the maximum time taken by one round of filling up the disk buffers of all the streams is ##EQU7## where D.sub.min is the same as in equation (2). T is used as the constant of proportionality for sizing the disk buffers. The rate at which buffers are filled is (.SIGMA.B.sub.i)/T . The rate at which network buffers are drained is .SIGMA.R.sub.i. The simple constraint therefore is (.SIGMA.B.sub.i)/T.gtoreq..SIGMA.R.sub.i. This is simplistically satisfied for each stream if B.sub.i =T R.sub.i, where B.sub.i is the size of the disk buffer and the size of the disk read block for stream i, and R.sub.i is the stream's playback rate.
Thus, each stream consumes its network buffer in time T which is the exact amount of time needed by the round-robin service to fill up the disk buffers for all the streams. If any stream i reads more than its computed buffer size B.sub.i, then the round-robin time will take longer than T, causing some streams to starve. Similarly, if a stream i reads less than its computed buffer size B.sub.i, then additional seeks are introduced, causing unnecessary overhead for that stream and reducing D.sub.min. Thus, the chosen disk buffer size B.sub.i must be optimal for each stream.
Unfortunately, the optimal strategy suffers from two practical limitations. First, the disk round-robin service time T needed to compute each B.sub.i, depends on the number of currently active streams (that is, D.sub.min depends on n in (2)). Thus, T varies each time a new stream is admitted, or a previously active stream terminates. In order to comply with the optimal strategy during such transitions, it is necessary to re-size the disk buffers and readjust the amount of data that is read from the disk for each stream. Dynamically re-sizing the disk buffers may not be practical from an implementation point of view.
The second limitation of the optimal strategy is that a large amount of buffer space M may lead to an unreasonably large size of some disk buffer B.sub.i. It is unreasonable in the sense that it could greatly exceed the practical size for a disk read request. In this case, the disk buffer B.sub.i would need to be filled up by several disk reads, possibly resulting in an unpredictable number of disk seeks, if the file is not entirely contiguous.
The second strategy is designed to overcome the practical limitations inherent in the `optimal strategy`. In this "practical strategy" we impose a constraint that B.sub.i does not exceed B.sub.max, where B.sub.max is chosen to be a reasonable size for a disk read request. The disk buffer sizes are still allocated in proportion to the playback rate as follows: ##EQU8## where R.sub.max is the maximum playback rate, assumed to be known a priori.
This strategy, although practical for the purposes of implementation, is suboptimal in the theoretical sense that it will admit fewer streams than the "optimal strategy".
The disk scheduling and admission control procedures described above ensure that the playback rates of "real time" streams are satisfied. "Real-time" streams are those streams that require guaranteed response (served both by isochronous and real-time tasks). However, the real-time streams may not consume the entire disk bandwidth. In this case, it is desirable to specify a procedure by which non real-time disk requests (such as NFS) can receive the unused disk bandwidth without interfering with the real-time disk access requests.
A simple case is the one in which the playback of each stream occurs at a constant bit-rate. This situation arises when the video is recorded in its original uncompressed form (frame sizes are constant) or when the video is compressed at a constant bit-rate (MPEG I, for example). In the case of a constant playback rate, all real-time disk requests may be issued to the disk exactly at the beginning of every interval of length T (T is the worst case round-robin service time as computed in the previous section). Let k denote the number of active real-time streams. Then the number of real-time requests that may be issued to the disk every T period is n-k, where n is the maximum number of streams supported by the system, as was described in the previous section. The non real-time requests may be issued at any time within the interval T, as long as the round time to service k real-time streams plus the data transfer time of the non real-time requests does not exceed T.
A more complicated case arises when the playback of each stream occurs at a variable bit-rate (such as in motion JPEG, for example). In this case the admission control policy makes a conservative admission decision based on the assumption that the playback rate for each stream proceeds at a constant frame rate using the stream's maximum frame size. Variations from the maximum frame size, however, are used to accommodate non real-time requests, as described below. Since the network buffer empties at a variable rate, it is not possible to issue all the real-time disk requests at the beginning of every period of length T, as was the case with the constant playback rate. Each stream issues a disk read request as and when its network buffer becomes empty. Thus disk requests arrive at various times. For each real-time stream we maintain a sorted queue of the estimated time of arrival (ETA) of the next read request. As shown in the timing diagram of FIG. 8, the queue is sorted in increasing time order. Notice from FIG. 8 that a non real-time disk read may be issued in the slack time--an interval whose end points are now and the first ETA on the queue (ETA for session i).
Initially, the ETAS are computed based on draining the network buffer at the maximum rate. However, as each variable-sized frame is transmitted, its deviation from the maximum frame size is used to adjust the ETA of the corresponding stream. The adjustment involves moving the ETA forward in time, since the network buffer will hold data longer than the original worst case estimate based on the maximum frame size. The adjustment potentially increases the interval (slack time) in which the non-real time disk requests may be issued.
A drawback of the procedure described above is that its implementation may become computationally expensive since it involves sorting a potentially long queue of ETA entries. Therefore, an alternative procedure is considered for accommodating non real-time requests. The alternative procedure retains the ability of the previous procedure to accommodate non real-time requests during "slack" periods, while substantially reducing its computational complexity.
In the alternative procedure, some portion of the disk bandwidth is permanently allocated to non real-time requests. Let us denote this bandwidth in terms of the number of non real-time requests m that may be issued to the disk during each interval T (T is the worst case round-robin service time as computed in the previous section). Thus each interval of length T is allocated m credits for issuing non real-time requests. The procedure considers two cases: one in which a non real-time request arrives when credits are still available (m>0), and the other in which a request arrives when no credits are left (m=0).
In the first case (m>0), a request is issued to the disk and the number of credits for this interval is decremented by one. If the request completes in the same interval in which it was issued and the number of credits reaches zero, then the number of credits for this interval is incremented by one. If the request completes in the interval following the one in which it was issued, then the number of credits in this new interval is decremented by one.
In the second case (m=0), a credit is borrowed from the next interval, provided that the number of credits available for the next interval is greater than zero. A request issued on a borrowed credit always completes in the interval following the one in which it was issued, otherwise credits would have been available in the current interval. If the request completes before any of the real-time requests need to be issued in the new interval, then the borrowed credit is returned to the current interval (this is the interval from which the credit was borrowed previously).
The basic difference between the two procedures is that in the alternative procedure it is required to reserve a portion of the disk bandwidth for non real-time requests. While the previous procedure accommodates non real-time requests during the "slack" periods only, the alternative procedure accommodates these requests both during "slack" times and "reserved" times. The alternative procedure is more compatible with our CPU scheduling policy which guarantees progress to non real-time requests.
It may also be possible to accommodate non real-time requests simply by using two priority queues: a low priority for non real-time requests and a high priority for real-time requests. In order for such a scheme to work correctly, it is necessary to implement the priority queues at all levels including the lowest level that maintains queued disk requests, such as the disk adapter or the driver level. This scheme also requires that some portion of the disk bandwidth be reserved for non real-time requests.
IV. Prefetching to Service MultiDle Video Streams
One advantage to the video server architecture of FIG. 2 is that multiple video streams requested by multiple network clients can sometimes be serviced from the cache memory 41 of the cached disk array 23 without always fetching the video data from the disk array 47. This situation is illustrated in FIGS. 9 and 10.
In FIG. 9, video data are transmitted isochronously to a first network client from a buffer 91 in random access memory (RAM) in a first one of the stream servers (21 in FIG. 2). The buffer 91 is filled by data fetched from the cache 41 of the cached disk array (23 in FIG. 2). The cache 41 is filled by data prefetched from the disk array 47.
Turning now to FIG. 10, there is shown a flowchart of a prefetch task including steps for scheduling the transmission of video prefetch commands from one of the stream servers (21 in FIG. 2) to the cache disk array (23 in FIG. 2). As indicated for a first step 101, the video prefetch commands are used when the object being accessed by the stream server is a movie. If so, then in step 102 the stream server finds the next segment for the movie. The media server controller, for example, accesses a movie directory to obtain a list of the addresses of the movie segments in the cached disk array and the size or length of each segment, and transmits this list to the stream server as the object to be accessed. In step 102, the stream server obtains from this list the next segment address and the size of the next segment. Then in step 103 the stream server compares the size of this segment to a predetermined number N which is a limit on the amount of data to be prefetched in response to a single video prefetch command. If the segment size is greater than the number N, then in step 104 only a beginning portion of size N of this segment is prefetched by issuing a video prefetch command to the cached disk array (23 in FIG. 2); the rest of this segment is prefetched in one or more subsequent iterations beginning again in step 103. Otherwise, in step 105, the entire segment is prefetched by issuing a video prefetch command to the cached disk array (23 in FIG. 2). After steps 104 or 105, in step 106 execution branches to step 107 if the end portion of the segment has not been prefetched. In step 107 the segment size is reduced by N, in effect truncating the prefetched portion of the segment. After step 107, the prefetch task is suspended until it is time for the next video prefetch command (issued in steps 104 or 105), and then execution loops back to step 103 to continue prefetching the remaining portion of the segment. Otherwise, at the end of the segment, in step 109 the prefetching task is ended if there are no more segments of the movie to prefetch. If there are more segments of the movie to prefetch, in step 110, the prefetch task is suspended until it is time to prefetch the next segment.
There is a fetch task that is similar to the prefetch task shown in FIG. 10, except that a video fetch command instead of a video prefetch command is issued in the fetch task steps corresponding to steps 104 and 105. The time for the next fetch command is established by the requirement of isochronous video data delivery to the network client having requested the video data. Data are fetched sufficiently in advance of the required time for isochronous video delivery to the network client. The time for the next prefetch operation is established by synchronization between the prefetching of the movie with the fetching of the movie. Data are prefetched sufficiently in advance of its fetch time to guarantee that the data are in the cache of the cached disk array when the cached disk array receives the fetch command.
Turning now to FIG. 11, there is shown a flowchart of a video prefetch routine performed by the cached disk array in response to a video prefetch command from a stream server. The video prefetch routine ensures that data specified by the video prefetch command will be in the cache of the cached disk array at the time that the cached disk array receives a subsequent fetch command from the stream server. The execution of a video prefetch routine differs from a conventional cached disk array synchronous prefetch operation by ensuring that the video prefetch routine is executed on a high priority basis, and by ensuring that the prefetched video data are retained in the cache of the cached disk array until the subsequent prefetch command is serviced.
In a first step 121, the cached disk array channel director (43 in FIG. 3) having received the prefetch command identifies the next track in the video segment being prefetched. Next, in step 122, a cache directory in the cache memory (41 in FIG.
3) is inspected to determine whether the track is in the cache memory. If not, then in step 123, a cache slot is allocated to receive the track by removing the cache slot from the head of a "replacement queue" that keeps track of the "least recently used" cache slot or otherwise implements a replacement algorithm for the cache of the cached disk array. After step 123, in step 124, the track is staged from the disk array 47 and loaded into the cache slot.
If the track is found to be in the cache in step 122, or after the track is staged into the cache from disk in step 124, then in step 125 the requesting process is placed on a wait list for the track. In this fashion, the track can be retained in the cache until it is fetched by the process. In step 126 a time stamp for the track could also be reset to the current time, and used by a background process in the cached disk array to determine whether any track has been retained in the cache for any inordinate amount of time due to a failure of the process to fetch the video data from the cache. Upon finding that a track has been retained in the cache for an inordinate amount of time, the background process would return the cache slot to the head of the replacement queue and report to the video server manager that the process or processes on the wait list have experienced an error.
In a final step 126, execution loops back to step 121 if there are any more tracks in the video segment that need to be prefetched. If not, execution returns.
Turning now to FIG. 12, there is shown a flowchart of a video fetch routine executed by a channel director (43 in FIG. 3) of the cached disk array in response to a video fetch command from a stream server. In a first step 131, the channel director identifies the next track in the video segment to be fetched. Then in step 132, the channel director accesses the directory in the cache memory (41 in FIG. 3) to determine whether data of the track is in the cache and to determine the cache slot containing the data of the track. If the track is not in the cache, then presumably an error has occurred, because each video fetch command specifying a video segment should have been preceded by a video prefetch command specifying the same video segment, and the video prefetch command should have been executed prior to receipt of the video fetch command. Otherwise, in step 133, the data of the track are transferred from the cache slot to a channel director buffer. Next, in step 134, the data are transferred from the channel director buffer to the stream server having issued the fetch command, and in step 135, the process of the stream server having issued the fetch command is removed from the wait list for the cache slot.
In step 136, execution branches depending on whether the wait list is empty. If so, then in step 137, the cache slot is inserted at the head of the replacement queue, so that the cache slot can be used for receiving data staged from another track. After step 137, or when the wait list is not empty, execution continues to step 138. In step 138, execution loops back to step 131 if there are any more tracks in the segment to be fetched. If not, the video fetch routine is done, and execution returns.
If data prefetched from the disk array (47 in FIG. 3) is to be used only by a single network client, then it is desirable to minimize the amount of memory space allocated in the cache 41 and in the stream server buffer 91 for storing the data. This is done by scheduling the fetch operation no more in advance of the delivery of the data to the network client than is necessary to guarantee that the fetched data will be available in the stream server buffer 91 at the scheduled time for delivery of the data to the network client, and scheduling the prefetch operation no more in advance of the delivery of the data from the cache 41 than is necessary to guarantee that prefetched data will be available in the cache when the fetch operation attempts to fetch the data from the cache.
If data prefetched from the disk array (47 in FIG. 3) will be used by multiple network clients, then it may be desirable to allocate more than the minimum amount of memory for storing the data in the cache of the cached disk array or in the stream server buffer. For example, the amount of memory to allocate for a movie-on-demand request could be an increasing function of the popularity of the movie.
FIG. 13 shows a situation where data prefetched from the disk array 47 and stored in the cache 41 is used by more than one network client. In this situation, the same data previously fetched for the first network client is fetched from the cache
41 and transferred to a buffer 92 in RAM of a second one of the stream servers (21 in FIG. 2) and transmitted to a second network client. The loading on the disk array 47 is reduced because data are not prefetched from the disk array 47 separately and independently for each video stream. Instead, the data prefetched from the disk array 47 and stored in the cache of the cached disk array are shared between the two video streams through the two stream server buffers 91, 92 to the two network clients. This is a consequence of the fact that in the video prefetch routine of FIG. 11, if the data are already in the cache, then the data need not be staged from the disk array.
In the situation of FIG. 13, it may be desirable to schedule the prefetch operation further in advance of the delivery of the data from the cache 41 than is necessary to guarantee that prefetched data will be available in the cache 41 when the fetch operation attempts to fetch the data from the cache 41. It may be desirable to perform such advanced scheduling if the advanced scheduling would reduce the load on the disk array. The load on the disk array would be reduced if at the time of the advanced prefetch for the second network client, the data would reside in the cache of the cached disk array from a prefetch for a first network client. However, by scheduling prefetch far in advance, more cache memory resources would be allocated to servicing the second network client.
In general the desirability of advanced prefetch scheduling is function of the loading on the disk array 47, the loading or free memory capacity of the cache 41, the occurrence or probability of multiple fetch operations being needed to access the same movie, and the relative position or time difference of the different fetch operations on the same movie. In particular, advanced prefetching will not help unless there will be more than one prefetch operation on the same movie. The relative position or time difference between two prefetch operations on the same stream determines the amount of cache memory needed to eliminate additional disk accesses to support an additional one of the streams. Therefore, if the video file server would receive a request for supporting a new stream on a movie, it could decide whether or not to perform advanced prefetching, and to determine how far in advance to prefetch, in dependence on whether the video file server is already providing another network client with a video stream from the same movie, and the relative position or time difference in the movie between the newly requested stream and the closest existing stream. This time difference would set the cache memory requirements to support the new stream without requiring additional disk accesses. If the cache memory is available and it is less costly overall in system resources to support the new stream with cache memory instead of disk accesses, then advanced prefetching by an amount related to the time difference should be performed.
Turning now to FIG. 14, there is shown a first portion of a flowchart of a routine for computing the prefetch advance time (T.sub.A) for supporting a video stream of a new request for an "on demand" movie. Such a routine could be part of the admission policy of the kernel (63 in FIG. 5) of the video server manager. In a first step 141, execution branches depending on whether the new request is for the same movie as an existing stream.
If the new request is not for the same movie as an existing stream, then there is no need for advanced prefetching. In step 142, the prefetch advance time (T.sub.A) is set to the minimum time T.sub.MIN. Then in step 143, the kernel checks whether the minimum cache resources are available to support a new stream. If not, then the new request is rejected. Otherwise, in step 144, the kernel checks whether disk resources are available to support a new stream. If not, then the new request is rejected. Otherwise, execution continues in step 145 in FIG. 15. In step 145, the prefetch advance of the new request is set to T.sub.A, and the new request is accepted.
If the new request is for the same movie as an existing stream, then execution continues in FIG. 14 from step 141 to step 146. In step 146, the kernel finds the existing stream having a fetch or pre-fetch time closest in the movie to the fetch time for the new request. In step 147, execution branches depending on whether or not the new request is behind this stream in the movie. If the new request is not behind this existing stream, then in step 148 the kernel computes the time difference (T.sub.A) between the fetch time for the new request and the prefetch time for the existing stream. If the new request is behind this existing stream, then in step 149 the kernel computes the time difference (T.sub.A) between the fetch time of the existing stream and the fetch time of the new request. After step 148 or 149, execution continues in step 150 of FIG. 15.
In step 150 of FIG. 15, the kernel checks whether cache resources are available to support the caching of the movie for the computed time difference (T.sub.A). If not, then in step 151 the kernel checks whether disk resources are available to support a new stream. If not, then the request is rejected. If disk resources are available, then execution continues from step 151 to step 152. In step 152, the time difference (T.sub.A) is set to the minimum value (T.sub.MIN). Then in step 153, the kernel checks whether cache resources are available to support the caching of the movie for this minimum time. If not, then the new request is rejected. Otherwise, execution continues to step 145, where the prefetch advance of the new request is set to T.sub.A, and the request is accepted.
If in step 150, there are sufficient cache resources available, then execution continues to step 154, where execution branches depending on whether or not disk resources are available to support the new stream. If disk resources are available, then execution continues from step 154 to step 155, where the relative cost of the disk resources for supporting the requested video stream without advanced prefetching is compared to the relative cost of the cache resources for supporting the requested stream with advanced prefetching. For example, the relative cost of the disk resources for supporting the requested video stream without advanced prefetching could be expressed as the percentage of the required disk resources out of presently unused amount of disk resources, and the relative cost of the cache resources for supporting the requested stream with advanced prefetching could be expressed as a percentage of the required cache resources out of the presently unused amount of cache resources. If the relative cost of disk resources does not exceed the relative cost of cache resources, then execution continues from step 155 to step 152. Otherwise, execution branches from step 155 to step 156. Execution also branches to step 156 from step 154
when disk resources are not available to support the new request.
In step 156 execution branches to step 157 if the new request is behind the existing stream in the movie. In this case, in step 157, there is scheduled temporary prefetching for the new request, advanced by T.sub.MIN, to terminate at a time T.sub.A in the future. This temporary prefetching is scheduled to support the new stream until the time that the new stream caches up to the data having been staged into the cache for the existing stream. After step 157, execution continues to step
145, where the prefetch advance of the new request is set to T.sub.A, and the new request is accepted.
When the new request is ahead of the existing stream in the movie, execution continues from step 156 to step 158, where the prefetch advance of the new request is set to the minimum value T.sub.MIN. Then in step 159, the existing prefetching for the existing stream is scheduled to terminate in the future at a time of T.sub.A from the present time, and more advanced prefetching for the existing stream (advanced by an additional time of T.sub.A) is begun for the existing stream. In this fashion, the new request is accepted.
V. Staggered Stream Support for Video On Demand
The method of sharing prefetched data in the cache of the cached disk array to support more than one video stream as illustrated in FIG. 13 can be further adapted to permit sharing of fetched data in the RAM of a stream server to support more than one video stream from the RAM of the stream server. For video "on demand" service for popular movies, however, it is advantageous to initially allocate large amounts of random access memory of the stream servers to the popular movies, in order to reduce loading on the cache and the disk array. Such allocation of the server RAM to the popular movies ensures that each popular movie needs a minimum amount of cache and disk array resources.
Turning now to FIG. 16, there is shown a schematic diagram illustrating the preferred method of allocating server RAM to a popular movie. In the example in FIG. 16, a block of data for a third of a movie is stored in the RAM of each of four stream servers 91, 92, 93, and 94. In this example, there is a significant amount of overlap between the video data stored in the RAM of the four stream servers in order to simplify scheduling.
Preferably the block of data in the RAM of each of the four stream servers 91, 92, 93 and 94 is a sliding "window" into the movie. New data are added to each window, and old data are removed from each window, at the rate at which data are delivered to the network clients viewing the movie. The block of data providing such a sliding window, for example, is maintained as a simple circular queue. In this fashion, there is no need to re-allocate the network clients to different stream server PCs while a client is viewing a movie in an uninterrupted fashion. However, if a client would request a stop, fast-forward, or fast-reverse operation, it may be necessary to re-allocate a network client to a different stream server PC. In these cases, however, some delay would be acceptable before the client could resume the viewing of the movie. If a stop, fast-forward or fast-reverse operation takes the client's viewing out of the window, then the client's continued viewing of the movie can be treated similar to a new request.
The minimum number of stream server PCs required for supporting each movie according to the method of FIG. 16 is determined as follows. First, each movie needs a certain amount of RAM memory for storing the entire movie, plus a certain minimum amount of window overlap. The amount of RAM memory for storing a movie depends on the length of the movie (such as 90 minutes to 120 minutes) and the bit-rate (megabits per second) at which the encoded movie has been delivered; this rate is typically a function of the method by which the video data are encoded (such as MPEG I or MPEG II).
Second, each stream server PC can be configured with a maximum amount of RAM available as a buffer memory. This maximum amount of memory may limit the size of the window on a single stream server PC. The number of stream server PCs required for storing an entire movie in RAM is computed by dividing the total amount of RAM buffer memory needed for an entire movie (plus required overlap) by the amount of maximum RAM buffer memory of a single stream server PC, and rounding up to a whole number.
Third, each stream server PC can service only a limited number of video streams to the network clients. Given a certain maximum number of anticipated video streams, the minimum number of stream server PCs required for servicing this given number video streams is computed by dividing this given number by the number of video streams that can be serviced by each stream server PC, and rounding up to a whole number.
Finally, the minimum number of stream server PCs required in the system to support a singl