United States Patent5140592
Idleman , ; et al.August 18, 1992

Title

Disk array system

Abstract

A method and apparatus for controlling data flow between a computer and a group of memory devices arranged in a particular logical configuration. The system includes a group of first level controllers and a group of second level controllers. The first level controllers and the second level controllers work together such that if one of the second level controllers fails, the routing between the first level controllers and the memory devices is switched to a properly functioning second level controller without the need to involve the computer in the rerouting process. The logical configuration of the memory devices remain constant. The invention also includes switching circuitry which permits a functioning second level controller to assume control of a group of memory devices formerly primarily contolled by the failed second level controller. In addition, the invention provides error check and correction as well as mass storage device configuration circuitry.


Inventors:Idleman; Thomas E. (Santa Clara, CA), Koontz; Robert S.  (Atherton, CA), Powers; David T.  (Morgan Hill, CA), Jaffe; David H.  (Belmont, CA), Henson; Larry P.  (Santa Clara, CA), Glider; Joseph S.  (Palo Alto, CA), Gajjar; Kumar  (San Jose, CA)
Assignee:SF2 Corporation (Sunnyvale, CA)
Appl. No.:601482
Filed:October 22, 1990

Current U.S. Class:714/5 714/7 
Field of Search:371/8.1,8.2,9.1,10.1,11.1,11.2 364/245.3,245.4

U.S. Patent Documents
3303482February 1970Jenkins
3544777December 1970Winkler
3693159September 1972Hilberg
3772652November 1973Hilberg
3803560April 1974DeVoy et al.
3905023September 1975Perpiglia
3917933November 1975Scheuneman et al.
4070704January 1978Calle et al.
4093985June 1978Das
4207609June 1980Luiz et al.
4339804July 1982Davison et al.
4342079July 1982Stewart et al.
4464747August 1984Groudan
4467421August 1984White
4468731August 1984Johnson et al.
4507730March 1985Johnson et al.
4667326May 1987Young et al.
4722085January 1988Flora et al.
4761785August 1988Clark et al.
4786193August 1988Takamae
4814982March 1989Weir
4817035March 1989Timsit
4825403April 1989Gershenson et al.
4849929July 1989Timsit
4899342February 1990Potter et al.
4914656April 1990Dunphy et al.
Foreign Patent Documents
1233087., 0000GB
1497680., 0000GB
266789., 0000EP
369707., 0000EP
56-163596., 1981JP
56-88549., 1981JP
56-94593., 1981JP
57-111890., 1982JP
57-111893., 1982JP
57-169297., 1982JP
57-195397., 1982JP
58-83400., 1983JP
60-156152., 1985JP
61-99999., 1986JP
Other References
Blum, "Fast Access Disk File with Several Parallel Heads", IBM Technical Disclosure Bulletin, vol. 25, No. 6, Nov. 1982. .
W. Jilke, "Disk Array Mass Storage Systems: The New Opportunity," Amperif Corporation, Sep. 30, 1986. .
W. Jilke, "Economics Study of Disk Array Mass Storage Systems: The Cost Reduction Opportunity," Amperif Corporation, Mar. 24, 1987. .
Michelle Y. Kim, "Synchronized Disk Interleaving," IEEE Transactions on Computers, vol. C-35 No. 11, Nov. 1986. .
D. Lieberman, "SCSI-2 Controller Board Builds Parallel Disk Drive Arrays," Computer Design, vol. 28, No. 7, Apr. 1, 1989, pp. 32, 36. .
W. Meador, "Disk Array Systems," Spring COMPCON 89 Digest of Papers, IEEE Computer Society Press, pp. 143-146. .
T. Olson, "Disk array Performance in a Random IO Environment," Computer Architecture, vol. 17, No. 5, Sep. 1989, pp. 71-77. .
D. Patterson et al., "A Case for Redundant Arrays of Inexpensive Disks (RAID)," Report No. UCB/CSD 87/391, Dec. 1987. .
Product Description, Micropolis 1804 SCSI Parallel Drive Array, Document No. 108120 Rev A. .
Program Summary, DataStorage 86, An International Forum, Sep. 22-24, 1986, Red Lion Inn, San Jose, California. .
H. Sierra, "Assessing the Promise of Disk Arrays," Canadian Datasystems, May 1989, pp. 52-53. .
D. Simpson, "RAIDs vs. SLEDs." Systems Integration, Nov. 1989, pp. 70-82. .
Mike Sisley, "Microprogram Development Technique Adds Flexibility," New Electronics, vol. 17, No. 23 Nov. 27, 1984, pp. 35-38. .
J. Voelker, "Winchester Disks Reach for a Gigabyte," IEEE Spectrum, Feb. 1987, pp. 64-67..~
Primary Examiner: Atkinson; Charles E.
Attorney, Agent or Firm:Townsend & Townsend

Parent Case Text



CROSS REFERENCE TO RELATED APPLICATIONS

This application is a continuation-in-part of Ser. Nos. 07/505,622, 07/506,703, and 07/488,749, filed Apr6, 1990, Apr. 6, 1990, and Mar. 2, 1990, respectively.

Claims


What is claimed is:
1. A system for storing data received from an external source, comprising:
at least two control means for providing control of data flow to and from the external source;
a plurality of storage means coupled to said at least two control means wherein said storage means are divided into groups and each group is controlled by at least two of said control means, each of said control means including means for determining if the other control means has failed and means for transfering control of a particular group of storage means to said control means from the failed control means, such that in the case that a first control means coupled to a particular group of storage means fails, control of said particular group is assumed by a second control means;
a plurality of data handling means coupled to said at least two control means for diassembling data into data blocks to be written across a group of said storage means;
error detection means coupled to said control means and said storage means for calculating at least one error detection term for each group of storage means based on the data received from the external source using a selected error code and providing said error detection term to be compared with data to detect errors, said error detection means being coupled to each of said control means to receive the data from said control means and transmit said error detection term to an error code storage means in said group of storage means; and
a plurality of buffer means coupled to a first bus and to said control means for buffering data received by and transmitted from the system;
wherein said control means further comprises:
a plurality of switching means, each switching means being coupled both to each storage means and each buffer means for providing switchable control of any of said storage means; and
switch control means coupled to each switching means for controlling said switching means to allow data to flow from a selected buffer means to a selected storage means and to flow from said selected storage means to said selected buffer means.

2. A system for storing data received from an external source, comprising:
at least two control means for providing control of data flow to and from the external source;
a plurality of storage means coupled to said at least two control means wherein said storage means are divided into groups and each group is controlled by at least two of said control means, each of said control means including means for determining if the other control means has failed and means for transfering control of a particular group of storage means to said control means from the failed control means, such that in the case that a first control means coupled to a particular group of storage means fails, control of said particular group is assumed by a second control means;
a plurality of data handling means coupled to said at least two control means for disassembling data into data blocks to be written to said storage means;
error detection means coupled to said control means and said storage means for calculating at least one error detection term based on the data received from the external source using a selected error code and providing said error detection term to be compared with data to detect errors, said error detection means being coupled to each of said control means to receive the data from said control means and transmit said error detection term to an error code storage means in said group of storage means; and
a plurality of buffer means coupled to a first bus and to said control means for buffering data received by and transmitted from the system;
wherein said error detection means routes data from any selected buffer means to any selected storage means and from any selected storage means to any selected buffer means.

3. A system for storing data received from an external source, comprising:
at least two control means for providing control of data flow to and from the external source;
a plurality of storage means coupled to said at least two control means wherein said storage means are divided into groups and each group is controlled by at least two of said control means, each of said control means including means for determining if the other control means has failed and means for transfering control of a particular group of storage means to said control means from the failed control means, such that in the case that a first control means coupled to a particular group of storage means fails, control of said particular group is assumed by a second control means;
a plurality of data handling means coupled to said at least two control means for disassembling data into data blocks to be written across a group of said storage means;
error detection means coupled to said control means and said storage means for calculating at least one error detection term for each group of storage means based on the data received from the external source using a selected error cod and providing said error detection term to be compared with data to detect errors, said error detection means being coupled to each of said control means to receive the data from said control means and transmit said error detection term to an error code storage means in said group of storage means;
wherein each of said plurality of control means comprises:
a plurality of first-level means for handling input/output storage communication with the external source; and
a plurality of second-level means, each connected to each of said plurality of first-level means and at least one partner second-level means, for providing a data path from said first-level means to each of a primary group of said storage means, and for providing a data path to a secondary group of said storage means.

4. The system of claim 3 wherein each of said plurality of second-level means is configured to maintain a particular logical configuration of said plurality of storage means transparent to the external source in case a failure occurs in said partner second-level means by providing a data path to a secondary group of said storage means formerly primarily controlled by said at least one partner second-level means.

5. The system of claim 3 further comprising first communication lines connected between each of said plurality of second-level means such that when one of said second-level means fails, said one second-level means informs an other second-level means that a failure has occurred so that said other second-level means can assume primary control of a subset of storage means formerly primarily controlled by said one second-level means.

6. The system of claim 3 further comprising a switching function connected between said plurality of second-level means and said plurality of storage means for transferring control form a first second-level means upon failure to a second-level means.

7. The system of claim 3 wherein each second-level means further comprises a second-level means recovery system for issuing signals to and receiving signals from said first-level means and said at least one partner second-level means.

8. The system of claim 3 wherein each second-level means further comprises a state machine capable of maintaining said second-level controller in a number of states, each state representing a system configuration wherein said second-level means controls particular storage means.

9. A system for storing data received from an external source, comprising:
at least two control means for providing control of data flow to and from the external source;
a plurality of storage means coupled to said at least two control means wherein said storage means are divided into groups and each group is controlled by at least two of said control means, each of said control means including means for determining if the other control means has failed and means for transfering control of a particular group of storage means to said control mean from the failed control means, such that in the case that a first control means coupled to a particular group of storage means fails, control of said particular group is assumed by a second control means;
a plurality of data handling means coupled to said at least two control means for diassembling data into data blocks to be written across a group of said storage means; and
error detection means coupled to said control means and said storage means for calculating at least one error detection term for each group of storage means based on the data received from the external source using a selected error code and providing said error detection term to be compared with data to detect errors, said error detection means being coupled to each of said control means to receive the data from said control means and transmit said error detection term to an error code storage means in said group of storage means;
wherein said plurality of storage means are operatively interconnected to function at a first logical level as a plurality of redundancy groups, each of said redundancy groups including at a second logical level at least one data group, each data group capable of operating as a separate logical storage device.

10. The system of claim 9 wherein each redundancy group comprises a plurality of data groups.

11. The system of claim 9 further comprising a third logical level, wherein at least one data group from each of at least two of the plurality of redundancy groups are combined to form a single logical mass data storage device.

12. The system of claim 9 wherein for at least one of the plurality of redundancy groups redundancy is provided by an error detecting and correcting code, the code words of which are stored in at least one check drive included in at least one redundancy group.

13. The system of claim 12 wherein each of the at least one check drives is a particular storage means.

14. The system of claim 12 wherein each of the at least one check drive is a logical mass storage device comprising portions of a plurality of said storage means.

15. The system of claim 9 wherein for at least one of the plurality of redundancy groups redundancy is provided by mirroring.

16. The system of claim 9 wherein redundancy data is stored in at least one redundant mass storage means included in the redundancy group.

17. In a system including at least two control means for communicating with an external source and a plurality of storage means wherein at least two of the control means are connected to each of the storage means, a method for storing data received from the external source comprising the steps of:
receiving data from the external source;
configuring the plurality of storage means into groups wherein each group is initially controlled by at least two of the control means such that in the case that one of the control means fails, the storage means of each group is accessible through another one of the control means;
disassembling data into groups of data blocks to be written to said plurality of storage means;
calculating at least one error detection term from said data using a selected error code;
storing said data blocks in a first of said groups of storage means;
storing said at least one error detection term in said first of said groups of storage means;
detecting a failure in a first second-level controller in said control means; and
switching a data path from aid first second-level controller to a second second-level controller transparently to the external source such that the storage means formerly controlled by said first second-level controller are maintained in communication with the external source.

18. In a system including at least two control means for communicating with an external source and a plurality of storage means wherein at least two of the control means are connected to each of the storage means, a method for storing data received from the external source comprising the steps of:
receiving data from the external source;
configuring the plurality of storage means into groups wherein each group is initially controlled by at least two of the control means such that in the case that one of the control means fails, the storage means of each group is accessible through another one of the control means;
disassembling data into groups of data blocks to be written to said plurality of storage means;
calculating at least one error detection term from said data using a selected error code;
storing said data blocks in a first of said groups of storage means;
storing said at least one error detection term in said first of said groups of storage means; and
configuring said storage means to function at a first logical level as a plurality of redundancy groups, each of said redundancy groups including at a second logical level at least one data group wherein each data group is capable of operating as a separate logical storage device.

19. The method of claim 18 wherein each redundancy group comprises a plurality of data groups.

20. The method of claim 18 further comprising the step of configuring the storage means with a third logical level, wherein at least one data group from each of at least two of the plurality of redundancy groups are combined to form a single logical mass storage device.

21. The method of claim 18 wherein for at least one of the plurality of redundancy groups redundancy is provided by an error detecting and correcting code, the code words of which are stored in at least one check storage means included in at least one redundancy group.

22. The method of claim 21 wherein each of the at least one check storage means is a particular storage means.

23. The method of claim 21 wherein each of the at least one check storage means is a logical mass storage device comprising portions of a plurality of the storage means.

24. The method of claim 21 wherein for at least one of the plurality of redundancy groups redundancy is provided by mirroring.

25. The method of claim 21 wherein redundancy data is stored in at least one redundant storage means included in the redundancy group.

26. A system for storing data received from an external source, comprising:
control means for providing control of data flow to and from the external source;
a plurality of storage means coupled to said control means wherein said storage means are divided into groups;
a plurality of data handling means coupled to said control means for disassembling data into data blocks to be written to said storage means;
error detection means coupled to said control means for receiving said data blocks in parallel form and detecting errors in each data block substantially simultaneously as said data blocks are written to said storage means;
wherein said control means further comprises:
a plurality of switching means, each switching means being coupled both to each storage means and each buffer means for providing switchable control of between any of said storage means and any of said buffer means; and
switch control means coupled to each switching means for controlling said switching means to allow data to flow from a selected buffer means to a selected storage means and to flow from said selected storage means to said selected buffer means.

27. The system of claim 26 wherein said error detection means routes data from any selected buffer means to any selected storage means and from any selected storage means to any selected buffer means.

28. A system for storing data received form an external source, comprising:
control means for providing control of data flow to and from the external source;
a plurality of storage means coupled to said control means wherein said storage means are divided into groups;
a plurality of data handling means coupled to aid control means for disassembling data into data blocks to be written to said storage means;
error detection means coupled to said control means for receiving said data blocks in parallel form and detecting errors in each data block substantially simultaneously as said data blocks are written to said storage means;
wherein said control means comprises:
a plurality of first-level means for handling input/output storage communication with the external source; and
a plurality of second-level means, each connected to each of said plurality of first-level means and at least one partner second-level means, for providing a data path from said first-level means to each of a primary group of said storage means, and for providing a data path to a secondary group of said storage means.

29. The system of claim 28 wherein each of said plurality of second-level means is configured to maintain a particular logical configuration of said plurality of storage means transparent to the external source in case a failure occurs in said partner second-level means by providing a data path to a secondary group of said storage means formerly primarily controlled by said at least one partner second-level means.

30. The system of claim 28 further comprising first communication lines connected between each of said plurality of second-level means such that when one of said second-level means fails, said one second-level means informs an other second-level means that a failure has occurred so that said other second-level means can assume primary control of a subset of storage means formerly primarily controlled by said one second-level means.

31. The system of claim 28 further comprising a switching function connected between said plurality of second-level means and said plurality of storage means for transferring control from a first second-level means upon failure to a second second-level means.

32. The system of claim 28 wherein each second-level means further comprises a second-level means recovery system for issuing signals to and receiving signals from said first-level means and said at least one partner second-level means.

33. The system of claim 28 wherein each second-level means further comprises a state machine capable of maintaining said second-level controller in a number of states, each state representing a system configuration wherein said second-level means controls particular storage means.

34. A system for storing data received from an external source, comprising:
control means for providing control of data flow to and from the external source;
a plurality of storage means coupled to said control means wherein said storage means are divided into groups;
a plurality of data handling means coupled to said control means for disassembling data into data blocks to be written to said storage means; and
error detection means coupled to said control means for receiving said data blocks in parallel form and detecting errors in each data block substantially simultaneously as said data blocks are written to said storage means;
wherein said plurality of storage means are operatively interconnected to function at a first logical level as a plurality of redundancy groups, each of said redundancy groups including at a second logical level at least one data group, each data group capable of operating as a separate logical storage device.

35. The system of claim 34 wherein each redundancy group comprises a plurality of data groups.

36. The system of claim 34 further comprising a third logical level, wherein at least one data group from each of at least two of the plurality of redundancy groups are combined to form a single logical mass data storage device.

37. The system of claim 34 wherein for at least one of the plurality of redundancy groups redundancy is provided by an error detecting and correcting code, the code words of which are stored in at least one check drive included in at least one redundancy group.

38. The system of claim 37 wherein each of the at least one check drives is a particular storage means.

39. The system of claim 37 wherein each of the at least one check drive is a logical mass storage device comprising portions of a plurality of said storage means.

40. The system of claim 34 wherein for at least one of the plurality of redundancy groups redundancy is provided by mirroring.

41. The system of claim 34 wherein redundancy data is stored in at least one redundant mass storage means included in the redundancy group.

42. In a system including control means for communicating with an external source and a plurality of storage means, a method for storing data received from the external source comprising the steps of:
receiving data from the external source;
disassembling the data into groups of data blocks to be written to said plurality of storage means;
calculating at least one error detection term for each data block substantially simultaneously;
storing said data blocks and at least one error detection term in a first of said groups of storage means substantially simultaneously;
detecting a failure in a first second-level controller in said control means; and
switching a data path from said first second-level controller to a second second-level controller transparently to the external source such that the storage means formerly controlled by said first second-level controller are maintained in communication with the external source.

43. In a system including control means for communicating with an external source and a plurality of storage means, a method for storing data received from the external source comprising the steps of:
receiving data from the external source;
disassembling the data into groups of data blocks to be written to said plurality of storage means;
calculating at least one error detection term for each data block substantially simultaneously;
storing said data blocks and at least one error detection term in a first of said groups of storage means substantially simultaneously; and
configuring said storage means to function at a first logical level as a plurality of redundancy groups, each of said redundancy groups including at a second logical level at least one data group wherein each data group is capable of operating as a separate logical storage device.

44. The method of claim 43 wherein each redundancy group comprises a plurality of data groups.

45. The method of claim 43 further comprising the step of configuring the storage means with a third logical level, wherein at least one data group from each of at least two of the plurality of redundancy groups are combined to form a single logical mass storage device.

46. The method of claim 43 wherein for at least one of the plurality of redundancy groups redundancy is provided by an error detecting and correcting code, the code words of which are stored in at least one check storage means included in at least one redundancy group.

47. The method of claim 46 wherein each of the at least one check storage means is a particular storage means.

48. The method of claim 46 wherein each of the at least one check storage means is a logical mass storage device comprising portions of a plurality of the storage means.

49. The method of claim 46 wherein for at least one of the plurality of redundancy groups redundancy is provided by mirroring.

50. The method of claim 46 wherein redundancy data is stored in at least one redundant storage means included in the redundancy group.

Description

BACKGROUND OF THE INVENTION

The present invention relates generally to memory storage devices. More particularly, the invention is a method and apparatus for interfacing an external computer to a set of storage devices which are typically disk drives.

Magnetic disk drive memories for use with digital computer systems are known. Although many types of disk drives are known, the present invention will be described as using hard disk drives. However, nothing herein should be taken to limit the invention to that particular embodiment.

Many computer systems use a plurality of disk drive memories to store data. A common known architecture for such systems is shown in FIG. 1. Therein, computer 10 is coupled by means of bus 15 to disk array 20. Disk array 20 is comprised of large buffer 22, bus 24, and a plurality of disk drives 30. The disk drives 30 can be operated in various logical configurations. When a group of drives is operated collectively as a logical device, data stored during a write operation can be spread across one or more members of an array. Disk controllers 35 are connected to buffer 22 by bus 24. Each controller 35 is assigned a particular disk drive 30.

Each disk drive within disk drive array 20 is accessed and the data thereon retrieved individually. The disk controller 35 associated with each disk drive 30 controls the input/output operations for the particular disk drive to which it is coupled. Data placed in buffer 22 is available for transmission to computer 10 over bus 15. When the computer transmits data to be written on the disks, controllers 35 receive the data for the individual disk drives from bus 24. In this type of system, disk operations are asynchronous in relationship to each other.

In the case where one of the controllers experiences a failure, the computer must take action to isolate the failed controller and to switch the memory devices formerly under the failed controller's control to a properly functioning other controller. The switching requires the computer to perform a number of operations. First, it must isolate the failed controller. This means that all data flow directed to the failed controller must be redirected to a working controller.

In the system described above, it is necessary for the computer to be involved with rerouting data away from a failed controller. The necessary operations performed by the computer in completing rerouting requires the computer's attention. This places added functions on the computer which may delay other functions which the computer is working on. As a result, the entire system is slowed down.

Another problem associated with disk operations, in particular writing and reading, is an associated probability of error. Procedures and apparatus have been developed which can detect and, in some cases, correct the errors which occur during the reading and writing of the disks. With relation to a generic disk drive, the disk is divided into a plurality of sectors, each sector having the same, predetermined size. Each sector has a particular header field, which gives the sector a unique address, a header field code which allows for the detection of errors in the header field, a data field of variable length and ECC ("Error Correction Code") codes, which allow for the detection and correction of errors in the data.

When a disk is written to, the disk controller reads the header field and the header field code. If the sector is the desired sector and no header field error is detected, the new data is written into the data field and the new data ECC is written into the ECC field.

Read operations are similar in that initially both the header field and header field error code are read. If no header field errors exist, the data and the data correction codes are read. If no error is detected the data is transmitted to the computer. If errors are detected, the error correction circuitry located within the disk controller tries to correct the error. If this is possible, the corrected data is transmitted. Otherwise, the disk drive's controller signals to the computer or master disk controller that an uncorrectable error has been detected.

In FIG. 2 a known disk drive system which has an associated error correction circuit, external to the individual disk controllers, is shown. This system uses a Reed-Solomon error detection code both to detect and correct errors. Reed-Solomon codes are known and the information required to generate them is described in many references. One such reference is Practical Error Correction Design for Engineers. published by Data Systems Technology Corp., Broomfield, Colo. For purposes of this application, it is necessary to know that the Reed-Solomon code generates redundancy terms, herein called P and Q redundancy terms, which terms are used to detect and correct data errors.

In the system shown in FIG. 2, ECC 42 unit is coupled to bus 45. The bus is individually coupled to a plurality of data disk drives, numbered here 47, 48, and 49, as well as to the P and Q term disk drives, numbered 51 and 53 through Small Computer Standard Interfaces ("SCSIs") 54 through 58. The American National Standard for Information Processing ("ANSI") has promulgated a standard for SCSI which is described in ANSI document number X3.130-1986.

Bus 45 is additionally coupled to large output buffer 22. Buffer 22 is in turn coupled to computer 10. In this system, as blocks of data are read from the individual data disk drives, they are individually and sequentially placed on the bus and simultaneously transmitted both to the large buffer and the ECC unit. The P and Q terms from disk drives 51 and 53 are transmitted to ECC 42 only. The transmission of data and the P and Q terms over bus 45 occurs sequentially. The exact bus width can be any arbitrary size with 8-, 16- and 32-bit wide buses being common.

After a large block of data is assembled in the buffer, the calculations necessary to detect and correct data errors, which use the terms received from the P and Q disk drives, are performed within the ECC unit 42. If errors are detected, the transfer of data to the computer is interrupted and the incorrect data is corrected, if possible.

During write operations, after a block of data is assembled in buffer 22, new P and Q terms are generated within ECC unit 42 and written to the P and Q disk drives at the same time that the data in buffer 22 is written to the data disk drives.

Those disk drive systems which utilize known error correction techniques have several shortcomings. In the systems illustrated in FIGS. 1 and 2, data transmission is sequential over a single bus with a relatively slow rate of data transfer. Additionally, as the error correction circuitry must wait until a block of data of predefined size is assembled in the buffer before it can detect and correct errors therein, there is an unavoidable delay while such detection and correction takes place.

As stated, the most common form of data transmission in these systems is serial data transmission. Given that the bus has a fixed width, it takes a fixed and relatively large amount of time to build up data in the buffer for transmission either to the disks or computer. If the large, single buffer fails, all the disk drives coupled thereto become unusable. Therefore, a system which has a plurality of disk drives which can increase the rate of data transfer between the computer and the disk drives and more effectively match the data transfer rate to the computer's maximum efficient operating speed is desirable. The system should also be able to conduct this high rate of data transfer while performing all necessary error detection and correction functions and at the same time provide an acceptable level of performance even when individual disk drives fail.

Another failing of prior art systems is that they do not exploit the full range of data organizations that are possible in a system using a group of disk drive arrays. In other words, a mass storage apparatus made up of a plurality of physical storage devices may be called upon to operate as a logical storage device for two concurrently-running applications having different data storage needs. For example, one application requiring large data transfers (i.e., high bandwidth), and &he other requiring high frequency transfers (i.e., high operation rate). A third application may call upon the apparatus to provide both high bandwidth and high operating rate. Known operating techniques for physical device sets do not provide the capability of dynamically configuring a single set of physical storage devices to provide optimal service in response to such varied needs.

It would therefore be desirable to be able to provide a mass storage apparatus, made up of a plurality of physical storage devices, which could flexibly provide both high bandwidth and high operation rate, as necessary, along with high reliability.

SUMMARY OF THE INVENTION

The present invention provides a set of small, inexpensive disk drives that appears to an external computer as one or more logical disk drives. The disk drives are arranged in sets. Data is broken up and written across the disk drives in a set, with error detection and correction redundancy data being generated in the process and also being written to a redundancy area. Backup disk drives are provided which can be coupled into one or more of the sets. Multiple control systems for the sets are used, with any one set having a primary control system and another control system which acts as its backup, but primarily controls a separate set. The error correction or redundancy data and the error detection data is generated "on the fly" as data is transferred to the disk drives. When data is read from the disk drives, the error detection data is verified to confirm the integrity of the data. Lost data from a particular disk drive can be regenerated with the use of the redundancy data and a backup drive can be substituted for a failed disk drive.

The present invention provides an arrangement of disk drive controllers, data disk drives and error correction code disk drives, the drives being each individually coupled to a small buffer memory and a circuit for error detection and correction. A first aspect of the present invention is error detection and correction which occurs nearly simultaneously with the transfer of data to and from the disk drives. The multiple buffer memories can then be read from or written to in sequence for transfers on a data bus to the system computer. Additionally, the error correction circuitry can be connected to all of the buffer memory/disk drive data paths through a series of multiplexer circuits called cross-bar ("X-bar") switches. These X-bar switches can be used to decouple failed buffer memories or disk drives from the system.

A number of disk drives are operatively interconnected sc as to function at a first logical level as one or more logical redundancy groups. A logical redundancy group is a set of disk drives which share redundancy data. The width, depth and redundancy type (e.g., mirrored data or check data) of each logical redundancy group, and the location of redundant information therein, are independently configurable to meet desired capacity and reliability requirements. At a second logical level, blocks of mass storage data are grouped into one or more logical data groups. A logical redundancy group may be divided into more than one such data group. The width, depth, addressing sequence and arrangement of data blocks in each logical data group are independently configurable to divide the mass data storage apparatus into multiple logical mass storage areas each having potentially different bandwidth and operation rate characteristics.

A third logical level, for interacting with application software of a host computer operating system, is also provided. The application level superimposes logical application units on the data groups to allow data groups, alone or in combination from one or more redundancy groups, to appear to application software as single logical storage units.

As data is written to the drives, the error correction circuit, herein called the Array Correction Circuit ("ACC"), calculates P and Q redundancy terms and stores them on two designated P and Q disk drives through the X-bar switches. In contrast to the discussed prior art, the present invention's ACC can detect and correct errors across an entire set of disk drives simultaneously, hence the use of the term "Array Correction Circuit." In the following description, the term ACC will refer only to the circuit which performs the necessary error correction functions. The codes themselves will be referred to as Error Correction Code or "ECC". On subsequent read operations, the ACC may compare the data read with the stored P and Q values to determine if the data is error-free.

The X-bar switches have several internal registers. As data is transmitted to and from the data disk drives, it must go through a X-bar switch. Within the X-bar switch the data can be clocked from one register to the next before going to the buffer or the disk drive. The time it takes to clock the data through the X-bar internal registers is sufficient to allow the ACC to calculate and perform its error correction tasks. During a write operation, this arrangement allows the P and Q values to be generated and written to their designated disk drives at the same time as the data is written to its disk drives, the operations occurring in parallel. In effect the X-bar switches establish a data pipeline of several stages, the plurality of stages effectively providing a time delay circuit.

In one preferred embodiment, two ACC units are provided. Both ACCs can be used simultaneously on two operations that access different disk drives or one can be used if the other fails.

The X-bar switch arrangement also provides flexibility in the data paths. Under control of the system controller, a malfunctioning disk drive can be decoupled from the system by reconfiguring the appropriate X-bar switch or switches and the data that was to be stored on the failed disk can be rerouted to another data disk drive. As the system computer is not involved in the detection or correction of data errors, or in reconfiguring the system in the case of failed drives or buffers, these processes are said to be transparent to the system computer.

In a first embodiment of the present invention, a plurality of X-bar switches are coupled to a plurality of disk drives and buffers, each X-bar switch having at least one data path to each buffer and each disk drive. In operation a failure of any buffer or disk drive may be compensated for by rerouting the data flow through a X-bar switch to any operational drive or buffer. In this embodiment full performance can be maintained when disk drives fail.

In another embodiment of the present invention, two ACC circuits are provided. In certain operating modes, such as when all the disk drives are being written to or read from simultaneously, the two ACC circuits are redundant, each ACC acting as a back-up unit to the other. In other modes, such as when data is written to an individual disk drive, the two ACCs work in parallel, the first ACC performing a given action for a portion of the entire set of drives, while the second ACC performs a given action which is not necessarily the same for a remaining portion of the set.

In yet another embodiment, the ACC performs certain self-monitoring check operations using the P and Q redundancy terms to determine if the ACC itself is functioning properly. If these check operations fail, the ACC will indicate its failure to the control system, and it will not be used in any other operations.

In still another embodiment, the ACC unit is coupled to all the disk drives in the set and data being transmitted to or from the disk drives is simultaneously recovered by the ACC. The ACC performs either error detection or error correction upon the transmitted data in parallel with the data transmitted from the buffers and the disk drives.

The present invention provides a speed advantage over the prior art by maximizing the use of parallel paths to the disk drives. Redundancy and thus fault-tolerance is also provided by the described arrangement of the X-bar switches and ACC units.

Another aspect of the present invention is that it switches control of disk drive sets when a particular controller fails. Switching is performed in a manner transparent to the computer.

The controllers comprise a plurality of first level controllers each connected to the computer. Connected to the other side of the first level controllers is a set of second level controllers. Each first level controller can route data to any one of the second level controllers. Communication buses tie together the second level controllers and the first level controllers can also communicate between themselves. In a preferred embodiment, the system is configured such that the second level controllers are grouped in pairs. This configuration provides each second level controller with a single associated back-up controller. This configuration provides for efficient rerouting procedures for the flow of data to the disk drives. For ease of understanding, the specification will describe the system configured with pairs of second level controllers. Of course, it should be understood that the second level controllers could be configured in groups of three or other groupings.

A switching function is implemented to connect each of the second level controllers to a group of disk drives. In the case that a second level controller should fail, the computer need not get involved with the rerouting of data to the disk drives. Instead, the first level controllers and the properly working second level controller can handle the failure without the involvement of the computer. This allows the logical configuration of the disk drives to remain constant from the perspective of the computer despite a change in the physical configuration.

There are two levels of severity of failures which can arise in the second level controllers. The first type is a complete failure. In the case of a complete failure, the second level controller stops communicating with the first level controllers and the other second level controller. The first level controllers are informed of the failure by the properly working second level controller or may recognize this failure when trying to route data to the failed second level controller. In either case, the first level controller will switch data paths from the failed second level controller to the properly functioning second level controller. Once this rerouted path has been established, the properly functioning second level controller issues a command to the malfunctioning second level controller to release control of its disk drives. The properly functioning second level controller then assumes control of these disk drive sets.

The second type of failure is a controlled failure where the failed controller can continue to communicate with the rest of the system. The partner second level controller is informed of the malfunction. The properly functioning second level controller then informs the first level controllers to switch data paths to the functioning second level controller. Next, the malfunctioning second level controller releases its control of the disk drives and the functioning second level controller assumes control. Finally, the properly functioning second level controller checks and, if necessary, corrects data written to the drives by the malfunctioning second level controller.

A further aspect of the present invention is a SCSI bus switching function which permits the second level controllers to release and assume control of the disk drives.

For a more complete understanding of the nature and the advantages of the invention, reference should be made to the ensuing detail description taken in conjunction with the accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram illustrating a prior art disk array system;

FIG. 2 is a block diagram illustrating a prior art disk array system with an error check and correction block;

FIG. 3 is a diagram illustrating a preferred embodiment of the overall system of the present invention;

FIG. 4 is a diagram showing a more detailed illustration of FIG. 3 including the interconnections of the switches and the disk drives within the disk drive sets;

FIG. 5 is a block diagram of the wiring between the controllers and the switches;

FIG. 6 is a block diagram showing the schematic circuitry of the switching function control circuitry shown in FIG. 5;

FIG. 7 is a recovery state transition diagram illustrating the various possible states of a particular second level controller;

FIGS. 8A-8I show the events which take place during the transition between each of the states shown in FIG. 7;

FIG. 9 is a block diagram of one preferred embodiment of the X-bar circuitry;

FIG. 10 is a block diagram of a preferred embodiment of the error check and correction circuitry;

FIG. 11 is a detailed block diagram of the X-bar switches and the ACC shown in FIG. 10;

FIGS. 12a and 12b show the logic operations necessary to calculate the P and Q error detection terms;

FIGS. 13a and 13b show how the Reed-Solomon codeword is formed and stored in one embodiment of the present invention;

FIGS. 14a and 14b show the parity detector and parity generator circuits in the ACC;

FIGS. 15, 16, 17, and 18 show, respectively, the data flow during a Transaction Mode Normal Read, a Transaction Mode Failed Drive Read, a Transaction Mode Read-Modify-Write Read and a Transaction Mode Read-Modify-Write Write;

FIG. 19 is a schematic diagram of a set of disk drives in which check data is distributed among drives of the set according to a known technique;

FIG. 20 is a schematic diagram of a mass storage system suitable for use with the present invention;

FIG. 21 is a schematic diagram of the distribution of data on the surface of a magnetic disk;

FIG. 22 is a schematic diagram of the distribution of data in a first preferred embodiment of a redundancy group according to the present invention;

FIG. 23 is a schematic diagram of the distribution of data in a second, more particularly preferred embodiment of a redundancy group according to the present invention;

FIG. 24 is a diagram showing how the memory space of a device set might be configured in accordance with the principles of the present invention; and

FIG. 25 is a diagram of an exemplary embodiment of data structures for mapping between the logical levels of the present invention.

DETAILED DESCRIPTION OF THE DRAWINGS

The preferred embodiments of the present invention comprise a system for mass data storage. In the preferred embodiments described herein, the preferred devices for storing data are hard disk drives, referenced herein as disk drives. Nothing herein should be understood to limit this invention to using disk drives only. Any other device for storing data may be used, including, but not limited to, floppy disks, magnetic tape drives, and optical disks.

Overall System Environment

One preferred embodiment of the present invention operates in the environment shown in FIG. 3. In FIG. 3, computer 10 communicates with a group of disk drive sets 18 through controller 11. In a preferred embodiment, controller 11 includes a number of components which permit computer 10 to access each of disk drive sets 18 even when there is a failure in one of the components of controller 11. As shown in FIG. 3, controller 11 includes a pair of two-level devices 13. Within each of the two-level devices 13 is a first level controller 12 and a second level controller 14. A switch 16 which comprises a group of switches permits computer 10 to access disk drive sets 18 through more than one path. In this way, if either of two-level devices 13 experience a failure in one of their components, the path may be re-routed without computer 10 being interrupted.

FIG. 3.5 is a diagram showing a pair of disk drive sets 18A and 18B connected to a pair of second level controllers 14A and 14B. Controllers 14A and 14B each include two interface modules 27 for interfacing second level controllers 14 with a pair of first level controllers 12 (shown in FIG. 3). Interface modules 27 are connected to buffers 330-335 335 which buffer data to be transmitted to and received from the disk drives. Second level controllers 14 are configured to be primarily responsible for one group of disk drives and secondarily responsible for a second group of disk drives. As shown, second level controller 14A is primarily responsible for disk drives 20A1, 20B1, 20C1, 20D1 and 20E1 and secondarily responsible for disk drives 20A2, 20B2, 20C2, 20D2 and 20E2. A spare drive 20X is shared by both second level controllers and is activated to take over for a disk drive which has failed.

The disk drives are connected to second level controllers 14 through a set of data interfaces 31. These interfaces are set by controllers 14 to configure the disk drives in a particular arrangement. For example, disk drives 20A1, 20B1, 20C1 and
20D1 and 20A2, 20B2, 20C2 and 20D2 may be set to store data for the system while disk drives 20E1 and 20E2 may be set to store error correction codes. If any of the drives fail, drive 20X is set to take its place. Of course, the disk drives of the system can be rearranged and may assume a wide variety of configurations.

FIG. 4 is more detailed diagram showing the interconnection of the components of the input/output system associated with computer 10 for accessing disk drive sets 18. Computer 10 has its input/output ports connected to the first level controllers 12A and 12B. Second level controllers 14A and 14B are shown connected to first level controllers 12A and 12B. The lines between second level controllers 14 and first level controllers 12 represent data buses through which data flows as well as control and status signals. The dashed line between second level controller 14A and second level controller 14B represents a communication line through which the second level controllers communicate with each other.

Second level controllers 14 are each connected to a group of disk drive sets 18A-18F through switches 16A-16F.

Disk drives 20 are arranged in a manner so that each second level controller 14 is primarily responsible for one group of disk drive sets. As shown in FIG. 4, second level controller 14A may be primarily responsible for three of the disk drive sets 18A-18F. Similarly, second level controller 14B may be primarily responsible for the remaining three disk drive sets 18A-18F Second level controllers 14 are secondarily responsible for the disk drives primarily controlled by the partner second level controller. In the particular arrangement shown in FIG. 4, second level controller 14A may be primarily responsible for the left three disk drive sets 18A, 18B and 18C and secondarily responsible for the right three disk drives sets 18D, 18E and
18F. Second level controller 14B is primarily responsible for the right three disk drive sets 18D, 18E and 18F and secondarily responsible for the left three disk sets 18A, 18B and 18C.

Each second level controller 14 contains a second level controller recovery system (CRS) 22. CRS 22 is a portion of software code which manages the communication between second level controllers 14 and first level controllers 12. CRS 22 is typically implemented as a state machine which is in the form of microcode or sequencing logic for moving second level controller 14 from state to state (described below). State changes are triggered as different events occur and messages are sent between the various components of the system.

An ECC block 15 is also included in each second level controller 14. ECC block 15 contains circuitry for checking and correcting errors in data which occur as the data is passed between various components of the system. This circuitry is described in more detail below.

FIG. 5 is a block diagram showing a more detailed illustration of the interconnections between second level controllers 14A and 14B and the disk drives. For simplicity, only a single disk drive port is shown. More disk drive ports are included in the system as shown in FIGS. 3 and 4.

Second level controller 14A has a primary control/sense line 50A for controlling its primary set of disk drives. An alternate control/sense line 52A controls an alternate set of disk drives. Of course, second level controller 14B has a corresponding set of control/sense lines. Data buses 54A (second level controller 14A) and 54B (second level controller 14B) carry the data to and from disk drives 20. These data buses are typically in the form of a SCSI bus.

A set of switches 16A-16F are used to grant control of the disk drives to a particular second level controller. For example, in FIG. 4, second level controller 14A has primary responsibility for disk drives 20A-20C and alternate control of disk drives 20D-20F. Second level controller 14B has primary control of disk drives 20D-20F and alternate control of disk drives 20A-20C. By changing the signals on control/sense lines 50 and 52, primary and secondary control can be altered.

FIG. 6 is a more detailed illustration of one of the switches 16A-16F. A pair of pulse shapers 60A and 60B receive the signals from the corresponding control/sense lines 50A and 52B shown in FIG. 5. Pulse shapers 60 clean up the signals which may have lost clarity as they were transmitted over the lines. Pulse shapers of this type are well known in the art. The clarified signals from pulse shapers 60 are then fed to the set and reset pins of R/S latch 62. The Q and Q outputs of latch 62
are sent to the enable lines of a pair of driver/receivers 64A and 64B. Driver/receivers 64A and 64B are connected between the disk drives and second level controllers 14A and 14B. Depending upon whether primary control/sense line 52B or alternate control/sense line 50A is active, the appropriate second level controller will be in control at a particular time.

FIG. 7 is a state transition diagram showing the relationships between the various states of CRS 22 (FIG. 3) of a particular second level controller 14. Each second level controller 14 must be in only one state at any particular point in time. Initially, assuming that the system is functioning properly and each second level controller 14 is primarily responsible for half of the disk drive sets 18 and secondarily responsible for half of the disk drive sets 18, second level controller 14 is in a PRIMARY STATE 26. While in PRIMARY STATE 26, two major events may happen to move a second level controller 14 from PRIMARY STATE 26 to another state. The first event, is the failure of the particular second level controller 14. If there is a failure, second level controller 14 shifts from PRIMARY STATE 26 to a NONE STATE 28. In the process of doing so, it will pass through RUN-DOWN-PRIMARIES-TO-NONE STATE 30.

There are two types of failures which are possible in second level controller 14. The first type of failure is a controlled failure. Further, there are two types of controlled failures.

The first type of controlled failure is a directed controlled failure. This is not actually a failure but instead an instruction input from an outside source instructing a particular second level controller to shut down. This instruction may be received in second level controller 14 from one of the following sources: An operator, through computer 10; a console 19 through a port 24 (e.g. RS-232) on the first level controller; a diagnostic console 21 through a port 23 (e.g. RS-232) on the second level controller; or by software initiated during predictive maintenance. Typically, such an instruction is issued in the case where diagnostic testing of a second level controller is to be conducted. In a directed controlled failure, the second level controller finishes up any instructions it is currently involved with and refuses to accept any further instructions. The second level controller effects a "graceful" shut down by sending out messages to the partner second level controller that it will be shutting down.

The second type of controlled failure is referred to as a moderate failure. In this case, the second level controller recognizes that it has a problem and can no longer function properly to provide services to the system. For example, the memory or drives associated with that second level controller may have malfunctioned. Therefore, even if the second level controller is properly functioning, it cannot adequately provide services to the system. It aborts any current instructions, refuses to accept any new instructions and sends a message to the partner second level controller that it is shutting down. In both controlled failures, the malfunctioning second level controller releases the set of disk drives over which it has control. These drives are then taken over by the partner second level controller.

The second type of failure is a complete failure. In a complete failure, the second level controller becomes inoperable and cannot send messages or "clean-up" its currently pending instructions by aborting them. In other words, the second level controller has lost its ability to serve the system. It is up to one of the first level controllers or the partner second level controller to recognize the problem. The partner second level controller then takes control of the drives controlled by the malfunctioning second level controller. The routing through the malfunctioning second level controller is switched over to the partner second level controller.

In all of the above failures, the switching takes place without interruption to the operation of the computer. Second level controllers 14 and first level controllers 12 handle the rerouting independently by communicating the failure among themselves.

Assuming there was a failure in second level controller 14A, second level controller 14A moves from PRIMARY STATE 26 through a transition RUN-DOWN-PRIMARIES-TO-NONE STATE 30 to NONE STATE 28. At the same time, properly functioning second level controller 14B moves from PRIMARY STATE 26 to BOTH STATE 32. The basis for the change in state of each of second level controllers 14A and 14B is the failure of second level controller 14A. When a second level controller fails, it is important to switch disk drive control away from the failed second level controller. This permits computer 10 to continue to access disk drives which were formerly controlled by a particular second level controller which has failed. In the current example (FIG. 4), disk drive sets 18A-18C are switched by switching functions 16A-16C so that they are controlled by second level controller 14B. Therefore, second level controller 14B is in BOTH STATE 32 indicating that it has control of the disk drive sets 18 for both second level controllers. Second level controller 14A now controls none of the disk drives and is in NONE STATE 28. The transition state 30 determines which of several possible transition paths is used.

If second level controller 14A is in NONE STATE 28 and second level controller 14B is in BOTH STATE 32 there are a number of options for transferring control of disk drive sets 18A-18F once second level controller 14A has been repaired. First, second level controller 14A and second level controller 14B could each be shifted back to PRIMARY STATE 26. This is accomplished for drive sets 18A-18C by second level controller 14A moving from NONE STATE 28 directly to PRIMARY STATE 26 along the preempt p line. Preempt p simply stands for "preempt primary" which means that second level controller 14A preempts its primary drives or takes control of them from second level controller 14B. At the same time, second level controller 14B moves from BOTH STATE 32 through a transition RUN-DOWN-SECONDARIES-TO-PRIMARIES STATE 34 and then to PRIMARY STATE 26.

A second alternative is for second level controller 14A to move from NONE STATE 28 to SECONDARY STATE 36. Once in SECONDARY STATE 36, second level controller 14A is in control of its secondary disk drive sets 18D-18F. Second level controller
14B concurrently moves from BOTH STATE 32 through RUN-DOWN-PRIMARIES-TO-SECONDARIES STATE 38 and on to SECONDARY STATE 36. When both second level controllers are in SECONDARY STATE 36, they are in control of their secondary disk drive sets. Second level controller 14A controls disk drive sets 18D-18F and second level controller 14B controls disk drive sets 18A-18C.

From SECONDARY STATE 36, a failing second level controller 14 may move through RUN-DOWN-SECONDARIES-TO-NONE STATE 40 to NONE STATE 28. If this occurs, the properly functioning partner second level controller 14 moves from SECONDARY STATE 36 to BOTH STATE 32 so that computer 10 could access any one of disk drive sets 18. As in the previous example, if second level controller 14A were to fail it moves from SECONDARY STATE 36 through RUN-DOWN-SECONDARIES-TO-NONE STATE 40 and into NONE STATE 28. At the same time, properly functioning second level controller 14B moves from SECONDARY STATE 36 along the preempt b/p line into BOTH STATE 32. Preempt b/p stands for "preempt both/primaries." In other words, all of the disk drives are preempted by the properly functioning second level controller.

If, for all sets 18, second level controller 14A is in NONE STATE 28 and second level controller 14B is in BOTH STATE 32, it is possible for second level controller 14A to take control of all sets 18 of disk drives. This is desirable if second level controller 14A were repaired and second level controller 14B failed. Second level controller 14A moves from NONE STATE 28 along the preempt b line to BOTH STATE 32. At the same time, second level controller 14B moves from BOTH STATE 32 through RUN-DOWN-BOTH-TO-NONE STATE 42 and into NONE STATE 28. At this point, second level controller 14A controls all disk drives while second level controller 14B controls none of the disk drives.

Various failures may trigger the movement of second level controllers 14 between states. Between states a number of events take place. Each of these events is described in FIGS. 8A-8I. In FIG. 8A, second level controller 14 is in PRIMARY STATE
26. There are three different events which can take place while second level controller 14 is in PRIMARY STATE 26. The first event is for a preempt message 100 to be received from the partner second level controller. At this point, the second level controller receiving such a message will take the secondary path, represented by block 102, and end up at BOTH STATE 32. The second path which may be taken is triggered by receipt of a message 104 from CRS 22 of the other second level controller. This may be some sort of communication which results in the second level controller remaining in PRIMARY STATE 26. It will report and return messages 106 to the other second level controller. The final path which may be taken results in second level controller ending up in RUN-DOWN-PRIMARIES-TO-NONE STATE 30. This path is triggered upon receipt of a message 108 to release both sets of drives or the primary disk drives. A timer is then set in block 110 and upon time out a message 112 is sent to the other second level controller to take control of the primary set of disk drives. Once in RUN-DOWN-PRIMARIES-TO-NONE STATE 30, second level controller 14 will eventually end up in NONE STATE 28.

FIG. 8B illustrates various paths from RUN-DOWN-PRIMARIES-TO-NONE STATE 30 to NONE STATE 28. Three possible events may take place. First, a message 114 may be received from another second level controller providing communication information. In this case, second level controller 14 reports back messages 116 and remains in RUN-DOWN-PRIMARIES-TO-NONE STATE 30. The second event which may occur is for the timer, set during transition from PRIMARY STATE 26 to RUN-DOWN-PRIMARIES-TO-NONE STATE 30
to time out 118. If this happens, second level controller 14 realizes that message 112 (FIG. 8A) didn't get properly sent and that there has been a complete failure. It releases control of both its primaries and secondary disk drives 122. It then ends up in NONE STATE 28. The third event which may occur while in RUN-DOWN-PRIMARIES-TO-NONE STATE 30 is for a response to be received 124 from message 112 (FIG. 8A) sent out while second level controller moved from PRIMARY STATE 26 to RUN-DOWN-PRIMARIES-TO-NONE STATE 30. This response indicates that the message was properly received. Second level controller 14 then releases its primary drives 126 and ends up in NONE STATE 28.

FIG. 8C covers the state transition between NONE STATE 28 and one of either BOTH STATE 32, PRIMARY STATE 26, or SECONDARY STATE 36. When in NONE STATE 28, second level controller 14 can only receive messages. First, it may receive a message 128
instructing it to preempt both its primary and alternative sets of disk drives. It performs this function 130 and ends up in BOTH STATE 32. A second possibility is for it to receive a preempt message 132 instructing it to preempt its primary set of drives. It performs this instruction and ends up in PRIMARY STATE 26. A third alternative is the receipt of a preempt message 136 instructing second level controller 14 to preempt its secondary drives. Upon performance of this instruction 138 it ends up in SECONDARY STATE 36. Finally, while in NONE STATE 28 second level controller 14 may receive communication messages 140 from its partner second level controller. It reports back 142 to the other second level controller and remains in NONE STATE 28.

FIG. 8D illustrates the movement of second level controller 14 from SECONDARY STATE 36 to BOTH STATE 32 or RUN- DOWN-SECONDARIES-TO-NONE STATE 40. While in SECONDARY STATE 36, any one of three messages may be received by second level controller
14. A first possibility is for a preempt both or primary message 144 to be received. At this point, second level controller 14 takes control of its primary drives 146 and ends up in BOTH STATE 32. A second possibility is for communication messages 148
to be received from the partner controller. This results in second level controller 14 reporting back 150 and remaining in its present SECONDARY STATE 36. Finally, a release both or secondary message 152 may be received. Second level controller 14
sets a timer 154 upon receipt of this message. It then sends out a message 156 indicating it is now in RUN-DOWN-SECONDARIES-TO-NONE STATE 40.

FIG. 8E shows the transition of second level controller 14 from RUN-DOWN-SECONDARIES-TO-NONE STATE 40 to NONE STATE 28. Three different messages may be received during RUN-DOWN-SECONDARIES-TO-NONE STATE 40. First, messages 158 from the partner second level controller may be received. Second level controller 14 then reports back (160) to its partner and remains in RUN-DOWN-SECONDARIES-TO-NONE STATE 40. A second possibility is for the timer, set between SECONDARY STATE 36 and the present state, to time out (162). This indicates that message 156 (FIG. 8D) was not properly sent out and received by the partner second level controller and that there has been a complete failure to second level controller 14. Second level controller 14 then reports out (164) that it will release both of its sets of disk drives 166. This results in it moving to NONE STATE 28. Finally, second level controller 14 may receive a response 168 to its message 156 (FIG. 8D) sent after setting the timer between SECONDARY STATE 36 and RUN-DOWN-SECONDARIES-TO-NONE STATE 40. Upon receiving this response, it releases its secondary drives and ends up in NONE STATE 28.

FIG. 8F illustrates the various paths from BOTH STATE 32 to any one of RUN-DOWN-PRIMARIES-TO-SECONDARIES STATE 38, RUN-DOWN-SECONDARIES-TO-PRIMARIES STATE 34 or RUN-DOWN-BOTH-TO-NONE STATE 42. A first possible message which may be received during BOTH STATE 32 is a release primary message 172. This will cause second level controller 14 to set a timer 174, send a message 176 indicating it is running down primaries, and wait in RUN-DOWN-PRIMARIES-TO-SECONDARIES STATE 38. A second message which may be received is a release secondaries message 180. Upon receiving release secondaries message 180, second level controller 14 sets a timer 182 and sends a message 184 indicating it has moved into RUN-DOWN-SECONDARIES-TO-PRIMARIES STATE 34. A third possibility for second level controller 14 is to receive communication messages 186 from its partner second level controller. It will report back (188) and remain in BOTH STATE 32. Finally, second level controller 14 may receive an instruction
190 telling it to release both primary and secondary sets of drives. At this point it sets the timer 192 and sends out a message 194 that it has released both primary and secondary drive sets. It will then remain in the RUN-DOWN-BOTH-TO-NONE STATE 42
until it receives further instructions from the other second level controller.

FIG. 8G shows the various paths by which second level controller 14 moves from RUN-DOWN-PRIMARIES-TO-SECONDARIES STATE 38 to one of either NONE STATE 28 or SECONDARY STATE 36. The first possibility is that second level controller 14 receives messages 196 from the other second level controller. It then reports back (198) and remains in RUN-DOWN-PRIMARIES-TO-SECONDARIES STATE 38. A second possibility is that the timer (174), set between BOTH STATE 32 and RUN-DOWN-PRIMARIES-TO-SECONDARIES STATE 38 times out (200). At this point, second level controller 14 realizes that message 176 (FIG. 8F) was not properly sent. A complete failure has occurred. The second level controller reports (202) that it has released both sets of disk drives, and releases both sets (204). Second level controller 14 then enters NONE STATE 28. Finally, a run down path response message 206 is received acknowledging receipt of message 176 (FIG. 8F) sent between BOTH STATE 32 and RUN-DOWN-PRIMARIES-TO-SECONDARIES STATE 38. Second level controller 14 releases its primary drives 208 and enters SECONDARY STATE 36.

FIG. 8H shows the possible paths down which second level controller 14 moves between RUN-DOWN-SECONDARIES-TO-PRIMARIES STATE 34 and one of either NONE STATE 28 or PRIMARY STATE 26. A first possibility is that second level controller 14 receives a message 210 from the other second level controller. It then reports back (212) and remains in RUN-DOWN-SECONDARIES-TO-PRIMARIES STATE 34. A second possibility is that the timer (182), set between BOTH STATE 32 and RUN-DOWN-SECONDARIES-TO PRIMARY-STATE 34 times out (214). If this occurs, second level controller 14 realizes that message 184 (FIG. 8F) was not properly sent. A complete failure has occurred. Second level controller then sends a message 216 indicating that it has released its drives and then it releases both primary and secondary disk drive sets (218) which it controls. Second level controller then moves into NONE STATE 28. Finally, a third possibility is that second level controller 14 receives a response 220 to message 184 (FIG. 8F) sent between BOTH STATE 32 and RUN-DOWN-SECONDARIES-TO-PRIMARIES-STATE 34. It will then release (222) its secondary drives and enter PRIMARY STATE 26.

FIG. 8I shows the possible paths illustrating the transition of second level controller between RUN-DOWN-BOTH-TO-NONE STATE 42 and NONE STATE 28. Three possible events may take place. First, a message 230 may be received from the other second level controller providing communication information. In this case, second level controller 14 reports back messages 232 and remains in RUN-DOWN-BOTH-TO-NONE STATE 42. The second event which may occur is for the timer (192), set during transition from BOTH STATE 32 to RUN-DOWN-BOTH-TO-NONE STATE 42, to time out (234). If this happens, second level controller 14 realizes that message 194 (FIG. 8F) sent during BOTH STATE 32 didn't get properly sent and that there has been a complete failure. It releases control of both its primaries and secondary disk drives (238). It then ends up in NONE STATE 28. The third event which may occur while in RUN-DOWN-BOTH-TO-NONE STATE 42 is for a response to be received (240) from message 194 (FIG. 8F) sent out while second level controller moved from BOTH STATE 32 to RUN-DOWN-BOTH-TO-NONE STATE 42. This response indicates that the message was properly received. Second level controller 14 then releases both sets of drives (242) and ends up in NONE STATE 28.

Rerouting Data Paths Between Buffers and Disk Drives

FIG. 9 illustrates a first preferred embodiment of circuitry for rerouting data paths between buffers and disk drives 20. In FIG. 9, X-bar switches 310 through 315 are coupled to a bus 309 communicating with the second level controller engine (see FIGS. 3 and 4). In turn, each X-bar switch is coupled by a bus to disk drives 20AI through 20A6 and to each buffer 330 through 336. Bus 350 couples each buffer to a first level controller which are coupled to a computer such as computer 10 (FIGS.
3 and 4). In this embodiment, although only six disk drives are illustrated, any arbitrary number could be used, as long as the illustrated architecture is preserved by increasing the number of X-bar switches and output buffers in a like manner and maintaining the interconnected bus structures illustrated in FIG. 9.

In operation, the second level controller will load various registers (not illustrated herein) which configure the X-bar switches to communicate with particular buffers and particular disk drives. The particular configuration can be changed at any time while the system is operating. Data flow is bi-directional over all the buses. By configuring the X-bar switches, data flowing from any given buffer may be sent to any given disk drive or vice versa. Failure of any particular system element does not result in any significant performance degradation, as data flow can be routed around the failed element by reconfiguring the registers for the X-bar switch. In a preferred mode of operation, data may be transferred from or to a particular disk drive in parallel with other data transfers occurring in parallel on every other disk drive. This mode of operation allows for a very high input/output rate of data as well as a high data throughput.

To illustrate this embodiment's mode of operation, the following example is offered. Referring to FIG. 9, assume that all data flow is initially direct, meaning, for example, that data in buffer 330 flows directly through X-bar switch 110 to disk drive 20A1. Were buffer 330 to fail, the registers of X-bar switch 310 could be reconfigured, enabling X-bar switch 310 to read data from buffer 335 and direct that data to disk drive 20A1. Similar failures in other buffers and in the disk drives could be compensated for in the same manner.

Generation of Redundancy Terms and Error Detection on Parallel Data

FIG. 10 illustrates a second preferred embodiment of the present invention. This second embodiment incorporates Array Correction Circuits ("ACCs") to provide error detection and correction capabilities within the same general architecture as illustrated for the first preferred embodiment shown in FIG. 9. To ease the understanding of this embodiment, the full details of the internal structure of both the X-bar switches (310 through 315) and the ACC circuits 360 and 370 are not shown in FIG.
10. FIGS. 11 and 12 illustrate the internal structure of these devices and will be referenced and discussed in turn. Additionally, bus LBE as illustrated in FIG. 10 does not actually couple the second level controller (FIGS. 3 and 4) directly to the X-bar switches, the ACCs, and the DSI units. Instead, the second level controller communicates with various sets of registers assigned to the X-bar switches, the ACCs and the DSI units. These registers are loaded by the second level controller with the configuration data which establishes the operating modes of the aforementioned components. As such registers are known, and their operation incidental to the present invention, they are not illustrated or discussed further herein.

The embodiment shown in FIG. 10 shows data disk drives 20A1 through 20A4 and P and Q redundancy term drives 20A5 and 20A6. A preferred embodiment of the present invention utilizes 13 disk drives: ten for data, two for P and Q redundancy terms, and one spare or backup drive. It will be understood that the exact number of drives, and their exact utilization may vary without in any way changing the present invention. Each disk drive is coupled by a bi-directional bus (Small Computer Standard Interface) to units 340 through 345, herein labelled DSI. The DSI units perform some error detecting functions as well as buffering data flow into and out of the disk drives.

Each DSI unit is in turn coupled by a bi-directional bus means to an X-bar switch, the X-bar switches herein numbered 310 through 315. The X-bar switches are coupled in turn to word assemblers 350 through 355 by means of a bi-directional bus. The bus width in this embodiment is 9 bits, 8 for data, 1 for a parity bit. The word assemblers assemble 36-bit (32 data and 4 parity) words for transmission to buffers 330 through 335 over bi-directional buses having a 36-bit width. When data flows from the output buffers to the X-bar switches, the word assemblers decompose the 36-bit words into the 9-bits of data and parity.

The X-bar switches are also coupled to ACC units 348 and 349. The interconnection between the X-bar switches and the ACCs is shown in more detail in FIG. 11. Each X-bar switch can send to both or either ACC the 8 bits of data and 1 parity bit that the X-bar switch receives from either the DSI units or the word assemblers. In turn, the X-bar switches can receive 9 bits of the P and Q redundancy terms calculated by the ACCs over lines E.sub.1 and E.sub.2. As shown, the ACCs can direct the P and Q redundancy terms to any X-bar switch, not being limited to the disk drives labelled P and Q. Depending on the configuration commanded by the second level controller, ACCs 348 and 349 can be mutually redundant, in which case the failure of one or the other ACC does not affect the system's ability to detect or correct errors, or each ACC can detect and correct errors on a portion of the total set of disk drives. When operating in this second manner, certain specific types of operations which write data to individual disk drives are expedited, as each ACC can write to a separate individual disk drive. The specific disk drives that the individual ACCs monitor can be reconfigured at any time by the second level controller.

The illustrated connections of the ACCs and the X-bar switches also allows data to be switched from any X-bar switch to any ACC once the second level controller configures the related registers. This flexibility allows data to be routed away from any failed disk drive or buffer.

FIG. 11 shows important internal details of the ACCs and the X-bar switches. X-bar switch 310 is composed of two mirror-image sections. These sections comprise, respectively, 9-bit tri-state registers 370/380, multiplexers 372/382, first 9-bit registers 374/384, second 9-bit registers 376/386, and input/output interfaces 379/389. In operation, data can flow either from the word assembler to the DSI unit or vice versa.

Although many pathways through the X-bar switch are possible, as shown by FIG. 11, two aspects of these pathways are of particular importance. First, in order to allow the ACC sufficient time to calculate P and Q redundancy terms or to detect and correct errors, a data pathway of several registers can be used, the data requiring one clock cycle to move from one register to the next. By clocking the data through several registers, a delay of sufficient length can be achieved. For example, assuming a data flow from the word assembler unit to a disk drive, 9 bits are clocked into 9-bit register 374 and tristate register 370 on the first clock pulse. On the next clock pulse, the data moves to 9-bit register 386 and through redundancy circuit 302 in the ACC 348 to P/Q registers 304 and 306. The next clock pulses move the data to the DSI unit.

The second important aspect of the internal pathways relates to the two tri-state registers. The tri-state registers are not allowed to be active simultaneously. In other words, if either tri-state register 370 or 380 is enabled, its counterpart is disabled. This controls data transmission from the X-switch to the ACC. The data may flow only from the DSI unit to the ACC or from the word assembler to the ACC, but not from both to the ACC simultaneously. In the opposite direction, data may flow from the ACC to the word assembler and the DSI simultaneously.

ACC unit 348 comprises a redundancy circuit 302, wherein P and Q redundancy terms are generated, P and Q registers 304 and 306, wherein the P and Q redundancy terms are stored temporarily, regenerator and corrector circuit 308, wherein the data from or to a failed disk drive or buffer can be regenerated or corrected, and output to interfaces 390, 391, 392 and 393.

Redundancy Generation and Error Checking Equations

The main functional components of the second preferred embodiment and their physical connections to one another have now been described. The various preferred modes of operation will now be described. In order to understand these functional modes, some understanding of the error detection and correction method used by the present invention will be necessary.

Various error detection and correction codes are known and used in the computer industry. Error-Control Coding and Applications, D. Wiggert, The MITRE Corp., describes various such codes and their calculation. The present invention in this second preferred embodiment is implemented using a Reed-Solomon error detection and correction code. Nothing herein should be taken to limit the present invention to using only a Reed-Solomon code. If other codes were used, various modifications to the ACCs would be necessary, but these modifications would in no way change the essential features of this invention.

Reed-Solomon codes are generated by means of a field generator polynomial, the one used in this embodiment being X.sup.4 +X+1. The code generator polynomial needed for this Reed-Solomon code is (X+a.sup.0).multidot.(X+a.sup.1)=X.sup.2 +a.sup.4
X+a.sup.1. The generation and use of these codes to detect and correct errors is known.

The actual implementation of the Reed-Solomon code in the present invention requires the generation of various terms and syndromes. For purposes of clarity, these terms are generally referred to herein as the P and Q redundancy terms. The equations which generate the P and Q redundancy terms are:

The P redundancy term is essentially the simple parity of all the data bytes enabled in the given calculation. The Q logic calculates the Q redundancy for all data bytes that are enabled. For Q redundancy, input data must first be multiplied by a constant "a" before it is summed. The logic operations necessary to produce the P and Q redundancy terms are shown in FIGS. 12a and 12b. All operations denoted by .sym. are exclusive-OR ("XOR") operations. Essentially, the final P term is the sum of all P.sub.i terms. The Q term is derived by multiplying all Q.sub.i terms by a constant and then XORing the results. These calculations occur in redundancy circuit 302 in ACC 260 (FIG. 11). The second preferred embodiment, using its implementation of the Reed-Solomon code, is able to correct the data on up to two failed disk drives.

The correction of data requires the generation of additional terms S.sub.0 and S.sub.1 within the ACC. Assuming that the P and Q redundancy terms have already been calculated for a group of data bytes, the syndrome equations ##EQU1## are used to calculate S.sub.0 and S.sub.1. For S.sub.0 an ACC register enables the necessary data bytes and the P redundancy to be used in the calculation. For S.sub.1, the necessary input data must first be multiplied by a.sub.i before being summed with the Q redundancy information.

As stated, an ACC can correct the data on up to two failed disk drives in this embodiment. The failed disk drive register (not illustrated) in the relevant ACC will be loaded with the address of the failed disk or disks by the second level controller. A constant circuit within the ACC will use the drive location information to calculate two constants k.sub.0 and k.sub.1 as indicated in Table 1 below, where i represents the address of the first failed disk drive, j is the address of the second failed disk drive, and a is a constant. The columns labelled Failed Drives indicate which drives have failed. Column k.sub.0 and k.sub.1 indicate how those constants are calculated given the failure of the drives noted in the Failed Drives columns.

TABLE 1 ______________________________________ Failed Drives k.sub.0 k.sub.1 ______________________________________ P -- 0 1 Q -- 1 0 i -- 0 1/ai i P 0 1/ai Q i 0 0 i j aj/ai+aj 1/ai+aj P Q 0 0 ______________________________________

The error correction circuits use the syndrome information S.sub.0 and S.sub.1, as well as the two constants k.sub.0 and k.sub.1 to generate the data contained on the failed disk drives. The error correction equations are as follows:

F.sub.1 is the replacement data for the first failed disk drive. F.sub.2 is the replacement data for the second failed disk drive. The equations which generate the P and Q redundancy terms are realized in combinatorial logic, as is partially shown in FIGS. 12a and 12b. This has the advantage of allowing the redundancy terms to be generated and written to the disk drives at the same time that the data is written to the drives. This mode of operation will be discussed later.

Operational Modes

Having described the aspects of the Reed-Solomon code implementation necessary to understand the present invention, the operational modes of the present invention will now be discussed.

The second preferred embodiment of the present invention operates primarily in one of two classes of operations. These are parallel data storage operations and transaction processing operation. These two classes of operations will now be discussed with reference to the figures, particularly FIGS. 10, 13 and 14 and Tables 2 through 7.

Although FIG. 10 only shows four data drives and the P and Q redundancy term drives, a preferred embodiment uses a set of 13 disk drives, 10 for data, 2 for the P and Q terms, and a spare. Although nothing herein should be construed to limit this discussion to that specific embodiment, parallel processing operations will be described with relation to that environment.

Parallel Processing Operations

In parallel processing operations, all the drives are considered to comprise a single large set. Each of the disk drives will either receive or transmit 9 bits of data simultaneously. The result of this is that the 9-bits of data appearing in the DSI units of all the drives simultaneously are treated as one large codeword. This result is shown in FIG. 13a. Codeword 400 comprises 9 bits of data from or for disk drive d.sub.n-1, 9 bits of data from or for disk drive d.sub.n-2, and so on, with the P and Q disk drives receiving or transmitting the P and Q redundancy term. In a parallel write operation, all the disk drives in the set, except for the spare disk drive, will receive a byte of data (or a redundancy term whose length is equal to the data byte) simultaneously. As shown, the same sector in all the disk drives will receive a part of codeword 400. For example, in the illustration, sector 1 of disk drive n-1 will receive a byte of data designated d.sub.n-1 from codeword 400, sector 1
of disk drive n-2 will receive a byte of data designated d.sub.n-2 from codeword 400 and so on.

In the actual implementation of this preferred embodiment, the codewords are "striped" across the various disk drives. This means that for each successive codeword, different disk drives receive the P and Q redundancy terms. In other words, drive d.sub.n-1 is treated as drive d.sub.n-2 for the second codeword and so on, until what was originally drive d.sub.n-1 receives a Q redundancy term. Thus, the redundancy terms "stripe" through the disk drives.

Pairs of P and Q Terms for Nibbles

Calculating the P and Q redundancy terms using 8-bit symbols would require a great deal of hardware. To reduce this hardware overhead, the calculations are performed using 4-bit bytes or nibbles. This hardware implementation does not change the invention conceptually, but does result in the disk drives receiving two 4-bit data nibbles combined to make one 8-bit byte. In FIG. 13b, codeword 450, as well as the illustrated sectors A of the disk drives, illustrate how the codeword is broken up and how the disk drives receive upper and lower 4-bit nibbles. Table 2 shows how, for codewords one through N, a different portion of the codeword is placed on the different drives. Each disk drive, for a given codeword, receives an upper and lower 4-bit nibble, designated with L's and U's, of the codeword. Additionally, the same section is used to store the nibbles on each of the disk drives used to store the codeword. In other words, for codeword.sub.1, the first sector of disk drive n-1 through 0
receives the nibbles.

TABLE 2 __________________________________________________________________________ CODEWORD - DATA AND P AND Q Sector of Sector of Sector of Sector of Sector of Drive d.sub.n-1 Drive d.sub.n-2 Drive d.sub.0 Drive P Drive Q __________________________________________________________________________ Codeword.sub.1 Codeword.sub.1 Codeword.sub.1 Codeword.sub.1 Codeword.sub.1 Codeword.sub.1 (d.sub.n-1.sbsb.L)(d.sub.n-1.sbsb.U) (d.sub.n-2.sbsb.L)(d.sub.n-2.sbsb.U) (d.sub.0.sbsb.L)(d.sub.0.sbsb.U) (P.sub.1.sbsb.L)(P.sub.1.sbsb.U) (Q.sub.1.sbsb.L)(Q.sub.1.sbsb.U) Codeword.sub.2 Codeword.sub.2 Codeword.sub.2 Codeword.sub.2 Codeword.sub.2 Codeword.sub.2 (d.sub.n-1.sbsb.L)(d.sub.n-1.sbsb.U) (d.sub.n-2.sbsb.L)(d.sub.n-2.sbsb.U) (d.sub.0.sbsb.L)(d.sub.0.sbsb.U) (P.sub.2.sbsb.L)(P.sub.2.sbsb.U) (Q.sub.2.sbsb.L)(Q.sub.2.sbsb.U) . . Codeword.sub.n Codeword.sub.n Codeword.sub.n Codeword.sub.n Codeword.sub.n Codeword.sub.n (d.sub.n-1.sbsb.L)(d.sub.n-1.sbsb.U) (d.sub.n-2.sbsb.L)(d.sub.n-2.sbsb.U) (d.sub.0.sbsb.L)(d.sub.0.sbsb.U) (P.sub.n.sbsb.L)(P.sub.n.sbsb.U (Q.sub.n.sbsb.L)(Q.sub.n.sbsb.U) __________________________________________________________________________

Referring back to FIG. 10, for a parallel data write to the disks, the data is provided in parallel from buffers 330, 331, 332 and 333 along those data buses coupling the buffers to X-bar switches 310, 311, 312, and 313 after the 36-bits of data are disassembled in word assemblers 350 through 353 into 9-bit words. These X-bar switches are also coupled to inputs D3, D2, D1 and D0, respectively, of ACC 348 and ACC 349. In parallel processing modes, the two ACCs act as mutual "backups" to one another. Should one fail, the other will still preform the necessary error correcting functions. In addition to operating in a purely "backup" condition, the second level controller engine configures the ACCs so that each ACC is performing the error detection and correction functions for a portion of the set, the other ACC performing these functions for the remaining disk drives in the set. As the ACC units are still coupled to all the disk drives, failure of one or the other unit does not impact the system as the operating ACC can be reconfigured to act as the dedicated ACC unit for the entire set. For purposes of discussion, it is assumed here that ACC 348 is operating. ACC 348 will calculate the P and Q redundancy term for the data in the X-bar switches and provide the terms to its E.sub.1 and E.sub.2 outputs, which outputs are coupled to all the X-bar switches. For discussion only, it is assumed that only the E.sub.2 connection of X-bar switch 314 and the E.sub.1 connection of X-bar switch 315 are enabled. Thus, although the data is provided along the buses coupling ACC 348's E.sub.1 and E.sub.2 output to all the X-bar switches, the Q term is received only by X-bar switch 314 and the P term is received by X-bar switch 315. Then, the Q and P terms are provided first to DSI units 344 and 345 and then disk drives 20A5 and 20A6. It should be recalled that the various internal registers in the X-bar switches will act as a multi-stage pipeline, effectively slowing the transit of data through the switches sufficiently to allow ACC 348's redundancy circuit 302 to calculate the P and Q redundancy terms.

As ACC 349 is coupled to the X-bar switches in a substantially identical manner to ACC 348, the operation of the system when ACC 349 is operational is essentially identical to that described for ACC 348.

Subsequent parallel reads from the disks occur in the following manner. Data is provided on bi-directional buses to DSI units 340, 341, 342 and 343. P and Q redundancy terms are provided by DSI units 345 and 344, respectively. As the data and P and Q terms are being transferred through X-bar switches 310 through 315, ACC 348 uses the P and Q terms to determine if the data being received from the disk drives is correct. Word assemblers 350 through 353 assemble successive 9-bit words until the next 36-bits are available. This 36-bits are forwarded to buffers 330 through 333. Note that the 9-bit words are transmitted to the buffers in parallel. If that data is incorrect, the second level controller will be informed.

During a parallel read operation, in the event that there is a failure of a disk drive, the failed disk drive will, in certain instances, communicate to the second level controller that it has failed. The disk drive will communicate with the second level controller if the disk drive cannot correct the error using its own corrector. The second level controller will then communicate with ACCs 348 and 349 by loading the failed drive registers in the ACC (not shown in the figures) with the address of the failed drive. The failed drive can be removed from the set by deleting its address from the configuration registers. One of the set's spare drives can then be used in place of the failed drive by inserting the address of the spare drive into the configuration registers.

The ACC will then calculate the replacement data necessary to rewrite all the information that was on the failed disk onto the newly activated spare. In this invention, the term spare or backup drive indicates a disk drive which ordinarily does not receive or transmit data until another disk drive in the system has failed.

When the data, P, and Q bytes are received, the ACC circuits use the failed drive location in the failed drive registers to direct the calculation of the replacement data for the failed drive. After the calculation is complete, the data bytes, including the recovered data, are sent to data buffers in parallel. Up to two failed drives can be tolerated with the Reed-Solomon code implemented herein. All operations to replace failed disk drives and the data thereon occur when the system is operating in a parallel mode.

Regeneration of data occurs under second level controller control. When a failed disk drive is to be replaced, the ACC regenerates all the data for the replacement disk. Read/write operations are required until all the data has been replaced. The regeneration of the disk takes a substantial amount of time, as the process occurs in the background of the system's operations so as to reduce the impact to normal data transfer functions. Table 3 below shows the actions taken for regeneration reads. In Table 3, i represents a first failed drive and j represents a second failed drive. In Table 3, the column labelled Failed Drives indicates the particular drives that have failed. The last column describes the task of the ACC given the particular indicated failure.

TABLE 3 ______________________________________ Regeneration Read Failed Drives ______________________________________ P -- ACC calculates P redundancy Q -- ACC calculates Q redundancy i -- ACC calculates replacement data for i drive i P ACC calculates replacement data for i drive and P redundancy Q i ACC calculates replacement data for i drive and Q redundancy j i ACC calculates replacement data for i and j drives P Q ACC calculates P and Q redundancy ______________________________________

It should be noted that if both a data disk drive and a redundancy disk drive fail, the data on the data disk drive must be regenerated before the redundancy terms on the redundancy drive. During a regeneration write, regeneration data or redundancy terms are written to a disk and no action is required from the ACC logic.

During a parallel read operation, it should also be noted that additional error detection may be provided by the ACC circuitry.

Table 4 indicates what actions may be taken by the ACC logic unit when the indicated drive(s) has or have failed during a failed drive read operation. In this operation, the drives indicated in the Failed Drives columns are known to have failed prior to the read operation. The last column indicates the ACC response to the given failure.

TABLE 4 ______________________________________ Failed Drives ______________________________________ P No action by ACC Q No action by ACC i ACC calculates replacement data i P ACC calculates the replacement data Q i ACC calculates the replacement data i j ACC calculates replacement data P Q No action by ACC ______________________________________

Transaction Processing Mode: Read

Transaction processing applications require the ability to access each disk drive independently. Although each disk drive is independent, the ACC codeword with P and Q redundancy is maintained across the set in the previously described manner. For a normal read operation, the ACC circuitry is not generally needed. If only a single drive is read, the ACC cannot do its calculations since it needs the data from the other drives to assemble the entire codeword to recalculate P and Q and compare it to the stored P and Q. Thus, the data is assumed to be valid and is read without using the ACC circuitry (see FIG. 15). Where drive 20C1 is the one selected, the data is simply passed through DSI unit 343 X-bar switch 312, word assembler 352 and buffer 332 to the external computer. If the disk drive has failed, the read operation is the same as a failed drive read in parallel mode with the exception that only the replacement data generated by the ACC is sent to the data buffer. In this case, the disk drive must notify the second level controller that it has failed, or the second level controller must otherwise detect the failure. Otherwise, the second level controller will not know that it should read all the drives, unless it assumes that there might be an error in the data read from the desired drive. The failed drive read is illustrated in FIG. 16, with drive 20C1 having the desired data, as in the example of FIG. 15. In FIG. 16, the second level controller knows that drive 20C1 has failed, so the second level controller calls for a read of all drives except the failed drive, with the drive 20C1 data being reconstructed from the data on the other drives and the P and Q terms. Only the reconstructed data is provided to its buffer, buffer 332, since this is the only data the external computer needs.

Transaction Processing Mode: Write

When any individual drive is written to, the P and Q redundancy terms must also be changed to reflect the new data (see FIG. 18). This is because the data being written over was part of a code word extending over multiple disk drives and having P and Q terms on two disk drives. The previously stored P and Q terms will no longer be valid when part of the codeword is changed, so new P and Q terms, P" and Q", must be calculated and written over the old P and Q terms on their respective disk drives. P" and Q" will then be proper redundancy terms for the modified code word.

One possible way to calculate P" and Q" is to read out the whole codeword and store it in the buffers. The new portion of the codeword for drive 20C1 can then be supplied to the ACC circuit along with the rest of the codeword, and the new P" and Q" can be calculated and stored on their disk drives as for a normal parallel write. However, if this method is used, it is not possible to simultaneously do another transaction mode access of a separate disk drive (i.e., drive 20A1) having part of the codeword, since that drive (20A1) and its buffer are needed for the transaction mode write for the first drive (20C1).

According to a method of the present invention, two simultaneous transaction mode accesses are made possible by using only the old data to be written over and the old P and Q to calculate the new P" and Q" for the new data. This is done by calculating an intermediate P' and Q' from the old data and old P and Q, and then using P' and Q' with the new data to calculate the new P" and Q". This requires a read-modify-write operation on the P and Q drives. The equations for the new P and Q redundancy are:

New P redundancy (P")=(old P-old data)+new data

New Q redundancy (Q")=(old Q-old data . a.sub.i)+new data ai

P'=old P-old data

Q'=old Q-old data . a.sub.i

Where a.sub.i is the coefficient from the syndrome equation S.sub.1 ; and

i is the index of the drive.

During the read portion of the read-modify-write, the data from the drive to be written to and the P and Q drives are summed by the ACC logic, as illustrated in FIG. 17. This summing operation produces the P' and Q' data. The prime data is sent to a data buffer. When the new data is in a data buffer, the write portion of the cycle begins as illustrated in FIG. 18. During this portion of the cycle, the new data and the P' and Q' data are summed by the ACC logic to generate the new P" and Q" redundancy. When the summing operation is complete, the new data is sent to the disk drive and the redundancy information is sent to the P and Q drives.

Parity Check of P and Q for Transaction Mode Write

During these read-modify-write operations, it is also possible that the ACC unit itself may fail. In this case, if the data in a single element were to be changed by a read-modify-write operation, a hardware failure in the ACC might result in the redundancy bytes for the new data being calculated erroneously. To prevent this occurrence, the parity detector and parity generator are made part of the ACC circuitry. This additional redundant circuit is shown in FIGS. 14a and 14b and resides within redundancy circuit 302 as shown in FIG. 11. When data is received by the ACC circuitry, parity is checked to insure that no errors have occurred using the P and Q redundancy terms. In calculating Q", new parity is generated for the product of the multiply operation and is summed with the parity of the old Q" term. This creates the parity for the new Q term. For the P byte, the parity bits from the data are summed with the parity bit of the old P term to create the new parity bit for the new P" term. Before writing the new data back to the disk drive, the parity of Q' (calculated as indicated previously) is checked. Should Q' be incorrect, the second level controller engine will be informed of an ACC failure. In this manner, a failure in the ACC can be detected.

The same operations are performed for a failed disk drive write in transaction processing operations as for parallel data writes, except that data is not written to a failed drive or drives.

With respect to transaction processing functions during normal read operations, no action is required from the ACC logic. The actions taken by the ACC logic during a failed drive read in transaction processing mode are listed in Table 5 below, where i and j represent the first and second failed drives. The columns labelled Failed Drives indicate which drives have failed. The last column indicates what action the ACC may or may not take in response to the indicated failure.

TABLE 5 ______________________________________ Failed Drives ______________________________________ P -- Redundancy drives are not read; no ACC action Q -- Redundancy drives are not read; no ACC action i -- ACC logic calculates replacement data and performs a parallel read i P ACC logic calculates replacement data and performs a parallel read Q i ACC logic calculates replacement data and performs a parallel read j i ACC logic calculates replacement data and performs a parallel read P Q No ACC action as only data disk drives are read ______________________________________

If two data disk drives fail, the ACC logic must calculate the needed replacement data for both disk dries. If only one failed drive is to be read, both failed drives must still be noted by the ACC logic.

In the read-before-write operation (part of the read-modify-write process), the ACC logic generates P' and Q' redundancy terms. Table 6 shows the action taken by the ACC logic when a failed disk drive read precedes a write in this process. Again, i and j represent the first and second failed drives. The columns headed by Failed Drives indicate which drives have failed, and the last column denotes the response of the ACC to the indicated failures.

TABLE 6 ______________________________________ Failed Drives ______________________________________ P -- ACC calculates Q' only Q -- ACC calculates P' only i -- ACC logic takes no action and all good data disk drives are read into data buffers i P All good data disk drives are read into data buffers Q i All good data disk drives are read into data buffers i j All good data disk drives are read into data buffers i failed Perform a parallel read, the ACC logic calculates drive the replacement data for the jth failed drive. Next, the remaining good data disk drives are read into the data buffers. P Q No read before write operation is necessary ______________________________________

When a failed data disk drive is to be written, all good data disk drives must be read so that a new P and Q redundancy can be generated. All of the data from the good data disk drive and the write data is summed to generate the new redundancy. When two data disk drives fail, the ACC logic must calculate replacement data for both failed drives. If only one drive is to be read, both must be reported to the ACC logic.

During write operations, the ACC continues to calculate P and Q redundancy. Table 7 shows the ACC's tasks during failed drive writes. Here P and Q represent the P and Q redundancy term disk drives, and i and j represent the first and second failed data disk drives. The columns Failed Drives denote the particular failed drives, and the last column indicates the ACC response to the failed drives.

TABLE 7 ______________________________________ Failed Drives ______________________________________ P -- ACC calculates Q redundancy only Q -- ACC calculates Q redundancy only i -- ACC calculates P and Q redundancy i P ACC calculates Q redundancy only Q i ACC calculates P redundancy only i j ACC calculates P and Q r