United States Patent 5,511,164
Brunmeier, et al.
April 23, 1996

Title

Method and apparatus for determining the source and nature of an error within a computer system

Abstract

A method and apparatus for identifying the source and nature of an error, without aborting the operation of the computer system. In one embodiment of the present invention, the source of the error may be a hardware element and the nature of the error may be identified as either fatal or non-fatal. If the nature of the error is considered non-fatal, the present invention may correct the error and continue the operation of the computer system. This may allow detected errors to be handled immediately after they occur, rather than aborting the operation of the computer system and waiting for a support controller or the like to analyze the error. This may significantly enhance the reliability and performance of a corresponding computer system. This may be especially important during time critical operations. Further, since the operation of the computer system may be aborted fewer times, the present invention may minimize the amount of data loss. This may be particularly important for high reliability computer applications, including banking applications and airline reservation applications, where the integrity of the database is of the utmost importance.


Inventors: Brunmeier; Terry J. (Shoreview, MN), Byers; Larry L. (Apple Valley, MN), Miller; John A. (Shoreview, MN), Robeck; Gary R. (Albertville, MN)
Assignee: Unisys Corporation (Blue Bell, PA)
Appl. No.: 396952
Filed: March 1, 1995

Current U.S. Class: 714/53; 714/54; 714/15; 714/24
Current International Class: G06F 11/07 (20060101)
Field of Search: 395/575,185.06,185.07,185.05,185.01,182.22,182.13,183.15 371/51.1,40.1,13,14,12,21.3

Primary Examiner: Beausoliel, Jr.; Robert W.
Assistant Examiner: Palys; Joseph E.
Attorney, Agent or Firm: Nawrocki, Rooney & Sivertson

Claims


We claim:
1. In a computer system which executes a program, the computer system having a memory and a support controller, the memory having a number of address locations wherein the number of address locations are capable of storing a number of corresponding data elements, the computer system capable of performing a number of write operations to the memory and further capable of performing a number of read operations to the memory, the improvement comprising:
a. detecting means coupled to the memory for detecting an error within a read data element wherein the read data element is read from a selected one of the number of address locations via one of the number of read operations;
b. storing means coupled to said detecting means for storing the selected one of the number of address locations when said error is detected;
c. interrupting means coupled to said detecting means and further coupled to the computer system for temporarily interrupting the execution of the program when said detecting means detects said error;
d. testing means coupled to said detecting means for writing and reading a predetermined number of test patterns to the selected one of the number of address locations, thereby determining whether said error is a fatal error;
e. reloading means coupled to said testing means and further coupled to said interrupting means for reloading a correct copy of the read data element into the selected one of the number of address locations via the support controller if said error is not a fatal error;
f. enabling means coupled to said testing means and further coupled to the computer system for enabling the computer system if said testing means determines that said error is not a fatal error; and
g. aborting means coupled to said testing means for aborting the execution of the program if said testing means determines that said error is a fatal error.

2. A computer system according to claim 1 wherein said detecting means comprises a parity check block.

3. A computer system according to claim 1 wherein said interrupting means comprises a test block.

4. An apparatus according to claim 1 wherein said testing means writes and reads a predetermined number of test patterns to a number of address locations including the selected one of the number of address locations, thereby determining whether said error is a fatal error.

5. An apparatus according to claim 1 wherein said error is deemed fatal if said error is a hard error.

6. An apparatus according to claim 1 wherein said storing means stores the selected one of the number of address locations in a register.

7. A computer system having a number of users, a support controller, and a memory having a number of address locations, the number of users being coupled to the memory via an address bus and a data bus, at least one of the number of users providing a number of addresses to the memory via the address bus, the memory providing a number of corresponding read data words to the at least one of the number of users via the data bus, comprising:
a. an error detect block coupled to the address bus and the data bus, said error detect block including:
i. error detection means coupled to the data bus for detecting an error in the number of corresponding read data words; and
ii. storing means coupled to said error detection means and further coupled to the address bus for storing a corresponding address when said error detecting means detects an error, said address corresponding to a memory location within the memory which provided said error;
b. a test block coupled to the address bus, the data bus, said error detect block, and the at least one of the number of users, said test block including:
i. interrupting means coupled to said detecting means and further coupled to the at least one of the number of users for temporarily interrupting the at least one of the number of users when said detecting means detects said error;
ii. testing means coupled to said storing means, said detecting means, and to the memory, said testing means writing and reading a number of test patterns to said corresponding address thereby determining if said error is a fatal error or a non-fatal error;
iii. reloading means coupled to said testing means for reloading a correct copy of said corresponding read data word to said corresponding address via the support controller if said testing means determines that said error is a non-fatal error;
iv. enabling means coupled to said testing means for enabling the at least one of the number of users if said testing means determines that said error is a non-fatal error; and
v. aborting means coupled to said testing means for aborting the at least one of the number of users if said testing means determines that said error is a fatal error.

8. A computer system according to claim 7 wherein said error detection means comprises a parity check circuit.

9. A computer system according to claim 8 wherein said storing means comprises a memory address register.

10. A computer system according to claim 9 wherein said testing means comprises a test pattern generator block.

11. An apparatus according to claim 7 wherein said testing means writes and reads a number of test patterns to a number of address locations including said corresponding address, thereby determining whether said error is a fatal error.

12. An apparatus according to claim 7 wherein said error is deemed fatal if said error is a hard error.

13. In a computer system which executes a program, the computer system having a memory and a support controller, the memory having a number of address locations wherein the number of address locations are capable of storing a number of corresponding data elements, the computer system capable of performing a number of write operations to the memory and further capable of performing a number of read operations to the memory, the improvement comprising:
a. a detecting circuit coupled to the memory for detecting an error within a read data element wherein the read data element is read from a selected one of the number of address locations via one of the number of read operations;
b. a storing circuit coupled to said detecting circuit for storing the selected one of the number of address locations when said error is detected;
c. an interrupting circuit coupled to said detecting circuit and further coupled to the computer system for temporarily interrupting the execution of the program when said detecting circuit detects said error;
d. a testing circuit coupled to said detecting circuit for writing and reading a predetermined number of test patterns to the selected one of the number of address locations, thereby determining whether said error is a fatal error;
e. a reloading circuit coupled to said testing circuit and further coupled to said interrupting circuit for reloading a correct copy of the read data element into the selected one of the number of address locations via the support controller, if said error is not a fatal error;
f. an enabling circuit coupled to said testing circuit and further coupled to the computer system for enabling the computer system if said testing circuit determines that said error is not a fatal error; and
g. an aborting circuit coupled to said testing circuit for aborting the execution of the program if said testing circuit determines that said error is a fatal error.

14. A computer system according to claim 13 wherein said detecting circuit comprises a parity check block.

15. A computer system having a number of users, a support controller, and a memory having a number of address locations, the number of users being coupled to the memory via an address bus and a data bus, at least one of the number of users providing a number of addresses to the memory via the address bus, the memory providing a number of corresponding read data words to the at least one of the number of users via the data bus, comprising:
a. an error detect block coupled to the address bus and the data bus, said error detect block including:
i. an error detection circuit coupled to the data bus for detecting an error in the number of corresponding read data words; and
ii. a storing circuit coupled to said error detection circuit and further coupled to the address bus for storing a corresponding address when said error detecting circuit detects an error, said address corresponding to a memory location within the memory which provided said error;
b. a test block coupled to the address bus, the data bus, said error detect block, and the at least one of the number of users, said test block including:
i. an interrupting circuit coupled to said detecting circuit and further coupled to the at least one of the number of users for temporarily interrupting the at least one of the number of users when said detecting circuit detects said error;
ii. a testing circuit coupled to said storing circuit, said detecting circuit, and to the memory, said testing circuit writing and reading a number of test patterns to said corresponding address thereby determining if said error is a fatal error or a non-fatal error;
iii. a reloading circuit coupled to said testing circuit for reloading a correct copy of said corresponding read data word to said corresponding address via the support controller if said testing circuit determines that said error is a non-fatal error;
iv. an enabling circuit coupled to said testing circuit for enabling the at least one of the number of users if said testing circuit determines that said error is a non-fatal error; and
v. an aborting circuit coupled to said testing circuit for aborting the at least one of the number of users if said testing circuit determines that said error is a fatal error.

16. A computer system according to claim 15 wherein said error detection circuit comprises a parity check circuit.

17. A computer system according to claim 15 wherein said storing circuit comprises a memory address register.

18. A computer system according to claim 15 wherein said testing circuit comprises a test pattern generator block.

19. A method for performing error detection within a computer system wherein the computer system has a number of users, a support controller and a memory, the number of users being coupled to the memory via an address bus and a data bus, at least one of the number of users executing a program therein, wherein the at least one of the number of users performs a number of read operations on the memory during the execution of the program via the address bus and the data bus, the method comprising the steps of:
a. determining if the at least one of the number of users is providing a read address to the memory via the address bus thereby performing a read operation on the memory, the memory providing a corresponding read data word to the at least one of the number of users via the data bus during said read operation;
b. determining if said corresponding read data word has an error therein;
c. interrupting the execution of the program in the at least one of the number of users, temporarily;
d. determining if said error is a fatal error or a non-fatal error;
e. aborting the execution of the program in the at least one of the number of users if said determining step (d) determines that said error is a fatal error; and
f. reloading the contents of the memory via the support controller if said determining step (d) determines that said error is a non-fatal error and allowing the at least one of the number of users to continue executing the program thereafter.

20. A method according to claim 19 wherein said determining step (d) further comprises:
a. writing and reading a number of test patterns to said corresponding address of the memory.

21. A method according to claim 19 wherein at least one of the number of users periodically reads a predetermined number of read addresses to determine if an error exists in the memory.

22. A method for performing error detection in a computer system wherein the computer system has a number of users and a memory, the number of users being coupled to the memory via an address bus and a data bus, at least one of the number of users executing a program therein, wherein the at least one of the number of users performs a number of read operations on the memory during the execution of the program via the address bus and the data bus, the method comprising the steps of:
a. determining if the at least one of the number of users is providing a read address to the memory via the address bus thereby performing a read operation on the memory, the memory providing a corresponding read data word to the at least one of the number of users via the data bus during said read operation;
b. determining if said corresponding read data word has an error therein;
c. interrupting the execution of the program in the at least one of the number of users, temporarily;
d. determining if said error is a fatal error or a non-fatal error;
e. aborting the execution of the program in the at least one of the number of users if said determining step (d) determines that said error is a fatal error; and
f. reloading the contents of the memory if said determining step (d) determines that said error is a non-fatal error and allowing the at least one of the number of users to continue executing the program thereafter.

23. A method for performing error detection in a computer system wherein the computer system has a number of users, a support controller, and a memory and wherein the memory has a number of addresses, the number of users being coupled to the memory via an address bus and a data bus, at least one of the number of users executing a program therein, wherein the at least one of the number of users performs a number of read operations on the memory during the execution of the program via the address bus and the data bus, the method comprising the steps of:
a. determining if the at least one of the number of users is providing a read address to the memory via the address bus thereby performing a read operation on the memory, the memory providing a corresponding read data word to the at least one of the number of users via the data bus during said read operation;
b. determining if said corresponding read data word has an error therein by performing a parity check thereon;
c. interrupting the execution of the program in the at least one of the number of users, temporarily;
d. storing said read address thereby identifying the location of said error within the memory;
e. determining if said error is a fatal error or a non-fatal error by writing and reading a number of test patterns to said read address;
f. aborting the execution of the program in the at least one of the number of users if said determining step (e) determines that said error is a fatal error; and
g. reloading the contents of said read address of the memory if said determining step (e) determines that said error is a non-fatal error, and allowing the at least one of the number of users to continue executing the program thereafter.

24. A method according to claim 23 wherein at least one of the number of users periodically reads a predetermined number of the number of addresses to determine if an error exists therein.

25. A method according to claim 23 wherein said reloading step (g) reloads a predetermined number of the number of addresses including the read address.

26. A data processing system having a memory module, the memory module having a number of address locations wherein a number of data elements are stored in the number of address locations, comprising:
a. storage means capable of storing a number of data elements;
b. interface means coupled to the memory module and further coupled to said storage means for providing an interface between the memory module and said storage means, said data processing system performing a transfer operation thereby transferring a selected number of the number of data elements between the memory and said storage means via said interface means;
c. error detecting means coupled to the memory for detecting an error within the number of data elements that are transferred between the memory and said storage means during said transfer operation;
d. capture means coupled to said error detecting means for capturing a corresponding one of the number of address locations which corresponds to the data element that has said error therein;
e. interrupting means coupled to said error detecting means and further coupled to the data processing system for temporarily interrupting the execution of said transfer operation when said error detecting means detects said error;
f. testing means coupled to said error detecting means for writing and reading a predetermined number of test patterns to the corresponding one of the number of address locations captured by said capture means, thereby determining whether said error is a fatal error;
g. reloading means coupled to said testing means and further coupled to said interrupting means for reloading a correct copy of the data element containing said error, into the corresponding one of the number of address locations captured by said capture means, if said error is not a fatal error;
h. enabling means coupled to said testing means and further coupled to the data processing system for enabling the transfer operation if said testing means determines that said error is not a fatal error; and
i. aborting means coupled to said testing means for aborting the execution of the transfer operation if said testing means determines that said error is a fatal error.

27. A data processing system having a memory module, the memory module having a number of address locations wherein a number of data elements are stored in the number of address locations, comprising:
a. at least one primary power source coupled to the data processing system for providing power to the data processing system;
b. a detecting circuit coupled to said at least one primary power source for detecting a degradation in any of said at least one primary power source;
c. at least one secondary power source coupled to the data processing system and further coupled to said detecting circuit for providing power to the data processing system when said detecting circuit detects a degradation in any of said at least one primary power source;
d. at least one disk drive coupled to said at least one primary power source and further coupled to said at least one secondary power source, said at least one disk drive capable of storing a number of data elements;
e. interface means coupled to the memory module and further coupled to said at least one disk drive for providing an interface between the memory module and said at least one disk drive, said data processing system performing a downloading operation thereby downloading the number of data elements stored in the number of address locations in the memory to said at least one disk drive via said interface means, when said detecting circuit detects a degradation in any of said at least one primary power source;
f. error detecting means coupled to the memory for detecting an error within the number of data elements that are downloaded to said at least one disk drive during said downloading operation;
g. storing means coupled to said error detecting means for storing a corresponding one of the number of address locations which corresponds to the data element that has said error therein;
h. interrupting means coupled to said error detecting means and further coupled to the data processing system for temporarily interrupting the execution of the downloading operation when said error detecting means detects said error;
i. testing means coupled to said error detecting means for writing and reading a predetermined number of test patterns to the corresponding one of the number of address locations stored by said storing means, thereby determining whether said error is a fatal error;
j. reloading means coupled to said testing means and further coupled to said interrupting means for reloading a correct copy of the data element containing said error into the corresponding one of the number of address locations stored by said storing means, if said error is not a fatal error;
k. enabling means coupled to said testing means and further coupled to the data processing system for enabling the downloading operation if said testing means determines that said error is not a fatal error; and
l. aborting means coupled to said testing means for aborting the execution of the downloading operation if said testing means determines that said error is a fatal error.

Description

CROSS REFERENCE TO CO-PENDING APPLICATIONS

The present application is related to U.S. patent application Ser. No. 08/396,951, filed Mar. 1, 1995, entitled "Method and Apparatus For Storing Computer Data After a Power Failure", which is assigned to the assignee of the present invention and is incorporated herein by reference.

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention generally relates to general purpose digital data processing systems and more particularly relates to such systems which utilize error detection and correction schemes therein.

2. Description of the Prior Art

A key design element of high reliability computer systems is that of error detection and correction. It has long been recognized that the integrity of the data bits within the computer system is critical to ensure the accuracy of operations performed in the data processing system. The alteration of a single data bit in a data word can dramatically affect arithmetic calculations or can change the meaning of a data word as interpreted by other sub-systems within the computer system.

The cause of an altered data bit may be traced to either a "soft error" or a "hard error" within a memory element. Soft errors are not permanent in nature and may be caused by alpha particles, electromagnetic radiation, random noise, or other non-destructive events. Soft errors are often referred to as bit-flips, indicating that a bit has inadvertently been flipped from a one to a zero or vice versa. Hard errors, on the other hand, are permanent in nature and are often referred to as stuck-at faults. Typically, a hard error may be caused by a manufacturing defect in a memory element or by some other destructive event such as a voltage spike.
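By way of illustration only, the difference between a soft error and a hard error may be sketched in software. The following Python sketch assumes a toy 8-bit memory model; the class FaultyMemory, the function is_hard_error, and the four test patterns are hypothetical names and values chosen for this example and are not taken from the present disclosure. It follows the general approach recited in the claims of writing and reading test patterns to the suspect address location.

```python
class FaultyMemory:
    """Toy 8-bit memory (hypothetical model).

    `stuck` maps an address to (mask, value): bits set in `mask` are
    stuck at the corresponding bits of `value`, modeling a hard error.
    """

    def __init__(self, stuck=None):
        self.cells = {}
        self.stuck = stuck or {}

    def write(self, address, word):
        mask, value = self.stuck.get(address, (0, 0))
        # Stuck bits ignore the written word and keep their forced value.
        self.cells[address] = (word & ~mask & 0xFF) | (value & mask)

    def read(self, address):
        return self.cells.get(address, 0)


# Assumed patterns: all zeros, all ones, and two alternating patterns.
TEST_PATTERNS = (0x00, 0xFF, 0x55, 0xAA)


def is_hard_error(memory, address):
    """Write and read back each test pattern; any mismatch means a hard error."""
    for pattern in TEST_PATTERNS:
        memory.write(address, pattern)
        if memory.read(address) != pattern:
            return True
    return False


# Bit 2 of address 0x20 is stuck at zero; address 0x10 is healthy.
mem = FaultyMemory(stuck={0x20: (0b00000100, 0)})
print(is_hard_error(mem, 0x10))  # False
print(is_hard_error(mem, 0x20))  # True
```

Note that the test patterns overwrite the tested location, so a location found to be free of hard errors must afterwards be reloaded with a correct copy of its data.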

One method for performing error detection is to associate an additional bit, called a "parity bit", along with the binary bits comprising a data word. The data word may comprise data, an instruction, an address, etc. Parity involves summing without carry the bits representing a "one" within a data word and providing an additional "parity bit" so that the total number of "ones" across the data word, including the added parity bit, is either odd or even. The term "Even Parity" refers to a parity mechanism which provides an even number of ones across the data word including the parity bit. Similarly, the term "Odd Parity" refers to a parity mechanism which provides an odd number of ones across the data word including the parity bit.
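The parity computation described above may be sketched as follows. This is a minimal Python illustration; the function name parity_bit and the example data word are assumptions made for this sketch only.

```python
def parity_bit(word: int, even: bool = True) -> int:
    """Return the parity bit for a non-negative integer data word."""
    ones = bin(word).count("1")   # number of one-bits in the data word
    bit = ones % 2                # summing without carry: 1 if the count is odd
    # Even parity makes the total count of ones even; odd parity inverts the bit.
    return bit if even else bit ^ 1


word = 0b1011001                      # contains four ones
print(parity_bit(word, even=True))    # 0: the total number of ones stays at four
print(parity_bit(word, even=False))   # 1: the total number of ones becomes five
```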

A typical system which uses parity as an error detection mechanism has a parity generation circuit for generating the parity bit. For example, when the system stores a data word into memory, the parity generation circuit generates a parity bit from the data word and the system stores both the data word and the corresponding parity bit into an address location in a memory. When the system reads the address location where the data word is stored, both the data word and the corresponding parity bit are read from the memory. The parity generation circuit then regenerates the parity bit from the data bits read from the memory device and compares the regenerated parity bit with the parity bit that is stored in memory. If the regenerated parity bit and the original parity bit do not compare, an error is detected and the system is notified.
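The store and read sequence described above may be sketched as follows. The class ParityMemory and the exception ParityError are hypothetical names used only for this illustration, and even parity is assumed.

```python
class ParityError(Exception):
    """Raised when the regenerated parity bit does not match the stored one."""


class ParityMemory:
    def __init__(self):
        self.cells = {}   # address -> (data word, stored parity bit)

    @staticmethod
    def _generate(word):
        # Even parity: 1 if the word holds an odd number of ones.
        return bin(word).count("1") % 2

    def write(self, address, word):
        # Store the data word together with its generated parity bit.
        self.cells[address] = (word, self._generate(word))

    def read(self, address):
        word, stored = self.cells[address]
        # Regenerate parity from the data bits read and compare.
        if self._generate(word) != stored:
            raise ParityError(f"parity mismatch at address {address:#x}")
        return word


mem = ParityMemory()
mem.write(0x10, 0b1011001)
print(mem.read(0x10) == 0b1011001)    # True: no error

# Simulate a bit-flip in the stored data word (the parity bit is untouched).
word, parity = mem.cells[0x10]
mem.cells[0x10] = (word ^ 0b100, parity)
try:
    mem.read(0x10)
except ParityError as exc:
    print("detected:", exc)
```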

It is readily known that a single parity bit in conjunction with a multiple bit data word can detect a single bit error within the data word. However, it is also readily known that a single parity bit in conjunction with a multiple bit data word can be defeated by multiple errors within the data word. As calculation rates increase, circuit sizes decrease, and voltage levels of internal signals decrease, the likelihood of multiple errors within a data word increases. Therefore, methods to detect multiple errors within a data word are essential.
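The limitation of a single parity bit may be demonstrated with a short sketch. The helper names even_parity and passes_check and the example data word are assumptions made for this illustration.

```python
def even_parity(word: int) -> int:
    """Even parity bit: 1 if the word holds an odd number of ones."""
    return bin(word).count("1") % 2


def passes_check(word: int, stored_parity: int) -> bool:
    """True if the regenerated parity bit matches the stored parity bit."""
    return even_parity(word) == stored_parity


original = 0b10110010
stored = even_parity(original)

single = original ^ 0b00000100   # one bit flipped
double = original ^ 0b00010100   # two bits flipped

print(passes_check(single, stored))   # False: the single error is detected
print(passes_check(double, stored))   # True: the double error goes undetected
```

Flipping any even number of bits leaves the count of ones with the same odd or even character, so the regenerated parity bit still matches the stored parity bit.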

System designers have developed methods for detecting multiple errors within multiple bit data words by providing multiple parity bits for each multiple bit data word. Although this technique has been successfully used, it significantly increases the overhead required to perform error detection because the parity generation circuit is more complex and the additional parity bits must be stored along with each data word. It can readily be seen that each additional parity bit that is included within a system adds a significant amount of overhead to the system.

Parity generation techniques are also used to perform error correction within a data word. Error correction is typically performed by encoding the data word to provide error correction code bits that are stored along with the bits of the data word. Upon readout, the data bits read from the addressable memory location are again subject to the generation of the same error correction code signal pattern. The newly generated pattern is compared to the error correction code signals stored in memory. If a difference is detected, it is determined that the data word is erroneous. Depending on the encoding system utilized it is possible to identify and correct the bit position in the data word indicated as being incorrect. The system overhead for the utilization of error correction code signals is substantial. The overhead includes the time necessary to generate the error correction codes, the memory cells necessary to store the error correction code bits for each corresponding data word, and the time required to perform the decode when the data word is read from memory. These represent disadvantages to the error correction code system.
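The source does not name a particular error correction code. Purely as an illustration, the following sketch uses the well-known Hamming(7,4) code, in which three check bits protect four data bits and a non-zero syndrome gives the position of a single erroneous bit. The function names are assumptions introduced here.

```python
from typing import List, Tuple


def hamming74_encode(nibble: int) -> List[int]:
    """Encode 4 data bits into a 7-bit codeword (positions 1..7)."""
    d = [(nibble >> i) & 1 for i in range(4)]  # d[0] is the least significant bit
    code = [0] * 8                             # index 0 is unused
    code[3], code[5], code[6], code[7] = d     # data bits
    code[1] = code[3] ^ code[5] ^ code[7]      # check bit covering 1,3,5,7
    code[2] = code[3] ^ code[6] ^ code[7]      # check bit covering 2,3,6,7
    code[4] = code[5] ^ code[6] ^ code[7]      # check bit covering 4,5,6,7
    return code[1:]


def hamming74_decode(bits: List[int]) -> Tuple[int, int]:
    """Return (corrected data, syndrome); syndrome 0 means no error seen."""
    code = [0] + list(bits)
    syndrome = 0
    for position in range(1, 8):
        if code[position]:
            syndrome ^= position  # XOR of the positions holding a one
    if syndrome:
        code[syndrome] ^= 1       # correct the single bit in error
    d = [code[3], code[5], code[6], code[7]]
    return sum(bit << i for i, bit in enumerate(d)), syndrome


codeword = hamming74_encode(0b1011)
codeword[4] ^= 1                   # corrupt position 5 (list index 4)
print(hamming74_decode(codeword))  # (11, 5): data recovered, error at position 5
```

The extra check bits and the encode and decode steps are the storage and time overhead referred to above.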

Error detection schemes may be used on various internal nodes of a computer system. That is, for high reliability computer systems, many of the data paths within the computer system may have an error detection scheme incorporated therein. However, because of the relatively high overhead cost associated with multiple bit error detection, usually only a limited number of parity bits or the like may be provided. Further, because of the relatively high overhead cost associated with error correction schemes, only the most critical data paths may utilize such schemes. Finally, because error correction schemes may degrade the performance of a corresponding data path of the computer system, the use of such schemes is often precluded on time critical data paths.

In addition to the above referenced limitations, typical error detection schemes cannot determine the source or nature of an error. Rather, error detection schemes typically only identify that an error exists on a corresponding bus. Under some circumstances, it may be important to identify the underlying hardware element that is the source of the error and also identify the nature of the fault. For example, if an error is detected in a microcode instruction of a computer system, it may be important to determine the hardware source of the error and whether the error is fatal. An error may be considered fatal if the error cannot be corrected without aborting the operation of the computer system. For the example described above, the source of the error may be a memory device and the nature of the error may be a soft error or a hard error. A soft error may be corrected during the operation of the computer system by simply over-writing the correct data to the corrupted memory location, and therefore a soft error may not be deemed to be fatal. However, a hard error within the memory element cannot be corrected without aborting the operation of the computer system and replacing the memory element, and therefore a hard error may be deemed to be a fatal error.

As can be seen, a number of otherwise non-fatal errors may be deemed to be fatal because the source and nature of the error cannot be identified during the operation of the computer system. That is, because none of the prior art error detection schemes provide a mechanism for identifying the source and nature of an error during the operation of the computer system, the system may abort when it would otherwise not be necessary. The use of prior art error detection schemes may, therefore, require a computer system to assume that a non-fatal error is a fatal error, in order to preserve the integrity of the data base. For example, any error detected in a microcode word may be considered fatal, even if the error is a soft error within a memory wherein the soft error may be corrected by simply writing a correct microcode word to the corrupted memory location. Any further error analysis may be performed by a support controller, but only after the operation of the computer system is aborted. As can readily be seen, this may limit the overall reliability and performance of the corresponding computer system.

SUMMARY OF THE INVENTION

The present invention overcomes many of the disadvantages of the prior art by providing a method and apparatus for identifying the source and nature of an error, without aborting the operation of the computer system. In one embodiment of the present invention, the source of the error may be a hardware element and the nature of the error may be identified as either fatal or non-fatal. If the nature of the error is considered non-fatal, the present invention may correct the error and continue the operation of the computer system. This may allow detected errors to be handled immediately after they are detected, rather than aborting the operation of the computer system and waiting for a support controller to analyze the error. This may significantly enhance the reliability and performance of a corresponding computer system, which may be especially important during time critical operations. Further, since the operation of the computer system may be aborted a fewer number of times, the present invention may minimize the amount of data loss. This may be particularly important for high reliability computer applications, including banking applications and airline reservation applications, where the integrity of the data base is of the utmost importance.

In an exemplary embodiment of the present invention, an error detection and test block may be coupled to an address bus and a data bus of a memory element. The memory element may be coupled to a number of users wherein the number of users may write and read data to/from the memory via the address bus and the data bus. The error detection and test block may monitor the data bus during predetermined read operations of the memory element. If an error is detected, the error detection and test block may temporarily interrupt the operation of the computer system and store the corresponding read address. By storing the corresponding read address, the location of the error is identified. Thereafter, the error detection and test block may write and read a number of predetermined test patterns to the read address, and/or a predetermined range of read addresses, thereby determining if the error was caused by a soft error or a hard error.

If the error detection and test block determines that the error was caused by a soft error, the support controller may reload the contents of the memory location which corresponds to the read address, and/or the predetermined range of read addresses. It is also contemplated that the entire contents of the memory may be reloaded, or even all locations on a corresponding card. As stated above, a soft error may be considered non-fatal. Thereafter, the operation of the computer system may be resumed. If the error detection and test block determines that the error was caused by a hard error, the error may be considered fatal and the operation of the computer system may be aborted.
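The decision described in the two preceding paragraphs may be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the memory is modeled as a Python list, `golden_image` stands in for the correct contents held by the support controller, and `is_hard_error` is a hypothetical callable representing the test pattern sequence.

```python
from enum import Enum
from typing import Callable, List


class Outcome(Enum):
    RESUMED = "resumed"   # non-fatal: contents reloaded, operation continues
    ABORTED = "aborted"   # fatal: operation of the system is aborted


def handle_read_error(
    memory: List[int],
    bad_address: int,
    golden_image: List[int],
    is_hard_error: Callable[[List[int], int], bool],
    reload_range: int = 0,
) -> Outcome:
    """Classify the error at `bad_address`, then reload or abort.

    All parameter names are illustrative assumptions.
    """
    # The caller is assumed to have temporarily interrupted the users
    # and to have captured the read address that produced the error.
    if is_hard_error(memory, bad_address):
        return Outcome.ABORTED

    # Soft error: reload the bad location and, optionally, a range of
    # neighboring addresses, since testing may have overwritten them.
    low = max(0, bad_address - reload_range)
    high = min(len(memory) - 1, bad_address + reload_range)
    for address in range(low, high + 1):
        memory[address] = golden_image[address]
    return Outcome.RESUMED


golden = [0x11, 0x22, 0x33, 0x44]
memory = [0x11, 0x22, 0x3B, 0x44]  # address 2 holds a corrupted word
result = handle_read_error(memory, 2, golden, lambda mem, addr: False)
print(result, memory == golden)
```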

The exemplary embodiment may allow memory parity errors to be handled immediately after they occur, rather than aborting the operation of the computer system and waiting for a support controller to analyze the error. As discussed above, this may significantly enhance the reliability and performance of a corresponding computer system. To further help ensure system reliability, a system may periodically read and/or write each memory location within the memory wherein the present invention may perform error detection thereon. If an error is detected, the present invention may then determine the source and nature of the error as described above.

In another exemplary embodiment of the present invention, an error detection and test block may be used in conjunction with a system which downloads data elements from a cache memory to a disk drive, under limited battery backup power. In the exemplary system, which is described in more detail in the above referenced co-pending patent application which is incorporated herein by reference, a power failure of a primary power source may trigger a download operation of a cache memory. The download operation may be performed under a limited battery backup power source. In such a system, the download operation may be time critical because all of the data elements stored in the cache memory must be downloaded before the limited battery backup power source also fails. A block move instruction may be performed wherein the data may be downloaded from the cache memory, through a data save disk controller, across a DSD bus, through a SCSI controller, and finally to a number of SCSI disk drives. The block move instruction may be controlled by a microsequencer.

An error detection and test block may be coupled to the DSD bus wherein the DSD bus may be coupled to the data save disk controller, the SCSI controller and a memory. Further, the DSD bus may comprise a data bus and an address bus. A microsequencer may be coupled to the data save disk controller, and may request to read instructions and data from the memory via the DSD bus. Further, the data save disk controller and the SCSI controller may read instructions and data from the memory via the DSD bus. Under these circumstances, when a parity error is detected in a data word read from the memory, there may not be enough time to abort the current data transfer and allow a support controller to analyze the error. Because the limited battery backup power source may not have enough power to sustain two full data transfers from the cache memory to the SCSI disk drives, it may be important that the current data transfer not be aborted. It is contemplated, however, that the current data transfer may be stopped, wherein a redundant host interface adapter may continue the current system transfers.

The error detection and test block may allow the source and nature of the error to be determined without the need to abort current system data transfers. If the nature of the error is non-fatal, the error may be corrected and the current system transfer may continue. However, if the nature of the error is determined to be fatal, the current system transfer may be aborted, and the data may be lost.

This embodiment may operate substantially the same as the previously described embodiments. That is, the error detection and test block may monitor the data bus of the memory during predetermined read operations. If an error is detected, the error detection and test block may temporarily interrupt the data transfer and may store the corresponding read address. By storing the corresponding read address, the location of the error is identified. Thereafter, the error detection and test block may write and read a number of predetermined test patterns to the read address, and/or a predetermined range of read addresses, thereby determining if the error was caused by a soft error or a hard error. If the error detection and test block determines that the error was caused by a soft error, the support controller may reload the contents of the memory location which corresponds to the read address, and/or the predetermined range of read addresses, and the operation of the computer system may be resumed. It is contemplated that the support controller may reload the entire memory or even all memory on a corresponding card. As stated above, a soft error may be considered non-fatal. If the error detection and test block determines that the error was caused by a hard error, the error may be considered fatal and the data transfer may be aborted. This may allow memory parity errors to be handled immediately after they occur, rather than aborting the data transfer and waiting for a support controller to analyze the error. As discussed above, this may significantly enhance the reliability and performance of a corresponding computer system. To further help ensure system reliability, the corresponding computer system may periodically read and/or write each memory location within the memory wherein the present invention may perform error detection thereon. If an error is detected, the present invention may then determine the source and nature of the error as described above.

While the above referenced embodiments refer to a memory element, it is contemplated that the present invention may be equally applicable to other hardware elements including gates, processors, busses, I/O buffers, etc. That is, the present invention may isolate the source of an error and may further determine the nature of the error, while not requiring the operation of the computer system to be aborted. These alternative embodiments are also deemed to be within the scope of the present invention.

BRIEF DESCRIPTION OF THE DRAWINGS

Other objects of the present invention and many of the attendant advantages of the present invention will be readily appreciated as the same becomes better understood by reference to the following detailed description when considered in connection with the accompanying drawings, in which like reference numerals designate like parts throughout the figures thereof and wherein:

FIG. 1 is a block diagram of an exemplary computer system incorporating an error detection and test block in accordance with the present invention;

FIG. 2 is a block diagram of another exemplary computer system incorporating an error detection and test block in accordance with the present invention;

FIG. 3 is a block diagram of the exemplary computer system of FIG. 2, showing an exemplary implementation of the error detection and test block;

FIG. 4 is a schematic diagram of an exemplary implementation of the error detect block of FIG. 3;

FIG. 5 is a block diagram of an exemplary implementation of the test block of FIG. 3;

FIG. 6 is a block diagram of an exemplary computer system which may incorporate the present invention;

FIG. 7 is a schematic diagram of an exemplary embodiment of the host interface adapter block;

FIG. 8 is a partial schematic diagram of the host interface adapter block detailing the data save disk interface;

FIG. 9A is a block diagram of the Data Save Disk Controller (DSDC) shown in FIGS. 7-8;

FIG. 9B is a block diagram showing applicable portions of the Address and Recognition Logic block of FIG. 9A;

FIGS. 10A-10B comprise a table illustrating an exemplary bus description of the DSD bus of FIG. 8;

FIG. 11 is a table illustrating an exemplary address format for the address field of the DSD bus of FIG. 8;

FIG. 12 is a timing diagram illustrating an exemplary read cycle on the DSD bus wherein the NCR chip is the master and the DSDC device is the slave;

FIG. 13 is a timing diagram illustrating an exemplary read cycle on the DSD bus wherein the NCR chip is the master and the SRAM device is the slave;

FIG. 14 is a timing diagram illustrating an exemplary read and write cycle on the DSD bus wherein the DSDC device is the master and the NCR chip is the slave;

FIG. 15 is a block diagram of the exemplary computer system shown in FIG. 6 through FIG. 14 which incorporates an exemplary embodiment of the present invention;

FIG. 16 is a schematic diagram showing another exemplary implementation of the error detect block of FIG. 15;

FIG. 17 is a flow diagram showing a first exemplary method of the present invention;

FIG. 18 is a flow diagram showing a second exemplary method of the present invention;

FIG. 19 is a flow diagram showing a third exemplary method of the present invention; and

FIG. 20 is a flow diagram showing a fourth exemplary method of the present invention.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

FIG. 1 is a block diagram of an exemplary computer system incorporating an error detection and test block in accordance with the present invention. The block diagram is generally shown at 10. A computer system 14 may have an error detection and test block 12 therein. Error detection and test block 12 may identify the source and nature of an error within computer system 14, without aborting the operation of computer system 14. In one embodiment of the present invention, the source of the error may be a hardware element within computer system 14, and the nature of the error may be identified as either fatal or non-fatal. If the nature of the error is considered non-fatal, the exemplary embodiment may correct the error and continue the operation of computer system 14. This may allow detected errors to be handled immediately after they occur, rather than having to abort the operation of computer system 14, and wait for a support controller (not shown) or the like to analyze the error. This may significantly enhance the reliability and performance of computer system 14, which may be especially important during time critical operations. Further, since the operation of computer system 14 may be aborted a fewer number of times, the exemplary embodiment may minimize the amount of data loss. This may be particularly important for high reliability computer applications, including banking applications and airline reservation applications, where the integrity of the data base is of the utmost importance.

FIG. 2 is a block diagram of another exemplary computer system incorporating an error detection and test block in accordance with the present invention. The block diagram is generally shown at 20. In the exemplary embodiment, an error detection and test block 22 may be coupled to an address bus 28 and a data bus 26. Address bus 28 and data bus 26 may be coupled to a memory 24 and a number of users 30,32 and may provide an interface therebetween. Although only a first user 30 and an Nth user 32 are shown, it is contemplated that any number of users may be provided. Further, the number of users 30,32 may comprise instruction processors, microsequencers, or any other device which may be coupled to a memory. The number of users 30,32 may write and read data to/from memory 24 via address bus 28 and data bus 26. Error detection and test block 22 may monitor data bus 26 during predetermined read operations of memory element 24. If an error is detected, error detection and test block 22 may temporarily interrupt the operation of the number of users via interface 34, and store the corresponding read address for analysis. By storing the corresponding read address, the location of the error may be identified. Thereafter, error detection and test block 22 may write and read a number of predetermined test patterns to the read address of memory 24 via address bus 28 and data bus 26, thereby determining if the error was caused by a soft error or a hard error. It is further contemplated that the number of predetermined test patterns may be read and written to a predetermined range of read addresses to help isolate the cause of the error. In the exemplary embodiment, if the same error exists after writing and reading a number of test patterns to the "bad" read address of memory 24, the error is assumed to be a hard error.

If error detection and test block 22 determines that the error was caused by a soft error within memory 24, error detection and test block 22 may reload the contents of memory 24. It is also contemplated that a support controller or the like (not shown) may perform the reload function. In an exemplary embodiment, error detection and test block 22 may only reload the "bad" read address location, and/or the predetermined range of read addresses, rather than the entire contents of memory 24. As stated above, a soft error may be considered non-fatal. Thereafter, the operation of the number of users 30,32 may be resumed via interface 34. If error detection and test block 22 determines that the error was caused by a hard error, the error may be considered fatal and the operation of the number of users 30,32 may be aborted via interface 34.

The exemplary embodiment may allow errors detected in memory 24 to be handled immediately after they are detected, rather than aborting the operation of the number of users 30,32 and waiting for a support controller (not shown) to analyze the error. As discussed above, this may significantly enhance the reliability and performance of a corresponding computer system. To further help ensure system reliability, a system (not shown) may periodically read and/or write each memory location within the memory wherein the present invention may perform error detection thereon. If an error is detected, the present invention may then determine the source and nature of the error as described above.

FIG. 3 is a block diagram of the exemplary computer system of FIG. 2, showing an exemplary implementation of the error detection and test block. Error detection and test block 22 may comprise an error detect block 54 and a test block 56. Error detect block 54 may be coupled to address bus 28 and data bus 26. Error detect block 54 may monitor data provided to data bus 26 and detect any errors thereon. In an exemplary embodiment, error detect block 54 may monitor all read operations performed by the number of users 30,32 on memory 24. When an error is detected on data bus 26, error detect block 54 may store the read address which is present on address bus 28. In this way, the location of the error within memory 24 may be identified. That is, the particular address location within memory 24 which produced the fault on data bus 26 may be identified. Error detect block 54 may then provide the "bad" read address and an error signal to test block 56 via interfaces 58 and 60, respectively.

Test block 56 may write and read a number of predetermined test patterns to the "bad" read address of memory 24, via interfaces 62 and 64. It is further contemplated that test block 56 may write and read the number of predetermined test patterns to a predetermined range of read addresses to help isolate the error. Interfaces 62 and 64 may be coupled to data bus 26 and address bus 28, respectively. By writing and reading a number of predetermined test patterns to the "bad" read address of memory 24, test block 56 may determine if the error was caused by a soft error or a hard error. The predetermined patterns may include a parity pattern, a checkerboard pattern, an all zeros pattern, an all ones pattern, etc., or any combination thereof. In the exemplary embodiment, if the same error exists after writing and reading a number of test patterns to the "bad" read address of memory 24, the error is assumed to be a hard error.
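The write and read test described above may be sketched as follows. The sketch assumes an 8-bit word and models a hard error as a stuck-at bit; the class `FaultyMemory` and the functions `make_test_patterns` and `classify` are hypothetical. The "parity pattern" named in the text is not defined in this section and is therefore omitted.

```python
from typing import List


class FaultyMemory:
    """Illustrative memory model in which chosen bits may be stuck."""

    def __init__(self, size: int, width: int = 8):
        self.width = width
        self.mask = (1 << width) - 1
        self.cells = [0] * size
        self.stuck_high = {}  # address -> mask of bits forced to one
        self.stuck_low = {}   # address -> mask of bits forced to zero

    def write(self, address: int, word: int) -> None:
        word &= self.mask
        word |= self.stuck_high.get(address, 0)
        word &= ~self.stuck_low.get(address, 0)
        self.cells[address] = word

    def read(self, address: int) -> int:
        return self.cells[address]


def make_test_patterns(width: int) -> List[int]:
    """All zeros, all ones, checkerboard and inverse checkerboard."""
    mask = (1 << width) - 1
    checkerboard = int("10" * (width // 2), 2)  # assumes an even width
    return [0, mask, checkerboard, checkerboard ^ mask]


def classify(memory: FaultyMemory, bad_address: int) -> str:
    """Return "hard" if any pattern fails to read back, else "soft"."""
    for pattern in make_test_patterns(memory.width):
        memory.write(bad_address, pattern)
        if memory.read(bad_address) != pattern:
            return "hard"
    return "soft"


mem = FaultyMemory(8)
mem.cells[2] = 0x12            # transient corruption only
mem.stuck_high[5] = 0x04       # bit 2 of address 5 is stuck at one
print(classify(mem, 2), classify(mem, 5))
```

Because the test overwrites the location under test, the reload step that follows a soft error classification also restores the original contents.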

If test block 56 determines that the error was caused by a soft error within memory 24, test block 56 may reload the contents of memory 24 via interfaces 62 and 64. It is contemplated that a support controller or the like (not shown) may perform the reload function. In an exemplary embodiment, test block 56 may only reload the "bad" read address location, and/or the predetermined range of read addresses, rather than the entire contents of memory 24. It is contemplated, however, that the entire contents of memory 24 may be reloaded, or even all devices on a corresponding card. As stated above, a soft error may be considered non-fatal. Thus, test block 56 may enable the number of users 30,32 to resume operation via interface 34. If test block 56, however, determines that the error was caused by a hard error, the error may be considered fatal and the test block 56 may abort the operation of the number of users 30,32 via interface 34.

The exemplary embodiment may allow errors detected in memory 24 to be handled immediately after they are detected, rather than aborting the operation of the number of users 30,32 and waiting for a support controller (not shown) to analyze the error. As discussed above, this may significantly enhance the reliability and performance of a corresponding computer system. To further help ensure system reliability, a system (not shown) may periodically read and/or write each memory location within the memory wherein the present invention may perform error detection thereon. If an error is detected, the present invention may then determine the source and nature of the error as described above.

FIG. 4 is a schematic diagram of an exemplary implementation of the error detect block of FIG. 3. The schematic diagram is generally shown at 100. The error detect block of FIG. 3 is generally shown at 54. As indicated above, error detect block 54 may be coupled to address bus 28 and data bus 26. Error detect block 54 may monitor data words provided to data bus 26 and detect any errors therein. In an exemplary embodiment, error detect block 54 may monitor all read operations performed by the number of users 30,32 on memory 24. When an error is detected on data bus 26, error detect block 54 may store the corresponding read address which is present on address bus 28. In this way, the location of the error within memory 24 may be identified. That is, the particular address location within memory 24 which produced the error may be identified. Error detect block 54 may then provide the "bad" read address and an error signal to test block 56 via interfaces 58 and 60, respectively.

In the exemplary embodiment, a decoder 102 may be coupled to address bus 28. Decoder 102 may monitor address bus 28 and determine if a read operation is being performed on memory 24. In an exemplary embodiment, address bus 28 may comprise a number of bits which may indicate if a read operation is being performed on memory 24. For example, referring to FIG. 11, slave select bits 972 may indicate if memory 24 is currently being accessed. Further, R/W field 970 may indicate if a read operation or a write operation is being performed thereon. Decoder 102 may decode these bits and assert interface 110 when a read operation of memory 24 is being performed. During each bus cycle, the corresponding address on address bus 28 may be latched into a register 104.

A register 108 may be coupled to data bus 26. Register 108 may be enabled by decoder 102 via interface 110. As indicated above, interface 110 may be asserted by decoder 102 when a read operation is being performed on memory 24. Register 108 may store the corresponding data word on data bus 26 whenever a read operation is being performed on memory 24. Similarly, a register 106 may latch the output of decoder 102, thereby enabling latch 116 via interface 107. That is, when decoder 102 detects a read operation of memory 24, register 106 stores the value on interface 110, thereby causing latch 116 to go transparent. Meanwhile, register 108 provides the corresponding data word to parity check block 112 via interface 114. Parity check block 112 may check the parity of the data word. Although a parity error detection technique is used in the exemplary embodiment, it is contemplated that any other error detection means may be used. If an error is detected by parity check block 112, an error signal may be provided to latch 116 via interface 118. Since register 106 has asserted the enable input of latch 116 via interface 107, latch 116 is transparent and the error signal may be provided to error line 60.

The error signal may also be provided to an enable input of a register 120 via interface 122. Register 120 may be coupled to register 104. When an error signal is provided to the enable input of register 120, the corresponding address stored in register 104 may be latched into register 120. That is, the corresponding address location of memory 24 may be latched into register 120 when an error is detected during a read operation of memory 24. The corresponding "bad" address may be provided to test block 56 via interface 58.

If a read operation is not detected by decoder 102, interface 110 is not asserted and the corresponding data word is not latched into register 108. Further, latch 116 is not enabled, thereby prohibiting the output of parity check block 112 from reaching interface 60.
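The behavior of the error detect block of FIG. 4 may be approximated in software as follows. This is a behavioral sketch only: the register timing is collapsed into a single call per bus cycle, even parity is assumed, and the slave select and R/W values, which in the exemplary embodiment are fields of the address, are passed as separate arguments. The constants `MEMORY_SELECT` and `RW_READ` are hypothetical encodings, since the format of FIG. 11 is not reproduced in this section.

```python
from typing import Optional

MEMORY_SELECT = 0b01  # hypothetical slave select code for memory 24
RW_READ = 1           # hypothetical encoding: 1 = read, 0 = write


class ErrorDetectBlock:
    """Behavioral model of decoder 102, registers 104-120 and latch 116."""

    def __init__(self) -> None:
        self.address_reg = 0                     # register 104
        self.data_reg = 0                        # register 108
        self.read_seen = False                   # register 106
        self.bad_address: Optional[int] = None   # register 120
        self.error = False                       # error line 60

    def bus_cycle(self, address: int, slave_select: int, rw: int,
                  data: int, parity: int) -> bool:
        # Register 104 latches the address on every bus cycle.
        self.address_reg = address
        # Decoder 102: assert only for a read operation of the memory.
        self.read_seen = slave_select == MEMORY_SELECT and rw == RW_READ
        parity_error = False
        if self.read_seen:
            # Register 108 captures the data word for the parity check.
            self.data_reg = data
            parity_error = (bin(data).count("1") + parity) % 2 != 0
        # Latch 116 passes the parity check result only during a read.
        self.error = self.read_seen and parity_error
        if self.error:
            # Register 120 captures the "bad" read address.
            self.bad_address = self.address_reg
        return self.error


block = ErrorDetectBlock()
block.bus_cycle(0x10, MEMORY_SELECT, RW_READ, 0b0011, 0)  # good parity
block.bus_cycle(0x20, MEMORY_SELECT, RW_READ, 0b0111, 0)  # bad parity
print(block.error, hex(block.bad_address))
```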

FIG. 5 is a block diagram of an exemplary implementation of the test block of FIG. 3. The block diagram is generally shown at 150. The test block of FIG. 3 is generally shown at 56. As indicated above, test block 56 may write and read a number of predetermined test patterns to the "bad" read address of memory 24, via interfaces 62 and 64. Interfaces 62 and 64 may be coupled to data bus 26 and address bus 28, respectively. By writing and reading a number of predetermined test patterns to the "bad" read address of memory 24, test block 56 may determine if the error was caused by a soft error or a hard error. As stated above, it is further contemplated that a predetermined range of read addresses may be written and read as described above. This may help isolate the cause of the error. The predetermined patterns may include a parity pattern, a checkerboard pattern, an all zeros pattern, an all ones pattern, etc., or any combination thereof. In the exemplary embodiment, if the same error exists after writing and reading a number of test patterns to the "bad" read address of memory 24, the error is assumed to be a hard error.

If test block 56 determines that the error was caused by a soft error within memory 24, test block 56 may reload the contents of memory 24 via interfaces 62 and 64. It is also contemplated that a support controller or the like (not shown) may reload the contents of memory 24. In an exemplary embodiment, test block 56 may only reload the "bad" read address location, rather than the entire contents of memory 24. However, it is contemplated that the entire contents of memory 24 may be reloaded, or even all devices on a corresponding card. As stated above, a soft error may be considered non-fatal. Thus, test block 56 may enable the number of users 30,32 to resume operation via interface 34. If test block 56, however, determines that the error was caused by a hard error, the error may be considered fatal and test block 56 may abort the operation of the number of users 30,32 via interface 34.

The exemplary embodiment may allow errors detected in memory 24 to be handled immediately after they are detected, rather than aborting the operation of the number of users 30,32 and waiting for a support controller (not shown) to analyze the error. As discussed above, this may significantly enhance the reliability and performance of a corresponding computer system.

Referring specifically to FIG. 5, test block 56 may have a control block 152, a test pattern generator block 158, a data register 164, an address register block 154, a data I/O buffer block 168, and an address I/O buffer block 172. Control block 152 may receive the "bad" address and an error signal from error detect block 54 via interfaces 58 and 60, respectively (see FIG. 4). Control block 152 may temporarily interrupt the operation of the number of users 30,32 via interface 34. Control block 152 may further initiate a test sequence of the "bad" address location by providing the "bad" address to test pattern generator block 158. Further, control block 152 may provide the "bad" address to address register 154 via interface 156.

Test pattern generator block 158 may perform a number of read and write operations to the "bad" address. For a write operation, test pattern generator block 158 may notify control block 152 that a write operation of memory 24 is desired, via interface 162. Control block 152 may then provide the "bad" address along with the necessary control bits to complete the requested write operation. For example, and referring to FIG. 11, control block 152 may provide the appropriate slave select bits 972 and may further provide the appropriate R/W bit 970 to effect the desired write operation.

Test pattern generator block 158 may then provide a data word to data register 164 via interface 166. The data word, the address, and the appropriate control signals may be provided to data bus 26 and address bus 28 via I/O buffer blocks 168 and 172. Memory 24 may then write the data word to the corresponding "bad" address location.

Test pattern generator block 158 may then perform a read operation of the "bad" address. For a read operation, test pattern generator block 158 may notify control block 152 that a read operation of memory 24 is desired. Control block 152 may then provide the "bad" address along with the necessary control bits to complete the requested read operation. For example, and referring to FIG. 11, control block 152 may provide the appropriate slave select bits 972 and may further provide the appropriate R/W bit 970 to effect the desired read operation.

Memory 24 may then read the data word from the corresponding "bad" address location. Memory 24 may provide the read data word to data register 164 via I/O buffer block 168. Data register 164 may then provide the read data word to test pattern generator block 158 via interface 166. Thereafter, test pattern generator block 158 may compare the data word that was written to the "bad" address of memory 24 with the data word that was read from the "bad" address of memory 24. In the exemplary embodiment, if the same error exists after writing and reading a data word to the "bad" read address of memory 24, the error is assumed to be a hard error.

Test pattern generator block 158 may write and read a number of test patterns to the "bad" address of memory 24. By writing and reading a number of predetermined test patterns to the "bad" read address of memory 24, test pattern generator block 158 may determine if the error was caused by a soft error or a hard error. The predetermined patterns may include a parity pattern, a checkerboard pattern, an all zeros pattern, an all ones pattern, etc., or any combination thereof.
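The write, read, and compare sequence described above may be sketched in software as follows. This is only an illustrative model, not the hardware of test block 56: the class FaultyMemory, the function classify_error, the 32-bit word width, and the particular pattern values are all assumptions introduced for the example.

```python
# Illustrative model only: names, word width, and pattern values are assumptions.
PATTERNS = [
    0x00000000,  # all zeros pattern
    0xFFFFFFFF,  # all ones pattern
    0x55555555,  # checkerboard pattern
    0xAAAAAAAA,  # inverse checkerboard pattern
]


class FaultyMemory:
    """Toy memory; `stuck` maps an address to (mask, value) for stuck bits."""

    def __init__(self, stuck=None):
        self.cells = {}
        self.stuck = stuck or {}

    def write(self, addr, word):
        self.cells[addr] = word & 0xFFFFFFFF

    def read(self, addr):
        word = self.cells.get(addr, 0)
        if addr in self.stuck:
            mask, value = self.stuck[addr]
            # Stuck bits always read back as `value`, whatever was written.
            word = (word & ~mask) | (value & mask)
        return word & 0xFFFFFFFF


def classify_error(memory, bad_addr, patterns=PATTERNS):
    """Return "hard" if any pattern fails to read back, otherwise "soft"."""
    for pattern in patterns:
        memory.write(bad_addr, pattern)
        if memory.read(bad_addr) != pattern:
            return "hard"
    return "soft"
```

In this sketch a location that accepts every pattern is treated as having suffered a soft error, so it may be reloaded and operation resumed, while a location with a stuck bit fails at least one pattern and is reported as a hard error.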

If test pattern generator block 158 determines that the error was caused by a soft error, test pattern generator block 158 may reload the contents of memory 24 via interfaces 62 and 64, as described above. It is contemplated that a support controller or the like (not shown) may perform the reload function rather than test pattern generator block 158. In an exemplary embodiment, test pattern generator block 158 may only reload the "bad" read address location, rather than the entire contents of memory 24. However, it is contemplated that the entire contents of memory 24 may be reloaded, or even all devices on a corresponding card. As stated above, a soft error may be considered non-fatal. Thus, control block 152 may enable the number of users 30,32 to resume operation via interface 34. If test pattern generator 158, however, determines that the error was caused by a hard error, the error may be considered fatal and control block 152 may abort the operation of the number of users 30,32 via interface 34.

The exemplary embodiment may allow errors detected in memory 24 to be handled immediately after they are detected, rather than aborting the operation of the number of users 30,32 and waiting for a support controller (not shown) to analyze the error. As discussed above, this may significantly enhance the reliability and performance of a corresponding computer system. To further help ensure system reliability, a system (not shown) may periodically read and/or write each memory location within the memory wherein the present invention may perform error detection thereon. If an error is detected, the present invention may then determine the source and nature of the error as described above.

FIG. 6 is a block diagram of an exemplary computer system which may incorporate the present invention. The block diagram is generally shown at 500. The XPC comprises an instruction processor 512, an I/O processor 516, a host disk storage 520, an outbound File Cache block 528, and a host main storage 510. Instruction processor 512 receives instructions from host main storage 510 via interface 514. Host main storage 510 is also coupled to MBUS 518. I/O processor 516 is coupled to MBUS 518 and is further coupled to host disk storage 520 via interface 522. In the exemplary embodiment, outbound File Cache block 528 is coupled to MBUS 518 through a first data mover 524 and a second data mover 526. Outbound File Cache block 528 may comprise two separate power domains including a power domain-A powered by a universal power source (UPS) and battery backup power source 562 via interface 564, and a power domain-B powered by a UPS power source and battery backup power source 566 via interface 568. The separation of power domain-A and power domain-B is indicated by line 560. UPS and battery backup blocks 562 and 566 may have a detection means therein to detect when a corresponding primary power source fails or becomes otherwise degraded.

Power domain-A of outbound file cache 528 may comprise a host interface adapter 534, a system interface block 536, and a portion of a nonvolatile memory 540. Host interface adapter 534 may be coupled to data mover 524 via fiber optic link 530 and may further be coupled to system interface block 536 via interface 538. System interface block 536 may be coupled to nonvolatile memory 540 via interface 542, as described above. Similarly, host interface adapter 544 may be coupled to data mover 526 via fiber optic link 532 and may further be coupled to system interface block 546 via interface 548. System interface block 546 may be coupled to nonvolatile memory 540 via interface 550, as described above.

The data may be transferred from the host disk storage 520 through I/O processor 516 to host main storage 510. But now, any updates that occur in the data are stored in nonvolatile memory 540 instead of host disk storage 520, at least momentarily. All future references then access the data in nonvolatile memory 540. Therefore, nonvolatile memory 540 acts like a cache for host disk storage 520 and may significantly increase data access speed. Only after the data is no longer needed by the system is it transferred back to host disk storage 520. Data movers 524 and 526 are used to transmit data from the host main storage 510 to the nonvolatile memory 540 and vice versa. In the exemplary embodiment, data movers 524 and 526 perform identical cache functions thereby increasing the reliability of the overall system. A more detailed discussion of the XPC system may be found in the above referenced co-pending application, which has been incorporated herein by reference.

In accordance with the present invention, a data save disk system 552 may be coupled to host interface adapter 534 via interface 554. Similarly, data save disk system 556 may be coupled to host interface adapter 544 via interface 558. Data save disk systems 552 and 556 may comprise SCSI type disk drives and host interface adapters 534 and 544, respectively, may provide a SCSI interface thereto. In this configuration, the data elements stored in nonvolatile memory 540 may be downloaded directly to the data save disk systems 552 and 556. This may permit computer system 500 to detect a power failure in a power domain, switch to a corresponding backup power source 562 or 566, and store all of the critical data elements stored in nonvolatile memory 540 on SCSI disk drives 552 or 556 before the corresponding backup power source 562 or 566 also fails.

The primary power sources may comprise a universal power source (UPS) available from the assignee of the present invention. The backup power sources may comprise a limited power source, like a battery. Typical batteries may provide power to a computer system for only a limited time. For some computer systems, a large battery or multiple batteries may be required to supply the necessary power. Further, because the power requirements of some computer systems are substantial, the duration of the battery source may be very limited. It is therefore essential that the critical data elements be downloaded to a corresponding data save disk system 552 or 556 as expediently as possible.

In the exemplary embodiment, backup power source 562 may only power a first portion of nonvolatile memory 540, host interface adapter 534, system interface 536, and data save disk system 552. Similarly, backup power source 566 may only power a second portion of nonvolatile memory 540, host interface adapter 544, system interface 546, and data save disk system 556. In this configuration, the remainder of computer system 500, including instruction processor 512, I/O processor 516, host main storage 510, and host disk storage 520, may not be powered after the primary power source fails. This may allow backup power sources 562 and 566 to remain active for a significantly longer period of time thereby allowing more data to be downloaded from nonvolatile memory 540. In this embodiment, host interface adapters 534 and 544 may have circuitry to support the downloading of the critical data elements to the SCSI disk drives 552 and 556, without requiring any intervention by instruction processor 512 or I/O processor 516.

Coupling data save disk systems 552 and 556 directly to host interface adapters 534 and 544, respectively, rather than to instruction processor 512 or I/O processor 516 may have significant advantages. As indicated above, it may be faster to download the data elements directly from nonvolatile memory 540 to data save disk systems 552 or 556, rather than providing all of the data to I/O processor 516 and then to host disk storage 520. Further, significant power savings may be realized by powering only the blocks in outbound file cache 528 and the corresponding data save disk systems 552 or 556, thereby allowing more data to be downloaded before a corresponding backup power source 562 or 566 fails. Finally, data save disk systems 552 and 556 may be dedicated to storing the data elements in nonvolatile memory 540 and thus may be appropriately sized.

In a preferred mode, once the data save operation has begun, it continues until all of the data in nonvolatile memory 540 has been transferred to the data save disk system. Thereafter, the data save disks are spun down and the outbound file cache 528 is powered down to minimize further drain on the battery backup power source. If the primary power source comes back on during the data save operation, the data save is still completed, but the outbound file cache 528 is not powered down. When primary power is restored, the operation of computer system 500 may be resumed beginning with a data restore operation, but only after the battery backup power source has been recharged to a level which could sustain another primary power source failure.

The data restore operation occurs after normal computer system 500 initialization, including power-up, firmware load, etc. However, before a data restore operation is allowed to begin, the presence of saved data on a corresponding data save disk must be detected. Prior to initiating the data restore operation, the USBC microcode (see FIG. 7) compares the present computer system 500 configuration with the configuration that was present when the data save operation was executed. If the two configurations are not an exact match, the data restore operation is not executed and an error is indicated.
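The two preconditions on the data restore operation may be sketched as a simple gating function. The function name may_restore, the representation of a configuration as a dictionary, and the returned status strings are assumptions made for illustration; the actual checks are performed by the USBC microcode.

```python
# Illustrative sketch: function name, config representation, and status
# strings are assumptions, not part of the described microcode.
def may_restore(saved_data_present, saved_config, current_config):
    """Return (allowed, reason) for starting a data restore operation."""
    if not saved_data_present:
        # Saved data must be detected on a data save disk first.
        return (False, "no saved data detected")
    if saved_config != current_config:
        # Anything short of an exact match blocks the restore and is an error.
        return (False, "error: configuration mismatch")
    return (True, "restore may begin")
```

For example, a configuration recorded at save time with one amount of nonvolatile memory would not match a present configuration with a different amount, so the restore would be refused and an error indicated.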

A data save disk set may be added to the outbound file cache 528 as a single or redundant configuration. A single data save set may save one copy of the nonvolatile memory 540 contents, and is used when there is only one Universal Power Source (UPS) 562 driving the outbound file cache 528 and data save disks. A redundant data save disk configuration may have two data save disk sets (as shown in FIG. 6) and may save two copies of the nonvolatile memory contents. In the redundant configuration, one set of data save disk drives may be powered from one UPS while the other set of data save disk drives may be powered by another UPS.

FIG. 7 is a schematic diagram of an exemplary embodiment of the host interface adapter block. For illustration, Host Interface Adapter (HIA) 534 of FIG. 6 is shown. It is recognized that HIA 544 may be similarly constructed. HIA 534 may comprise two Microsequencer Bus Controllers (USBC) 640, 642 which may be connected to a control store 644 via interface 646. The USBC's 640, 642 may access the HIA stations 628, 622, 618, and 636 via a micro bus 638. A player+0 602 and a player+1 600 may receive frames (or data elements) over fiber optic link 530. The term player+ refers to a fiber optic interface controller available from National Semiconductor which is called the Player Plus Chip Set. Player+0 602 may forward its frame to light pipe control 604 via interface 606. Similarly, player+1 600 may forward its frame to light pipe control 604 via interface 606. Light pipe control 604 may transfer the frames to a Receive Frame Transfer Facility (REC FXFA) 608 via interface 610. REC FXFA 608 may unpack the frames and may store control information in a Request Status Control Table-0 (RSCT-0) 628 and a RSCT-1 622 via interface 620. RSCT-0 628 and RSCT-1 622 may monitor the data that has been received from a corresponding data mover. The data which was contained in the frame received by REC FXFA 608 may be sent to the Database Interface (DBIF) station 618 via interface 620. DBIF 618 may forward the data over interface 632 to the streets.

Data received by the DBIF 618 from the streets via interface 548 may be sent to the Send Frame Transfer Facility (SEND FXFA) 612 via interface 626. Control information received via interface 630 may be sent to RSCT-0 628 and RSCT-1 622. SEND FXFA 612 may take the data and the control information provided by RSCT-0 628 and RSCT-1 622 via interface 624, and format a frame for transmission by light pipe control 604. Acknowledgements from REC FXFA 608 may be provided to SEND FXFA 612 via interface 616. The frame may be forwarded to light pipe control 604 via interface 614. Light pipe control 604 may create two copies of the frame received from SEND FXFA 612, and may provide a first copy to player+0 602 and a second copy to player+1 600 via interface 606. The frames may then be transmitted over the fiber optic links 530 to a corresponding data mover.

Referring back to control store 644, control store 644 may be used to store the instructions that are executed by USBC0 640 and USBC1 642. Control store 644, although in reality a RAM, is used as a read-only memory (ROM) during normal operation. Control store 644 may comprise seven (7) SRAM devices (not shown). Each SRAM device may hold 32 * 1024 (K) 8-bit bytes of data. Each unit of data stored in control store 644 may comprise 44 bits of instruction, 8 bits of parity for the instruction, and 2 bits of address parity.
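The figures given above for control store 644 can be checked with a short calculation. The sketch assumes, for illustration only, that the seven byte-wide SRAM devices are accessed in parallel so that together they hold one unit of data per address; the text does not state how the devices are organized, and all constant names are hypothetical.

```python
# Worked arithmetic; the parallel organization of the SRAMs is an assumption.
SRAM_COUNT = 7
SRAM_WIDTH_BITS = 8
SRAM_DEPTH = 32 * 1024          # 8-bit bytes held by each device

INSTRUCTION_BITS = 44
INSTRUCTION_PARITY_BITS = 8
ADDRESS_PARITY_BITS = 2

# Bits needed per stored unit versus bits available per address.
word_bits = INSTRUCTION_BITS + INSTRUCTION_PARITY_BITS + ADDRESS_PARITY_BITS
physical_bits = SRAM_COUNT * SRAM_WIDTH_BITS
spare_bits = physical_bits - word_bits

print(word_bits, physical_bits, spare_bits, SRAM_DEPTH)
```

Under that assumption each stored unit needs 54 bits, the seven devices supply 56 bits per address, and two bits per address are left unused, with room for 32,768 such units.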

Control store 644 may be loaded with instructions at system initialization by a support computer system through a maintenance path (not shown). The parity bits and address bits are computed by a host computer system and appended to each instruction as it is stored. Later, as USBC0 640 and USBC1 642 read and execute the instructions, each instruction is fetched from control store 644 and parity values are computed from it. Each USBC compares the parity values computed against the parity checks stored in control store 644. If there are any discrepancies, control store 644 is assumed to be corrupted and an internal check condition is raised in the corresponding USBC's.

USBC0 640 and USBC1 642 are special purpose microprocessors that execute instructions to monitor and control the transfer of data on micro bus 638. There are two USBC's in the system to ensure that all data manipulations are verified with duplex checking. USBC0 640 is considered to be the master while USBC1 642 is considered to be the slave. Only the master USBC0 640 drives the data on the micro bus 638, but both master USBC0 640 and slave USBC1 642 drive address and control signals to lower the loading on micro bus 638. The slave USBC1 642 may send the result of each instruction to the master USBC0 640 via interface 648. The master USBC0 640 may then compare this value to the result it computed. If the values are different, an internal check error condition is set and the program is aborted. A further discussion of the operation of HIA 534 may be found in the above referenced co-pending application, which is incorporated herein by reference.
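The duplex checking performed by the master and slave may be modeled in a few lines. This is a behavioral sketch only: the exception InternalCheckError, the function duplex_execute, and the representation of each USBC as a callable are assumptions introduced for the example.

```python
# Behavioral sketch; all names here are hypothetical.
class InternalCheckError(Exception):
    """Raised when the master and slave results disagree."""


def duplex_execute(instruction, master_unit, slave_unit):
    """Run one instruction on both units and return the verified result."""
    master_result = master_unit(instruction)
    slave_result = slave_unit(instruction)   # sent to the master for comparison
    if master_result != slave_result:
        # Any difference aborts the program rather than risk bad data.
        raise InternalCheckError(
            f"master={master_result!r} slave={slave_result!r}"
        )
    return master_result
```

Only the master's result is used once the comparison passes, which mirrors the arrangement in which only the master drives data onto the bus.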

In accordance with the present invention, a data save disk controller (DSDC) 636 may be coupled to micro bus 638 and may thus communicate with USBC0 640 and USBC1 642. DSDC 636 is further coupled to DBIF 618 via interfaces 634 and 626. DSDC 636 may receive data elements from DBIF 618 via interface 626 and may provide data elements to DBIF 618 via interface 634. DSDC 636 is further coupled to a DSD block 666 via a DSD bus 650. In the exemplary embodiment, DSDC 636 may be coupled to DSD block 666 via a DSD address bus 652, a DSD data bus 654, and a number of control signals. DSD block 666 may be coupled to a data save disk system 552 via interface 554. DSD block 666 may provide the interface function between DSDC 636 and data save disk system 552. A network interface module (NIM) 635 may be coupled to DSDC 636 via interface 633. NIM 635 may provide maintenance functions to DSDC 636, and to other elements within the system. USBC0 640 and USBC1 642 may control the operation of a download and/or upload operation between a nonvolatile memory 540 and data save disk system 552. This may include providing a timer function to delay the download and/or upload operation for a predetermined time period.

In this configuration, data save disk system 552 is directly coupled to nonvolatile memory 540 via DSD block 666, DSDC 636, DBIF 618, and system interface 536 (see FIG. 6). When a primary power source fails, the data elements stored in nonvolatile memory 540 may be downloaded directly to the data save disk system 552 without any intervention by an instruction processor 512 or I/O processor 516. This configuration may have a number of advantages. First, the speed at which the data elements may be downloaded from nonvolatile memory 540 to data save disk system 552 may be enhanced due to the direct coupling therebetween. Second, significant power savings may be realized because only HIA 534, data save disk system 552, system interface 536, and non-volatile memory 540 need to be powered by the secondary power source to effect the download operation. This may significantly increase the amount of time that the secondary power source may power the system thereby increasing the number of data elements that can be downloaded.

Similarly, once the primary power source is restored, data save disk system 552 may upload the data elements directly to nonvolatile memory via DSD block 666, DSDC 636, DBIF 618, and system interface block 536, without any assistance from an instruction processor 512 or I/O processor 516. This may provide a high speed upload link between data save disk system 552 and nonvolatile memory 540.

FIG. 8 is a partial schematic diagram of the host interface adapter block detailing the data save disk interface. DSD block 666 may comprise a memory 680, a disk controller 682, and a set of transceivers 684. A DSD bus 650 may couple DSDC 636, memory 680, and disk controller 682, and may comprise an address bus 652, and a data bus 654. DSD bus 650 may further comprise a number of disk controller control signals 651, and a number of memory control signals 653. DSD bus 650 may operate generally in accordance with a standard master/slave bus protocol wherein the DSDC 636, disk controller 682, and memory 680 may be slave devices, but only DSDC 636 and disk controller 682 may be master devices. That is, memory 680 may not be a master device in the exemplary embodiment.

Disk controller 682 may be coupled to transceivers 684 via interface 686. Transceivers 684 may be coupled to data save disk system 552 via interface 554. In a preferred mode, interface 554 may be a SCSI interface. Disk controller 682 may be a SCSI disk controller and data save disk storage system 552 may comprise at least one SCSI disk drive. In a preferred embodiment, disk controller 682 may be a NCR53C720 SCSI I/O Processor currently available from NCR Corporation. Further, the at least one SCSI disk drive of data save disk storage 552 may comprise Hewlett Packard C3010 5.25" drives, Fujitsu M2654 5.25" drives, or Seagate ST12550/ND 3.5" drives. The data save disk system may comprise a set of 2-GByte SCSI Disks in sufficient quantity to store a single copy of the entire contents of the XPC. The NCR I/O processor may provide the necessary SCSI interface between DSDC 636 and the at least one disk drive of data save disk system 552.

As indicated with reference to FIG. 7, USBC0 640 and USBC1 642 may be coupled to MBUS 638. Further, USBC0 640 and USBC1 642 may be coupled to control store 644 via interface 646. DSDC 636 may be coupled to micro bus 638, DBIF 618, and DSD block 666.

Memory 680 may comprise at least one RAM device. In a preferred mode, memory 680 comprises four RAM devices. Because the disk storage system is an addition to an existing HIA design, control store 644 may not have enough memory locations to store the added pointers and temporary data needed to support the data save disk function. Therefore, a primary function of memory 680 is to store the pointers and temporary data for USBC0 640 and USBC1 642 such that HIA 534 may support the disk data save function. Another primary function of memory 680 is to store SCRIPTS for disk controller 682. SCRIPT programs and the application thereof are discussed in more detail below. Additions to the USBC microcode which may be stored in memory 680 may provide the following functionality: (1) initialization of the data save disk system 552 and microcode control areas; (2) data save operation which may copy all of the data and control elements from nonvolatile memory 540 to data save disk system 552; (3) data restore operation which may copy all of the data and control elements from data save disk system 552 to nonvolatile memory 540; (4) checking the status of the disks in data save disk storage system 552 and informing maintenance if restore data exists thereon; and (5) various error detection and error handling subroutines.

As indicated above, USBC0 640 and USBC1 642 may read pointers and/or temporary data or the like from memory 680 through DSDC 636. To accomplish this, USBC0 640 and USBC1 642 may provide an address to DSDC 636 wherein DSDC 636 may arbitrate and obtain control of DSD bus 650. Once this has occurred, DSDC 636 may provide the address to memory 680. Memory 680 may then read the corresponding address location and provide the contents thereof back to DSDC 636 via DSD bus 650. DSDC 636 may then provide the pointers and/or temporary data or the like to USBC0 640 and USBC1 642 for processing. By using this protocol, USBC0 640 and USBC1 642 may obtain pointers and/or temporary data from memory 680 to control the operation of a download and/or upload operation between nonvolatile memory 540 and data save disk system 552. This may include providing a timer function to delay the download and/or upload operation for a predetermined time period.

Data save disk system 552 is directly coupled to nonvolatile memory 540 via DSD block 666, DSDC 636, DBIF 618, and system interface 536 (see FIG. 6). When a primary power source fails, and under the control of USBC0 640 and USBC1 642, DBIF 618 may read the data elements from nonvolatile memory via interface 630 wherein DBIF 618 may provide the data elements to DSDC 636 via interface 626. DSDC 636 may then perform arbitration for DSD bus 650, wherein the data elements may be read by disk controller 682. In this instance, disk controller 682 may be the bus master. Disk controller 682 may then provide the data elements to transceivers 684 wherein the data elements may be written to data save disk system 552. This configuration may have a number of advantages. First, the speed at which the data elements may be downloaded from nonvolatile memory 540 to data save disk system 552 may be enhanced due to the direct coupling therebetween. Second, significant power savings may be realized because only HIA 534, system interface 536, non-volatile memory 540, and data save disk system 552 need to be powered by the secondary power source to effect the download operation. This may significantly increase the amount of time that the secondary power source may power the system thereby increasing the number of data elements that may be downloaded.

Similarly, once the primary power source is restored, data save disk system 552 may upload the data elements directly to nonvolatile memory via DSD block 666, DSDC 636, DBIF 618, and system interface block 536, without any assistance from an instruction processor 512 or I/O processor 516. This may provide a high speed upload link between data save disk system 552 and nonvolatile memory 540.

FIG. 9A is a block diagram of the Data Save Disk Controller (DSDC) shown in FIGS. 7-8. The block diagram is generally shown at 636. DSDC 636 may comprise a DSD bus arbitration and control block 702 which may control the arbitration of DSD bus 650. DSD bus arbitration and control 702 may determine which device may assume the role of bus master of DSD bus 650. Preemptive priority is used to determine which device becomes bus master when more than one device is requesting bus mastership at any given time. In the exemplary embodiment, the priority order of bus mastership, from high priority to low priority, may be as follows: disk controller 682, USBC blocks 640, 642, and finally network interface module (NIM) 635. Memory 680 is not allowed to assume bus mastership of DSD bus 650 in the exemplary embodiment. DSD bus arbitration and control block 702 may be coupled to disk controller 682 via interface 651 (see FIG. 8). Interface 704 may be a bus request from disk controller 682 and interface 706 may be a bus acknowledge signal to disk controller 682.

In an exemplary embodiment, when disk controller 682 assumes bus mastership, it may relinquish bus ownership after a maximum of 16 bus cycles. Disk controller 682 may then wait 5 clock cycles before asserting a bus request to regain bus mastership. The 5 clock cycles provide a "fairness" delay to allow DSDC 636 to gain bus mastership if required.
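The priority order and the fairness delay may be illustrated with a small simulation. This is a sketch under simplifying assumptions: the device labels, the functions grant and simulate, the treatment of bus cycles and clock cycles as the same unit of time, and the premise that both the disk controller and the USBC's want the bus continuously are all introduced here for illustration.

```python
# Simplified simulation; names and the one-unit-per-cycle timing are assumptions.
PRIORITY = ["disk_controller", "usbc", "nim"]   # high priority to low priority
MAX_BURST = 16        # bus cycles before the disk controller relinquishes
FAIRNESS_DELAY = 5    # cycles the disk controller waits before re-requesting


def grant(requests):
    """Return the highest-priority requester, or None if nobody is asking."""
    for device in PRIORITY:
        if device in requests:
            return device
    return None


def simulate(total_cycles):
    """Return the bus owner per cycle when disk controller and USBC both contend."""
    owners = []
    burst = 0   # cycles held by the disk controller in the current burst
    wait = 0    # remaining fairness-delay cycles
    for _ in range(total_cycles):
        requests = {"usbc"}
        if wait == 0:
            requests.add("disk_controller")
        owner = grant(requests)
        owners.append(owner)
        if owner == "disk_controller":
            burst += 1
            if burst == MAX_BURST:
                burst, wait = 0, FAIRNESS_DELAY
        else:
            wait -= 1
    return owners
```

In this model the disk controller holds the bus for 16 cycles, the USBC's obtain it during the following 5 cycles, and the disk controller then preempts again, so the lower-priority requester is never starved outright.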

DSDC 636 may comprise at least four basic data paths. A first basic data path may provide an interface between DBIF 618 and DSD bus 650. This path may comprise a register 706, a multiplexer 710, a register 712, a FIFO block 714, a register 716, a multiplexer 718, a data-out-register 720, and an I/O buffer block 722. Register 706 may receive data elements from DBIF 618 via interface 626. Register 706 may be coupled to multiplexer 710 via interface 724. Also coupled to interface 724 may be a parity check block 708. Parity Check block 708 may check the parity of a data element as it is released from register 706.

Multiplexer 710 may select interface 724 when transferring data between DBIF 618 and DSD bus 650. The data may then be provided to register 712 via interface 726 wherein register 712 may stage the data for FIFO 714. The data may then be provided to FIFO 714 via interface 728. Also coupled to interface 728 may be a parity check block 730. Parity Check block 730 may check the parity of a data element as it is released from register 712.

FIFO 714 may comprise a 34 bit by 64 word FIFO. FIFO 714 may function as a buffer between DBIF 618 and DSD bus 650. This may be desirable because disk controller 682 may have to arbitrate for DSD bus 650, thus causing an unpredictable delay. FIFO 714 may store the data that is transferred by DBIF 618 to DSDC 636 until disk controller 682 is able to gain control of DSD bus 650. Once disk controller 682 gains access to DSD bus 650, FIFO 714 may wait for eight (8) words to be transferred from DBIF 618 to FIFO 714 before sending the data over DSD bus 650.
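The buffering behavior of FIFO 714 may be sketched as a bounded queue that releases data only after a threshold is reached. The class ThresholdFifo, its method names, and the choice to release every queued word once the threshold is met are assumptions made for this example; the text specifies only the 34 bit by 64 word size and the eight-word wait.

```python
# Illustrative model; class and method names and the drain policy are assumptions.
from collections import deque


class ThresholdFifo:
    DEPTH = 64        # words
    WORD_BITS = 34    # bits per word
    THRESHOLD = 8     # words that must be queued before sending

    def __init__(self):
        self.words = deque()

    def push(self, word):
        """Queue one word arriving from the DBIF side."""
        if len(self.words) >= self.DEPTH:
            raise OverflowError("FIFO full")
        if word >> self.WORD_BITS:
            raise ValueError("word does not fit in 34 bits")
        self.words.append(word)

    def drain(self, bus_granted):
        """Release queued words only if the bus is granted and 8 words are waiting."""
        if not bus_granted or len(self.words) < self.THRESHOLD:
            return []
        out = list(self.words)
        self.words.clear()
        return out
```

Holding data until eight words are queued lets the words be sent in a block once the bus is available, rather than one word at a time.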

Once released by FIFO 714, the data may be provided to register 716 via interface 732. Register 716 may store the output of FIFO 714. The data may then be provided to multiplexer 718 via interface 734. Multiplexer 718 may select interface 734 when transferring data between DBIF 618 and DSD bus 650. The data may then be provided to data-out-register 720 via interface 736, wherein data-out-register 720 may stage the data for I/O buffer block 722. Parity conversion block 738 may provide a two to four bit parity conversion. That is, data arriving from DBIF 618 via multiplexer 718 may only have two parity bits associated therewith. It may be desirable to convert the two parity bits to a four parity bit scheme. Data-out-register 720 may then provide the data to I/O buffer block 722 via interface 740. I/O buffer block 722 may comprise a plurality of bi-directional transceivers wherein each of the transceivers may be enabled to drive the data onto DSD bus 650 via interface 654.
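One way a two to four bit parity conversion could work is sketched below. The text does not say which data bits each parity bit covers or whether parity is odd or even, so the sketch assumes a 32-bit data word, even parity, one parity bit per 16-bit half in the two-bit scheme, and one parity bit per byte in the four-bit scheme. All function names are hypothetical.

```python
# Hypothetical scheme: coverage of each parity bit and even parity are assumptions.
def parity(value):
    """Even-parity bit: 1 if the number of set bits is odd, else 0."""
    return bin(value).count("1") & 1


def parity2(word):
    """Two parity bits, one per 16-bit half, high half first."""
    return [parity((word >> 16) & 0xFFFF), parity(word & 0xFFFF)]


def parity4(word):
    """Four parity bits, one per byte, most significant byte first."""
    return [parity((word >> shift) & 0xFF) for shift in (24, 16, 8, 0)]


def convert_2_to_4(word, incoming_parity):
    """Check the incoming two-bit parity, then regenerate four-bit parity."""
    if parity2(word) != list(incoming_parity):
        raise ValueError("parity error on incoming word")
    return parity4(word)
```

Checking the incoming parity before regenerating it keeps an error that occurred upstream from being hidden by freshly computed parity bits.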

A second basic data path of DSDC 636 may provide an interface between DSD bus 650 and DBIF 618. This path may comprise I/O buffer block 722, a data-in-register 742, multiplexer 710, register 712, FIFO block 714, register 716, a multiplexer 744, a register 746, a multiplexer 748, and a register 750. For this data path, I/O buffer block 722 may be enabled to accept data from DSD bus 650 and provide the data to data-in-register 742 via interface 752. Data-in-register 742 may provide the data to multiplexer 710 via interface 754. Also coupled to interface 754 may be a parity check block 756. Parity Check block 756 may check the parity of a data element as it is released by data-in-register 742. Parity conversion block 758 may provide a four to two bit parity conversion. That is, data arriving from DSD bus 650 may have four parity bits associated therewith while DBIF interface 634 may only have two parity bits associated therewith. It may be desirable to convert the four parity bits to a two parity bit scheme.

Multiplexer 710 may select interface 754 when transferring data between DSD bus 650 and DBIF 618. The data may then be provided to register 712 via interface 726 wherein register 712 may stage the data for FIFO 714. The data may then be provided to FIFO 714 via interface 728. Also coupled to interface 728 may be parity check block 730. Parity Check block 730 may check the parity of a data element as it is released from register 712.

FIFO 714 may function as a buffer between DSD bus 650 and DBIF 618. This may be desirable because DBIF 618 may have to wait to gain access to the streets via interface 632. FIFO 714 may store data that is transferred by DSD bus 650 until DBIF 618 can gain access to the streets.

Once released by FIFO 714, the data may be provided to register 716 via interface 732. Register 716 may store the output of FIFO 714. The data may then be provided to multiplexer 744 via interface 760. Multiplexer 744 may select the data provided by register 716 during a data transfer between DSD bus 650 and DBIF 618. Multiplexer 744 may then provide the data to register 746 via interface 762. Register 746 may then provide the data to multiplexer 748 via interface 764. Multiplexer 748 may select 16 bits at a time of a 32 bit word provided by register 746. This may be necessary because the DSD bus may comprise a 32 bit word while the interface to DBIF 618 may only be 16 bits wide. Also coupled to interface 764 may be parity check block 768. Parity Check block 768 may check the parity of a data element as it is released from register 746. Multiplexer 748 may then provide the data to register 750. Register 750 may provide the data to DBIF 618 via interface 634.
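The width adaptation performed by multiplexer 748 amounts to splitting each 32 bit word into two 16 bit transfers. In the sketch below the function name split_word and the choice to send the upper half first are assumptions; the text does not state the ordering.

```python
# Illustrative only; sending the upper half first is an assumption.
def split_word(word32):
    """Split a 32 bit word into two 16 bit halves, upper half first."""
    return [(word32 >> 16) & 0xFFFF, word32 & 0xFFFF]
```

For instance, the word 0x12345678 would be presented to the 16 bit interface as 0x1234 followed by 0x5678 under this ordering.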

A third basic data path of DSDC 636 may provide an interface between MBUS 638 and DSD bus 650. This path may comprise an I/O buffer block 770, a register 772, an address decode and recognition logic block 780, a multiplexer 774, a register 776, multiplexer 718, data-out-register 720, and I/O buffer block 722. For this data path, USBC's 640, 642 may provide a request to DSDC 636 via MBUS 638. The request may comprise a data word, an address, and/or a number of control signals. In the exemplary embodiment, a request comprising an address and a number of control signals may be provided over MBUS 638 first wherein a data word may follow on MBUS 638, if appropriate. I/O buffer block 770 may receive the request via interface 638 and may provide the request to register 772 via interface 784. Register 772 may provide the request to multiplexer 774 and to an address decode and recognition block 780 via interface 786. Also coupled to interface 786 may be a parity check block 788. Parity Check block 788 may check the parity of the request as it is released from register 772. Multiplexer 774 may select interface 786 during transfers from MBUS 638 to DSD bus 650. Multiplexer 774 may provide the request to register 776 via interface 790. Register 776 may then provide the request to multiplexer 718 via interface 792. Also coupled to interface 792 may be a parity check block 778. Parity Check block 778 may check the parity of the request as it is rel