United States Patent
5388189
Kung
February 7, 1995
Title
Alarm filter in an expert system for communications network
Abstract
An Expert System 10 for providing diagnostics to a data communications network 5. Alarms from a Network Manager 24 are received and queued by an Event Manager 117 and then filtered by an Alarm Filter 118 to remove redundant alarms. Alarms which are ready for processing are then posted to a queue referred to as a Bulletin Board 120. A Controller 112 determines which one of the posted goals has the highest priority by considering a priority number associated with the goal plus a time of arrival of the goal. An Inference Engine 122 uses information from an Expert Information Structure 111 to solve the highest priority goal by a process called instantiation.
Inventors:
Kung; Ching Y.
(Fort Lauderdale,
FL
)
Assignee:
Racal-Datacom, Inc.
(Sunrise,
FL
)
Appl. No.:
142565
Filed:
October 28, 1993
Current U.S. Class:
706/45
706/911
706/917
Field of Search:
395/916,917,50,917,908,909 340/517,521,506,522 371/4
U.S. Patent Documents
4385384
May 1983
Rosbury et al.
4388715
June 1983
Renaudin et al.
4591983
May 1986
Bennett et al.
4642782
February 1987
Kemper et al.
4648044
March 1987
Hardy et al.
4649515
March 1987
Thompson et al.
4656603
April 1987
Dunn
4658370
April 1987
Erman et al.
4670848
June 1987
Schramm
4675829
June 1987
Clemenson
4704695
November 1987
Kimura et al.
4713775
December 1987
Scott et al.
4740886
April 1988
Tanifuji et al.
4749985
June 1988
Corsberg
4752889
June 1988
Rappaport et al.
4752890
June 1988
Natarajan et al.
4754409
June 1988
Ashford et al.
4754410
June 1988
Leech et al.
4763277
August 1988
Ashford et al.
4767277
August 1988
Ashford et al.
4812819
March 1989
Corsberg
4816994
March 1989
Freiling et al.
4817092
March 1989
Denny
4829426
May 1989
Burt
4841456
June 1989
Holan et al.
4873687
October 1989
Breu
4965676
October 1990
Ejiri et al.
4972953
November 1990
Daniel, II et al.
4977390
December 1990
Saylor et al.
4999833
March 1991
Lee
5058033
October 1991
Bonissone et al.
5167010
November 1992
Elm et al.
5227121
July 1993
Scarola et al.
Foreign Patent Documents
01224845
Sep., 1989
JP
2206713A
Jan., 1989
GB
62-175060
Jul., 1987
JP
62-52601
Mar., 1987
JP
63-124148
May., 1988
JP
63-98741
Apr., 1988
JP
Other References
Wollenberg, B. F., "Feasibility Study for an Energy Management System Intelligent Alarm Processor," IEEE Trans. on Power Systems, May 1986, 241-247. .
Wilson, C.; "Network tools share AT&T spotlight," (Summary), Telephony, Nov. 6, 1989, 12(2). .
"Anatomy of a Diagnostic System", Bonnie Merritt, AI Expert, Sep. 1987, pp. 52-63. .
Texas Instruments, "Procedure Consultant User's Guide", May 1988. .
Texas Instruments and CGI-Test Bench "From Application Shell to Knowledge Acquisition System", Aug. 1987. .
"Representing Procedural Knowledge in Expert Systems: An Application To Process Control", M. Gallanti et al., pp. 345-352, Aug. 18-25, 198. .
Papsaxe, Expert Systems for Experts, pp. 152-156. .
Conar, "Expert Systems Solve Network Problems and Share the Information" Data Communications, May 1986, pp. 187-190. .
Amelink et al., "Dispatcher Alarm and Message Processing", IEEE Trans on Power Systems, v. PWRS-1, N. 1 Aug. 1986, pp. 188-194. .
Schulte et al., "Artificial Intelligence Solutions to Power System Operating Problems", IEEE Trans. on Power Systems,v. PWRS 2, n 4, Nov. 1987 pp. 920-926. .
Paula, "Expert System Manages Alarm Messages", Electrical World, Jul. 1989, pp. 47-48. .
Komai et al., "Artificial Intelligence Method for Power System Fault Diagnosis", 2nd Internal Conf. on Power Sys. Monitoring and Control, 1986, pp. 355-360. .
Fulvi et al., "An Expert System for Fault Section Estimation Using Information from Protective Relays and Circuit Breakers", IEEE Trans. on Power Delivery, v. PWRS-1, n. 4 Oct. 1983 pp. 83-90. .
Wullenberg, "Feasibility Study for an Energy Management System Intelligent Alarm Processor", IEEE Trans. on Power Systems, v. PWRS-1, n. 2, May 1986, pp. 241-247. .
Gross, "Applications of Artificial Intelligence Technology in Communications Networks", Expert Systems, Aug. 1988, v. 5 n 3, pp. 248-251. .
Cynar, "Computers Design Networks by Imitating the Experts", Data Communications, Apr. 1986, pp. 137-143. .
Miyazaki et al., "Dynamic Operation and Maintenance Systems for Switching Networks", IEEE Communications Magazine, Sep. 1990, pp. 34-35. .
Wollenberg, "Feasibility Study For An Energy Management System Intelligent Alarm Processor", IEEE Trans. on Power Systems, v. PWRS-1, n. 2, May 1986, pp. 241-247. .
Gevarter, W. P., "The Nature and Evaluation of Commercial Expert System Building Tools," Computer, May 1987, 24-41. .
Johnson et al., Expert Systems Architectures, Kogan Page Limited, 6-27, 1988. .
Slagle, J. R., "Applications of a Generalized Network-Based Expert System Shell", Proc. Symp. on the Engineering of Computer-Based Medical Systems. .
Na et al., "The Design of an Object-Oriented Modular Expert System Shell," Proc. 1990 Symp. on Applied Computing, Apr. 1990, 109-118.
Primary Examiner:
MacDonald; Allen R.
Assistant Examiner:
Downs; Robert W.
Attorney, Agent or Firm:
Newton; William A.
Parent Case Text
This application is a division of application Ser. No. 07/802,113, filed Dec. 4, 1991 now U.S. Pat. No. 5,295,230 granted Mar. 15, 1994, which is hereby incorporated by reference, which in turn is a division of Ser. No. 447,485, filed Dec. 6, 1989, now U.S. Pat. No. 5,159,685, granted Oct. 27, 1992, which is hereby incorporated by reference.
Claims
What is claimed is:
1. In an expert system for providing diagnostic services for a communication network including a plurality of network devices capable of generating alarms in response to network problems, a method of processing said alarms reported from said network devices, comprising the steps of:
receiving a current said alarm from one of said network devices;
comparing said current alarm with other said alarms previously received from other said network devices to determine whether said current alarm corresponds to one of said previously received alarms in that both said current alarm and said previously received alarms serve to report a common said network problem in said network;
said comparing step further including retrieving information from a database regarding a portion of said network's topology and determining from said portion if any of said network devices providing said previously received alarms and said one network device producing said current alarm could produce said previously received alarms and said current alarm, respectively, in response to said common network problem; and
discarding said current alarm if said current alarm and any of said previously received alarms are determined to be produced by said common problem.
2. The method of claim 1, wherein said comparing step compares a time of occurrence of said current alarm with times of occurrence of said previously received alarms.
Description
COPYRIGHT NOTICE
A portion of the disclosure of this patent document contains material which is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction of the patent document or the patent disclosure, as it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever.
BACKGROUND
1. Field of the Invention
This invention relates generally to the field of artificial intelligence, and more particularly to an expert system interfaced to, or forming a part of, a data communications network management system which automates network alarm handling and assists the network operator in isolating network problems.
2. Background of the Invention
Traditionally, data communications network management systems have concentrated on providing a set of fault isolation and test functions that allow an operator to locate, diagnose and isolate network problems.
Network problems are often expressed by the target network devices or objects (e.g. modems, multiplexers, etc. in the data communication environment) in the form of alarms or other error messages. Alarms can generally be considered events reported by target network devices when abnormal conditions exist. In some networks, alarms are generated autonomously while in others the alarms are actually responses to queries (polls). Although perhaps the former is more appropriately referred to as an alarm, both will be referred to as alarms for purposes of this document. Upon receiving the alarms from the network, the network management system displays the alarms on the operator's console. One of the network operator's responsibilities is to interpret the alarm and then isolate and resolve the problem associated with the alarm in the shortest time span. The operator then uses a series of test procedures to determine the exact cause of the problem. Once found, he may take remedial actions (such as calling for repair or switching in redundant equipment) and then move on to the next alarm.
Sometimes the operator may have difficulty in keeping up with the alarms since a single problem may result in many alarms from affected target network devices (network objects). In such cases, often the operator either ignores them, or just waits until a complaint call arrives. Furthermore, due to the different levels of network operators' experience in dealing with network faults, the problem could get further complicated because of wrong decisions in attempting to diagnose the problem and more time than necessary may be taken to solve the original problem. Such delays can be costly in large networks which are heavily relied upon to quickly move vast amounts of data in short periods of time to carry out the normal course of business. For example, large financial institutions rely upon such systems to move large sums of money electronically. Loss of that ability even for a relatively short period of time may be very costly to the institution. Similarly, airlines rely upon such systems to track passenger reservations and loss of that ability can result in flight delays or cancellations and loss of customers.
In a typical network management environment, a heterogeneous array of switching and transmission equipment may produce hundreds of alarms each day. Moreover, alarms are sometimes spurious, transient, redundant, time correlated, or too numerous to be handled at the same time. This makes a network fault diagnosis task a complex problem where considerable experience is required to interpret and isolate network faults.
Some experienced (expert) network operators acquire or develop strategies and "rules of thumb" in diagnosing networks. It is desirable to encode such knowledge into a knowledge base and make the best expert assistant available at all times, and at all locations. Ultimately, the benefits of routine use of such a system (called an expert system) include reduced operational cost, less down time, increased network performance, more effective fault management in the network, and the ability to build and effectively manage bigger networks.
A major difficulty with typical expert systems is the bottleneck encountered in acquiring knowledge from the expert. The job of a knowledge engineer is to act as an agent, or go-between to help a domain expert build a knowledge-based system. This task usually involves time consuming interviews, lengthy documentation and refinement, and transformation of the acquired knowledge into Artificial Intelligence (AI) based languages or representations. Often, the knowledge engineer and domain expert must work together to debug, extend, and refine the system iteratively. This is usually attributable to the fact that the knowledge engineer has far less domain knowledge than the expert and the expert has far less knowledge about artificial intelligence than the knowledge engineer. Such communication gaps constantly impede the progress and the process of transferring domain expertise into a knowledge-based system. Ultimately, this may lead either to a long development cycle or a failing system. To further complicate the matter, providing expert information is a continuing need in data communications networks since the networks tend to expand and become larger and more complex while adding new and different equipment as time goes on. With this evolution of the network comes an evolution of the products connected to the network (e.g. analog modems to digital-devices) and with it a change in the knowledge required to diagnose the network.
A second problem with typical expert systems is that as the complexity of the application domain increases, the classical rule-based system is not adequate. Knowledge management (knowledge acquisition, validation, and maintenance) is also a serious problem when the rule-based system evolves to a certain size. It has been claimed (see Buchanan and Shortliffe, 1984, Rule-Based Expert Systems, Addison-Wesley Publishing Company; or Hayes-Roth, Frederick, 1985, "Rule-Based System", Communications of ACM) that the benefit of the rule approach is the ease of modification and extension of the system because rules can be added independently at any time. However, more recent articles (see Brug, A. Bachant, J. McDermott, J., FALL 1986, "The Taming of R1", IEEE EXPERT; or Jackson, P. 1986, Introduction to Expert Systems, International Computer Science Series; or Rauch-Hindin, W. 1987, Artificial Intelligence in Business, Science, and Industry, Vol 1 & 2, Prentice-Hall) have shown in many cases that this is not true for medium to large systems such as large data communication networks.
For medium to large diagnostic systems, the rule-based approach has suffered from at least the following problems:
--lack of methodology;
--need for knowledge engineers to transfer knowledge into rules;
--difficult to control program behavior;
--limited generic processing;
--unanticipated rule interactions during rule updates; and
--systems with a large number of rules are difficult to manage, validate and maintain.
One alternative to the problems with traditional rule-based expert systems is flow-chart-based knowledge representation. In the flow-chart knowledge representation scheme, the domain knowledge base is simply represented as decision-trees (or flow-charts), similar to the way that many repair manuals are designed. Each decision node in the flow-chart is represented by an object--schema (data structure plus its associated procedures with inheritance). Node objects represent tests, and arcs represent the outcomes of tests leading to the next node object. A separate Inference Engine is constructed to reason through and traverse among flow-chart nodes. This flow-chart approach is particularly attractive in its knowledge acquisition capability. The domain expert can enter his domain knowledge directly into the system by simply manipulating the flow-chart objects by filling in predefined schematic forms.
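The flow-chart representation described above can be illustrated with a minimal sketch. It assumes each node holds a test function and a mapping from test outcomes (arcs) to the next node; the node names, the facts dictionary and the thresholds are hypothetical and are not taken from this document.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class FlowNode:
    """One decision node: a test plus arcs keyed by the test outcome."""
    name: str
    test: Optional[Callable[[dict], str]] = None        # returns an outcome label
    arcs: Dict[str, str] = field(default_factory=dict)  # outcome -> next node name
    conclusion: Optional[str] = None                     # set on terminal nodes only


def traverse(nodes: Dict[str, FlowNode], start: str, facts: dict) -> str:
    """Walk the flow chart from `start` until a terminal node is reached."""
    current = nodes[start]
    while current.conclusion is None:
        outcome = current.test(facts)            # run the node's test
        current = nodes[current.arcs[outcome]]   # follow the matching arc
    return current.conclusion


# Hypothetical two-test chart for a modem link.
nodes = {
    "check_carrier": FlowNode(
        "check_carrier",
        test=lambda f: "yes" if f["carrier_present"] else "no",
        arcs={"yes": "check_errors", "no": "line_fault"},
    ),
    "check_errors": FlowNode(
        "check_errors",
        test=lambda f: "high" if f["error_rate"] > 0.01 else "low",
        arcs={"high": "modem_fault", "low": "no_fault"},
    ),
    "line_fault": FlowNode("line_fault", conclusion="suspect line failure"),
    "modem_fault": FlowNode("modem_fault", conclusion="suspect modem degradation"),
    "no_fault": FlowNode("no_fault", conclusion="no fault found"),
}

result = traverse(nodes, "check_carrier", {"carrier_present": True, "error_rate": 0.05})
```

Because the traversal function knows nothing about the domain, the chart can be edited without touching it, which mirrors the separation of Inference Engine and knowledge base noted in the text.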
The following merits are experienced by using the flow-chart knowledge representation in capturing the domain knowledge:
--domain knowledge is transparent and explicit;
--knowledge acquisition is simplified;
--flow-chart browsing can be used to examine the relations among objects in a more systematic manner;
--flow-chart Inference Engine is completely separated from the flow-chart knowledge bases;
--inference processing is quick and effective due to the deterministic nature of the flow-chart representation;
--facilitates fast incremental knowledge acquisition and verification cycle; and
--reduced risk in knowledge maintenance.
However, with the pure flow-chart-based knowledge representation scheme, there are still some deficiencies that have been realized in the course of capturing domain knowledge, such as:
--lack of formal methodology and knowledge structuring;
--lack of goal (hypothesis) directed reasoning capability;
--lack of top-down problem decomposition methodology;
--state of the world is often not adequately represented;
--incomplete and unreliable heuristic knowledge cannot be fully captured and expressed; and
--monotonic reasoning is inadequate for large diagnostics systems.
The present invention ameliorates these difficulties in an expert system with advantages such as an enhanced User Interface, Inference Engine and knowledge representation as described below.
SUMMARY OF THE INVENTION
This invention provides an improved expert system with an enhanced ability to interface directly with the expert, largely bypassing the need for a knowledge engineer and speeding up the knowledge acquisition process. It does so by providing a user friendly interactive interface from which the domain expert can usually directly enter the knowledge into the knowledge base. Ultimately, the benefits of routine use of embodiments of such a system include reduced operational cost, less down time, increased network performance, more effective fault management in the network, and the ability to build and manage bigger networks. In addition, the invention provides a mechanism for filtering redundant alarms, providing several modes of operation, prioritizing goals, suspending or pausing operation as well as other features.
The following objects, features and advantages are met by one or more embodiments of the present invention.
It is an object of the present invention to provide a knowledge base which the domain expert can quickly and easily initialize, debug, display and maintain with minimal use of a knowledge engineer.
It is an advantage of the present invention to provide a knowledge based system to assist network operators in isolating network faults.
It is a further advantage of the present invention to be capable of preempting the current diagnostic process to deal with more urgent problems, and then continue processing the original diagnosis from where it left off.
It is another advantage of some embodiments of the present invention to employ non-monotonic reasoning. Often in network diagnostics, there will only be enough information to hypothesize as to the problem. When more information becomes available, it is used to refine its hypotheses.
It is a further advantage of the present invention that the system is easily modified as expert knowledge and the underlying system under diagnosis change.
It is a further advantage of the present invention that labor intensiveness of knowledge acquisition, documentation, verification, validation and maintenance is reduced.
These and other objects, features and advantages of the invention will become apparent to those skilled in the art upon consideration of the following description of the invention.
In a data communication network according to one embodiment of the invention, a method of processing alarms from network objects, includes the steps of: receiving a first alarm; determining whether or not the first alarm is a redundant alarm by comparing the first alarm with previously received alarms; and placing the first alarm in a queue for processing by an inference engine if the first alarm is not a redundant alarm.
In another embodiment of the invention, a method of processing events in an expert system, includes the steps of: receiving a first event; determining whether or not the first event is a redundant event by comparing the first event with other events which have been received; and placing the first event in a queue for processing by an inference engine if the first event is not a redundant event.
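The two filtering methods above share one shape: receive an event, test it for redundancy against earlier events, and queue it only if it is new. The following is a minimal sketch of that shape. The redundancy test used here (same device and same event type) is an assumption made for illustration, as are the class name and event fields.

```python
from collections import deque


class RedundancyFilter:
    """Queues an event for the inference engine unless it is redundant."""

    def __init__(self):
        self.seen = set()     # keys of events already received
        self.queue = deque()  # events waiting for the inference engine

    def receive(self, event: dict) -> bool:
        """Return True if the event was queued, False if it was redundant."""
        # Assumed redundancy rule: same device and same event type.
        key = (event["device"], event["type"])
        if key in self.seen:
            return False
        self.seen.add(key)
        self.queue.append(event)
        return True


alarm_filter = RedundancyFilter()
first = alarm_filter.receive({"device": "modem-7", "type": "carrier_loss"})
repeat = alarm_filter.receive({"device": "modem-7", "type": "carrier_loss"})
other = alarm_filter.receive({"device": "modem-7", "type": "high_error_rate"})
```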
A method for prioritizing events for processing by an inference engine according to an embodiment of the invention, includes the steps of: receiving an event; translating the event into a goal; classifying the goal as one of a plurality of goal types; assigning a priority number to the goal based upon the importance attributed to the goal type of the goal by a domain expert; tagging the goal with a time associated with occurrence of the event; and determining that the goal has a higher prioritization than another goal with the same priority number based upon the time.
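The prioritization steps above can be sketched with a heap ordered on the pair (priority number, event time). Two points are assumptions rather than statements from this document: that a lower priority number means more urgent, and that the earlier event wins a tie. The goal types and priority values are hypothetical.

```python
import heapq
from dataclasses import dataclass, field

# Hypothetical priority numbers a domain expert might assign per goal type.
# Assumption: a lower number is more urgent.
GOAL_PRIORITY = {"link_down": 1, "high_error_rate": 2, "config_change": 3}


@dataclass(order=True)
class Goal:
    priority: int                     # compared first
    event_time: float                 # tie-breaker for equal priority numbers
    name: str = field(compare=False)  # not used for ordering


def event_to_goal(event_type: str, device: str, event_time: float) -> Goal:
    """Translate an event into a goal, classify it, and tag it with its time."""
    return Goal(GOAL_PRIORITY[event_type], event_time,
                f"diagnose {event_type} on {device}")


board = []
heapq.heappush(board, event_to_goal("high_error_rate", "modem-7", 100.0))
heapq.heappush(board, event_to_goal("link_down", "mux-2", 105.0))
heapq.heappush(board, event_to_goal("link_down", "mux-5", 101.0))

order = [heapq.heappop(board).name for _ in range(3)]
```

With these inputs the two `link_down` goals come out before the `high_error_rate` goal, and the `mux-5` goal precedes the `mux-2` goal because its event occurred earlier.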
An expert network diagnostic system for diagnosing problems in a communication network in a semi-automatic mode according to the present invention, includes a network manager for performing diagnostic tests on the network, the diagnostic tests including non-interruptive tests which do not significantly impact operation of the network and interruptive tests which require interfering with the function of a device in the network while the interruptive test is performed. An Expert System determines an appropriate one of the diagnostic tests to be performed to diagnose a problem with the network. It is determined whether the appropriate test is an interruptive or non-interruptive test and the appropriate test is invoked if the appropriate test is non-interruptive. Consent of an operator is obtained to perform the appropriate test if the appropriate test is interruptive.
An expert diagnostic system for providing diagnostics to a diagnostic target in a semi-automatic mode according to an embodiment of the present invention includes means for performing diagnostic tests. An appropriate one of the diagnostic tests to perform to diagnose a problem is selected. It is determined if the appropriate test meets a predetermined criteria. The appropriate test is invoked if the appropriate test meets the criteria. Consent of an operator is obtained to invoke the appropriate test if the appropriate test does not meet the predetermined criteria.
In an expert system according to an embodiment of the present invention, a method of applying expert knowledge to a goal, includes the steps of: posting the goal to a display; retrieving a knowledge tree corresponding to the goal; instantiating the knowledge tree; posting information from nodes of the knowledge tree to the display as each node is instantiated so that only information from instantiated nodes appear on the display.
In an expert system for providing diagnostic services for a communication network, a method according to an embodiment of the present invention for processing events reported from the network, includes the steps of: receiving a current event from the network; comparing the current event with other events previously received from the network to determine whether the current event corresponds to a previously received event in that both the current event and the previously received event serve to report a common problem in the network; and discarding the current event if the current event corresponds to the previously received event.
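One way to decide whether two events "report a common problem" is the approach recited in claim 1: consult the network topology and check whether a single fault could explain both alarms, optionally also comparing times of occurrence as in claim 2. The sketch below is an illustration under stated assumptions, not the patented implementation: the topology table, the 30-second window and the rule "two alarms may share a cause if their devices share any upstream device" are all hypothetical.

```python
from typing import Dict, List, Set

# Hypothetical topology: each device maps to the devices directly upstream of it.
UPSTREAM: Dict[str, Set[str]] = {
    "modem-a": {"mux-1"},
    "modem-b": {"mux-1"},
    "modem-c": {"mux-2"},
    "mux-1": set(),
    "mux-2": set(),
}


def possible_causes(device: str) -> Set[str]:
    """The device itself plus everything upstream of it."""
    causes = {device}
    stack = list(UPSTREAM.get(device, ()))
    while stack:
        d = stack.pop()
        if d not in causes:
            causes.add(d)
            stack.extend(UPSTREAM.get(d, ()))
    return causes


def filter_alarms(alarms: List[dict], window: float = 30.0) -> List[dict]:
    """Discard an alarm if an earlier alarm, close in time, could share its cause."""
    received, kept = [], []
    for alarm in alarms:
        causes = possible_causes(alarm["device"])
        redundant = any(
            abs(alarm["time"] - prev["time"]) <= window
            and causes & possible_causes(prev["device"])
            for prev in received
        )
        received.append(alarm)
        if not redundant:
            kept.append(alarm)
    return kept


alarms = [
    {"device": "modem-a", "time": 0.0},
    {"device": "modem-b", "time": 5.0},    # shares mux-1 with modem-a: discarded
    {"device": "modem-c", "time": 6.0},    # different branch: kept
    {"device": "modem-a", "time": 100.0},  # outside the time window: kept
]
kept_devices = [a["device"] for a in filter_alarms(alarms)]
```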
A method for applying expert knowledge to an alarm in a system to be diagnosed by an expert system residing on a computer in one embodiment of the present invention, includes in combination the steps of: receiving an alarm from the system; mapping the alarm to a corresponding expert knowledge source, the corresponding expert knowledge source being one of a plurality of available expert knowledge sources; retrieving the corresponding expert knowledge source; instantiating at least a portion of the corresponding expert knowledge source; and invoking an inference engine to find a solution to the alarm using the instantiated portion of corresponding expert knowledge source.
A method for entering expert information into an expert system residing on a computer, according to an embodiment of the present invention, includes the steps of: defining a hypothesis tree node by entering attributes of the hypothesis tree node into the expert system, the attributes including a first identifier for the hypothesis tree node and a second identifier for a node connected to the hypothesis tree node by a branch of the hypothesis tree; adding the first identifier to a list containing defined nodes; determining whether or not the second identifier is on the list of defined nodes; determining whether or not the second identifier is on a list of undefined nodes if the second identifier is not on the list of defined nodes; and adding the second identifier to the list of undefined nodes if the second identifier is not already on the list of undefined nodes.
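The bookkeeping in this method amounts to maintaining two lists, one of defined nodes and one of nodes that have been referenced but not yet defined. A minimal sketch follows; the class and node names are hypothetical. Removing a node from the undefined list once it is defined is an assumption that the passage above does not state.

```python
from typing import List


class NodeRegistry:
    """Tracks which knowledge-source nodes are defined and which are only referenced."""

    def __init__(self):
        self.defined: List[str] = []
        self.undefined: List[str] = []

    def add_node(self, node_id: str, points_to: List[str]) -> None:
        # Add the first identifier to the list of defined nodes.
        if node_id not in self.defined:
            self.defined.append(node_id)
        # Assumption: a node that was pending stops being pending once defined.
        if node_id in self.undefined:
            self.undefined.remove(node_id)
        # Each referenced node that is neither defined nor already pending
        # goes on the list of undefined nodes.
        for target in points_to:
            if target not in self.defined and target not in self.undefined:
                self.undefined.append(target)


registry = NodeRegistry()
registry.add_node("engine_wont_start", ["battery_dead", "no_fuel"])
pending_after_first = list(registry.undefined)
registry.add_node("battery_dead", ["no_fuel"])
```

The undefined list gives the domain expert a running to-do list of nodes still to be entered.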
In another embodiment of the present invention, an expert system for use on a computer, includes a mechanism for adding a node for entering expert information represented by nodes of a knowledge source by entering attributes of the nodes into a template. The knowledge source includes at least a hypothesis tree node terminating in a flow-chart node. The attributes of the node include a node identifier attribute which gives a name to a node being added, a node type attribute which describes fundamental characteristics of the node being added and which distinguishes between hypothesis tree type nodes and flow-chart type nodes, and a points-to attribute which gives the name of a node branching from the node being added.
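The three template attributes named above map naturally onto a small record type. The sketch below is illustrative only: the field names, the enumeration values and the helper that checks for a hypothesis node pointing to a flow-chart node are assumptions, not the template layout used by the system.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class NodeType(Enum):
    HYPOTHESIS = "hypothesis"
    FLOW_CHART = "flow-chart"


@dataclass
class NodeTemplate:
    node_id: str                                        # node identifier attribute
    node_type: NodeType                                 # node type attribute
    points_to: List[str] = field(default_factory=list)  # points-to attribute


def has_hypothesis_to_flow_chart_link(nodes: Dict[str, NodeTemplate]) -> bool:
    """True if some hypothesis node points to a defined flow-chart node."""
    return any(
        node.node_type is NodeType.HYPOTHESIS
        and any(
            target in nodes and nodes[target].node_type is NodeType.FLOW_CHART
            for target in node.points_to
        )
        for node in nodes.values()
    )


source = {
    "line_problem": NodeTemplate("line_problem", NodeType.HYPOTHESIS, ["test_loopback"]),
    "test_loopback": NodeTemplate("test_loopback", NodeType.FLOW_CHART),
}
ok = has_hypothesis_to_flow_chart_link(source)
```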
In an expert system according to the present invention, a method of applying expert knowledge to a goal, includes the steps of: posting the goal to a goal queue in a memory; retrieving a knowledge tree corresponding to the goal; instantiating the knowledge tree; posting information from nodes of the knowledge tree to the memory as each node is instantiated so that only information from instantiated nodes are posted to the memory.
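The effect of posting only instantiated nodes is that the memory (or display) shows the path actually taken through the knowledge tree and nothing else. In the sketch below, "instantiating" a node is reduced to visiting it, which is a simplification; the tree contents, goal name and facts are hypothetical.

```python
from typing import Callable, Dict, List, Optional, Tuple

# A node is (information text, function choosing the next node or None).
Node = Tuple[str, Callable[[dict], Optional[str]]]

# Hypothetical knowledge trees keyed by goal name.
KNOWLEDGE_TREES: Dict[str, Dict[str, Node]] = {
    "diagnose_no_response": {
        "root": ("device not responding",
                 lambda f: "device" if f["carrier"] else "line"),
        "line": ("carrier lost: suspect the line", lambda f: None),
        "device": ("carrier present: suspect the device", lambda f: None),
    }
}


def solve(goal: str, facts: dict, memory: List[Tuple[str, str]]) -> None:
    """Post the goal, retrieve its tree, and post each node as it is visited."""
    memory.append(("goal", goal))
    tree = KNOWLEDGE_TREES[goal]          # retrieve the tree for this goal
    name: Optional[str] = "root"
    while name is not None:
        info, choose_next = tree[name]
        memory.append((name, info))       # posted only when the node is reached
        name = choose_next(facts)


memory: List[Tuple[str, str]] = []
solve("diagnose_no_response", {"carrier": False}, memory)
posted_nodes = [name for name, _ in memory]
```

Here the `device` node is never reached, so it never appears in `memory`.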
In the preferred embodiment of the present invention, an Expert System 10 provides diagnostics to a data communications network 5. Alarms from a Network Manager 24 are received and queued by an Event Manager 117 and then filtered by an Alarm Filter 118 to remove redundant alarms. Alarms which are ready for processing are then posted to a queue referred to as a Bulletin Board 120. A Controller 112 determines which one of the posted goals has the highest priority by considering a priority number associated with the goal plus a time of arrival of the goal. An Inference Engine 122 uses information from an Expert Information Structure 111 to solve the highest priority goal by a process called instantiation. The process of solving the goal may be interrupted by a pause or suspension in order to perform tests under the direction of a Network Test Manager 124 or retrieve other information during which time other goals may be processed. Expert information is entered using a user friendly User Interface 104 which reduces need for the participation of a Knowledge Engineer. Configuration information about the network is maintained in a Network Structure Knowledge Base 109 by a Network Configuration Module 108. The Expert System 10 may operate in any of three modes: manual, wherein tests must be approved by or directed by an operator; automatic, where tests are run automatically without operator intervention; and semiautomatic, where operator approval is required for certain tests such as interruptive tests and other tests such as non-interruptive tests may proceed without operator intervention.
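The three operating modes reduce to one decision: whether a given test may run without asking the operator. A minimal sketch of that decision follows. The function and parameter names are hypothetical, and treating manual mode as a simple yes/no approval is a simplification of "approved by or directed by an operator".

```python
from enum import Enum
from typing import Callable


class Mode(Enum):
    MANUAL = "manual"
    AUTOMATIC = "automatic"
    SEMIAUTOMATIC = "semiautomatic"


def may_run(test_is_interruptive: bool, mode: Mode,
            ask_operator: Callable[[], bool]) -> bool:
    """Decide whether a diagnostic test may be invoked in the given mode."""
    if mode is Mode.AUTOMATIC:
        return True                # no operator intervention
    if mode is Mode.MANUAL:
        return ask_operator()      # every test needs approval
    # Semiautomatic: only interruptive tests need operator consent.
    return ask_operator() if test_is_interruptive else True


def deny() -> bool:
    """Stand-in for an operator who refuses every request."""
    return False


semi_non_interruptive = may_run(False, Mode.SEMIAUTOMATIC, deny)
semi_interruptive = may_run(True, Mode.SEMIAUTOMATIC, deny)
```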
The features of the invention believed to be novel are set forth with particularity in the appended claims. The invention itself, however, both as to organization and method of operation, together with further objects and advantages thereof, may be best understood by reference to the following description taken in conjunction with the accompanying drawing.
BRIEF DESCRIPTION OF THE DRAWING
FIG. 1 is a block diagram of a network and network management system interconnected with the Expert System of the present invention.
FIG. 2 is a functional block diagram of an embodiment of the invention.
FIG. 3 is a high level flow-chart showing the processing flow for each phase of operation of the Expert System of the present invention.
FIG. 4 is a goal state transition diagram for the operation of the present invention.
FIG. 5 shows a flow-chart of the operation of the User Interface Module 104.
FIG. 6 shows a flow-chart of the operation of procedure Add Node.
FIG. 7 shows a flow-chart of the operation of procedure Get Node Attributes.
FIG. 8 shows a flow-chart of the operation of procedure Modify Node Information.
FIG. 9 shows a flow-chart of the operation of procedure Modify Node Type.
FIG. 10 shows a flow-chart of the operation of procedure Delete Node.
FIG. 11 shows a flow-chart of the operation of procedure Copy Node.
FIG. 12 shows an example of the results of procedure Display Knowledge Source.
FIG. 13 shows a flow-chart of the operation of procedure Show Knowledge Source.
FIG. 14 shows a flow-chart of the operation of procedure Load Knowledge Source.
FIG. 15 shows a flow-chart of the operation of procedure Save Knowledge Source.
FIG. 16 shows a flow-chart of the operation of procedure Clear Knowledge Source.
FIG. 17 shows a flow-chart of the operation of procedure Change Test Mode.
FIG. 18 shows a flow-chart of the operation of procedure Run.
FIG. 19 shows a flow-chart of the operation of procedure Run one which runs the ENDS 10 for one goal only.
FIG. 20 shows a flow-chart of the operation of procedure Bulletin Board Status which shows the status of the Bulletin Board 120.
FIG. 21 shows a flow-chart of the operation of procedure Resume which resumes operation on a paused Goal.
FIG. 22 shows a flow-chart of the operation of procedure Exit which exits the ENDS.
FIG. 23 shows a functional block diagram of a hypothetical automobile diagnostic system used to assist in explaining the present invention.
FIG. 24 shows a simplified Expert Information Structure for an automobile diagnostic system.
FIG. 25, which is broken down into FIGS. 25A and 25B due to size, shows a simplified Expert Information Structure for a data communication network diagnostic system.
FIG. 26 shows a flow-chart of the operation of the Controller 112.
FIG. 27 shows a flow-chart of the operation of the Event Manager 117 in retrieving events from the Network Manager 24.
FIG. 28 shows a portion of the Alarm Queue 114.
FIG. 29 shows a portion of the Response Queue 116.
FIG. 30 shows a portion of the Configuration Queue 113.
FIG. 31 shows a flow-chart of the operation of the Event Manager 117 in sending events to the Network Manager 24.
FIG. 32 shows a portion of the Request Queue 115.
FIG. 33 shows a flow-chart of the operation of the Alarm Filter 118.
FIG. 34 illustrates the process of instantiation by the Inference Engine 122.
FIG. 35, which is broken down into FIGS. 35A, 35B, 35C and 35D due to size, shows a flow-chart of the operation of the Inference Engine 122 of the present invention.
FIG. 36 shows an overview of the Bulletin Board 120 as constructed by the Inference Engine using the example automobile diagnostic system.
FIG. 37 shows a flow-chart of the operation of the Network Test Manager 124 in queuing requests for information from the Network Manager 24.
FIG. 38 shows a flow-chart of the operation of the Network Test Manager 124 in retrieving responses from the Event Manager 117.
FIG. 39 shows a flow-chart of the operation of the Network Configuration Module 108.
FIG. 40 shows an example screen display for the Knowledge Acquisition process of the present invention.
FIG. 41 shows an example screen display for the System Operation process of the present invention.
DETAILED DESCRIPTION OF THE INVENTION
The present invention has broad applications to diagnostic systems in general and may be readily adapted to a broad variety of problems. In particular, the present invention uses a representation of knowledge, referred to herein as "Expert Information Structure" or "Structured Flow Graph" knowledge representation, and a user interface and method of processing knowledge which greatly enhances the extraction of expert knowledge from a domain expert and facilitates the diagnosis of problems in a data communication network. The preferred implementation is a data communication network diagnostic environment as will be described below in greater detail. The invention itself, however, should not be so limited since it may be broadly applicable to many types of diagnostics systems.
Environment of the Preferred Embodiment
Turning now to the drawings in which like reference numerals represent like or similar structures throughout the various figures, FIG. 1 illustrates an exemplary data communication network 5 interconnected with an Expert Network Diagnostic System (ENDS) 10 and a Network Manager 24. The Expert Network Diagnostic System 10 may be synonymously referred to herein as ENDS, Expert System and the like. The ENDS 10 performs diagnostic functions on the diagnostic target network 5 in the preferred embodiment, but the invention itself should not be so limited since other diagnostic targets can support a similar Expert System.
The ENDS 10 may be based on an engineering workstation 12 such as a Sun/3.TM. workstation or other suitable host on which a multiprocessing operating system 14 such as the Unix.TM. operating system and an Expert System Programming Environment 16 such as Carnegie Group's Knowledge Craft.TM. software have been installed. Object-Oriented Programming (OOP) languages such as C++ can also be used to implement the present invention. By careful programming, such a system could be made efficient enough to operate on a personal computer or the like. The Operating System 14 manages the input and output of data to the ENDS 10 as well as the scheduling of ENDS 10 processes in a known manner. The Expert System Programming Environment 16 compiles, interprets and translates the code of the ENDS 10 processes. To facilitate input, output and storage to ENDS 10, a terminal 18 with built-in display, printer 20 and disk drive 21 may be attached to the workstation 12.
The ENDS 10 communicates with Network Manager 24 via a connection 22 (for example, an RS 232 connection). Network Manager 24 may be similar to Network Management systems such as CMS.RTM. series network management systems commercially available from Racal-Milgo, 1601 N. Harrison Parkway, Fort Lauderdale, Fla. Such network management systems are further disclosed in Rosbury et al. U.S. Pat. No. 4,385,384 which is hereby incorporated by reference. The Network Manager 24 is preferably based upon a minicomputer 26, such as a DEC Microvax II.TM. minicomputer, or engineering workstation on which a multiprocessing operating system 28 such as the Unix.TM. operating system and a database manager 30 such as the Oracle.RTM. database manager by Oracle Corporation have been installed. Via connection 22, alarms are passed from the Network Manager 24 to ENDS 10. Similarly, commands, much like those an expert operator would enter from the Network Manager 24's terminal 33, are sent via connection 22 from the ENDS 10 to the Network Manager 24. In other embodiments, the Network Manager 24 and ENDS 10 may be installed on a microcomputer and other environments may occur to those skilled in the art. The operating system 28 manages the input and output of data to the Network Manager 24 as well as the scheduling of Network Management processes. The database manager 30 manages data describing the network configuration in a conventional manner. To facilitate input to and output from the Network Manager 24, a disk drive 32, terminal 33 and printer 34 are attached to the minicomputer 26.
In the present example the diagnostic target, a data communication network 5, includes a host computer (Host) 40 coupled to a Front End Processor (FEP) 42 via connection 44 which is further coupled to a network of objects in this example including data modems and multiplexers. In general, the network may also include other objects such as Digital Service Units (DSU's), encryption devices, restoral devices, switches, terminal adapters, Packet Assemblers and Disassemblers (PAD's) as well as other such devices. In general, the network 5 shown in FIG. 1 is very simple compared to real life data networks and is intended only for illustrative purposes.
In the example network 5, three distinct branches emerge from the FEP 42. A first branch is made up of a point to point analog connection in which a central modem 46 is connected to the FEP 42 to receive and send information thereto. The modem 46 is connected via an analog transmission line 48 to a remote modem 50. Remote modem 50 is in turn connected to a terminal 52 or other data terminal equipment (DTE) via connection 54.
A second branch of the network starts with central modem 58 which is coupled to the FEP 42. This modem 58 feeds a multidrop connection (a connection where more than one modem is directly served by the same transmission line) via a transmission line 60. The first drop on this multidrop transmission line 60 feeds a remote modem 62 which in turn is connected to a terminal 64. The second drop feeds a remote modem 66 which is connected to a terminal 68. Similarly, the third drop feeds a remote modem 70 which is connected to a terminal 72.
A third branch of the network starts with a multiplexer 74 which is interconnected with the FEP 42 via four connections. The multiplexer 74 has its output driving a high speed modem 78. Modem 78 is coupled through a point-to-point analog transmission line 80 to a similar modem 84. Modem 84 is then coupled to a multiplexer 86 which is in turn coupled to terminals 88, 90, 92 and 94 (and/or other data terminal equipment (DTE) devices).
The Network Manager 24 communicates with the network objects via, for example, RS232 connections 96, 97, 98 and 99 to central site objects such as multiplexer 74 and modems 46, 58 and 78 respectively. These objects communicate with the remaining network objects via a multiplexed secondary diagnostics channel (frequency division multiplexed in the preferred embodiment). Modems capable of doing so are commercially available as the Racal-Milgo Omnimode.RTM. series modems. Multiplexers with such capabilities are commercially available as the Racal-Milgo Omnimux.RTM. series multiplexers. These connections enable the Network Manager 24 to communicate with the entire network 5 through these central site objects. Those skilled in the art will appreciate that the network shown in FIG. 1 is somewhat simplified compared to real world networks and is presented only as a mechanism for understanding the general environment of the invention.
Messages passed between the Network Manager 24 and the network objects include alarms, informational messages and instructions. Network objects can signify malfunctions by sending alarms to the Network Manager 24. Network objects can send informational messages in response to requests from the Network Manager 24. The Network Manager 24 can send instructions to network objects to perform diagnostic tests or other functions such as loop back tests or switching functions.
The Network Manager 24 and the central network objects 46, 58, 74 and 78 can exchange alarms, informational messages and instructions directly or through a dedicated network such as a dial-up network or an X.25 network. Remote objects, i.e. network objects located at remote sites, communicate with the Network Manager 24 through the central objects via a multiplexed in-band or out-of-band secondary diagnostic channel in a known manner.
As previously mentioned, some network management systems' diagnostics are not autonomous alarm based. However, equivalent information is generally available at the Network Manager 24 for use by the expert system 10. For purposes of this discussion, all such information will be referred to as alarms.
When the Network Manager 24 receives an alarm, it informs the operator via the terminal 33 and/or printer 34. The operator can then use the terminal 33 to send an instruction to the object which sent the alarm. For example, he can instruct the sending object to perform diagnostic tests and send back the results. Upon determining the nature of the malfunction, the operator can send remedial instructions to the object such as lowering transmission speed or switching in a redundant object or he can take other corrective actions such as contacting the telephone company for repairs.
Through the connection 22, the ENDS 10 can communicate with the network 5 as if it were the operator of the Network Manager 24. The Network Manager 24 forwards alarms to ENDS 10. ENDS 10 can then send instructions to network objects requesting more information or initiating various diagnostic tests. Upon determining the nature of the malfunction, the ENDS 10, in conjunction with the Network Manager 24 can send remedial instructions in some cases such as switching in redundant equipment or rerouting traffic.
Those skilled in the art will appreciate that the system and network shown in FIG. 1 are intended to be illustrative and that many variations are possible. For example, the Network Manager 24 and the ENDS 10 could possibly coexist on the same computer system in an alternative embodiment so that duplicate operating systems, disk drives, printers and terminals may not necessarily be required. Of course, such a system might require a more powerful multitasking computer system than those individually required by the Network Manager 24 and the ENDS 10 in the illustrative embodiment in order to achieve similar performance speed. Similarly, great variety can exist in the actual communications network. It should also be recalled that the Expert System of the present invention can be used for purposes other than network diagnostics.
Overview of the Invention
The architecture of the ENDS 10 is illustrated in some detail by the functional block diagram of FIG. 2 while the overall operational flow is described in conjunction with FIG. 3 and the defined states of goals are covered in FIG. 4. In order to understand the invention, its overall structure is first briefly presented in conjunction with FIG. 2. The overall flow of operation will then be discussed in conjunction with FIG. 3, followed by a discussion of the state diagram of FIG. 4. At this point a more detailed discussion of the interaction and operation of each of the individual components of FIG. 2 will proceed.
ENDS 10, in its preferred form, comprises several parts: a User Interface Module 104, a Network Configuration Module 108, a Static Knowledge Base 109 comprising a network structure knowledge base 110 and an expert information structure 111, a Controller 112, an Event Manager 117, an Alarm Filter 118, a Bulletin Board 120, an Inference Engine 122, and a Network Test Manager 124, as seen in FIG. 2.
The basic function of each of these components is described below. Although some of the terminology used in these brief descriptions has not yet been introduced, this summary will be useful as a glossary for the reader's later reference.
EVENT MANAGER 117--Receives "events" from the Network Manager 24, determines what kind of event each one is (alarm, response, or configuration), constructs a "record" using the information about the event and places the record in the appropriate queue (Alarm Queue 114, Response Queue 116, or Configuration Queue 113). The Event Manager also receives requests from the Network Test Manager 124 and places them in Request Queue 115 for forwarding to the Network Manager 24. Responses in the Response Queue 116 are answers to requests and are forwarded to the Network Test Manager 124. Alarms are sent to the Alarm Filter 118. Configuration information is sent to the Network Configuration Module 108. The Event Manager 117 also prioritizes the events where required.
ALARM FILTER 118--Posts alarms in the form of goals to the Bulletin Board 120 after removing redundant alarms through a filtering process.
NETWORK CONFIGURATION MODULE 108--Manages the Network Structure Knowledge Base 110 by interpreting information in the Configuration Queue 113 and updating the Network Structure Knowledge Base 110 accordingly so that the Expert Network Diagnostic System (ENDS) 10 always has a current picture of what the network 5 looks like.
NETWORK TEST MANAGER 124--Sends instructions to the Network Manager 24 requesting tests and further information needed to perform diagnostic functions.
BULLETIN BOARD 120--A global data structure, which could also be thought of or referred to as the Goal Queue, which holds goals to be processed by the Inference Engine 122. The Bulletin Board 120 also dynamically posts goals, tests etc. as processed by the Inference Engine 122 in a process referred to herein as `instantiation`.
STATIC KNOWLEDGE BASE 109--Stores the Expert Information Structure 111 which holds the knowledge of the Domain Expert 101 in a form usable by the Inference Engine 122. Also stores the Network Structure Knowledge Base 110 which contains information about the makeup and structure of the Network 5 and is maintained by the Network Configuration Module 108.
INFERENCE ENGINE 122--Determines which rules in the Expert Information Structure 111 to apply and applies rules stored in the Expert Information Structure 111 to goals on the Bulletin Board 120 to determine cause of Alarms. It does so by instantiating a goal tree for each goal in accordance with the Expert Information Structure 111. If further information is needed, it queries the user or the Network Test Manager 124 to obtain the information. Processing of a goal may be paused or suspended to allow processing other goals while tests are being performed.
USER INTERFACE 104--Provides Operator 128 or Domain Expert 101 with prompts, templates, menus, etc. to allow for easy entry of information or queries. The User Interface 104 operates in two modes, Expert Knowledge Acquisition and System Operation, to provide a user friendly environment in which the Domain Expert 101 or Operator 128 may interact with the Expert System 10.
CONTROLLER 112--Schedules and invokes the above modules in appropriate sequence and oversees operation of the Expert Network Diagnostic System (ENDS) 10 generally. Selects one of the posted goals on the Bulletin Board 120 for active status, by examining each goal's priority number as well as its arrival time, so that it can be processed by the Inference Engine 122.
FIG. 3 depicts the overall flow of the operation of the invention. Recall that the invention operates in one of two basic modes: Expert Knowledge Acquisition, and System Operation. The Expert Knowledge Acquisition mode is used to allow the Domain Expert 101 (or the Operator 128) to enter domain expert knowledge into the system. The System Operation mode is used during actual operation of the system for performing diagnostics functions. For now, let us assume that the knowledge from the Domain Expert 101 has already been entered into the system and that the system is in the System Operation mode. Later, the Knowledge Acquisition process will be treated in detail.
Referring to FIG. 3, when the system is started, it enters a data acquisition phase 130 in which data (in the form of "events" from the Network Manager 24) relating to the performance of the network are reported to the Event Manager 117. In general, these data may represent network malfunctions, as will become clear later. These data are then passed to the Alarm Filter 118 which operates in conjunction with the Network Structure Knowledge Base 110 to perform analysis and filtering functions in a data filter and analysis phase 131. In this phase, the data are filtered by removing redundant data and placed in a form suitable for posting on the Bulletin Board 120.
Once placed on the Bulletin Board 120, the system moves into a diagnostic phase 132 with respect to the data posted on the Bulletin Board 120. In this stage, data having highest priority of all data posted on the Bulletin Board 120 are placed in an active state. A knowledge source associated with the particular type of data is retrieved and operated upon by the Inference Engine 122 in conjunction with the Expert Information Structure 111 (i.e. the knowledge of the Domain Expert 101) and if further information is required to diagnose the problem, the Network Test Manager 124 may be invoked to perform specific tests on, or retrieve further information about, the network via the Network Manager 24.
When such further tests are required, often the process of performing the tests may take a long time or require undesirable interruption of normal network operation. In such cases, the Operator 128 may wish to pause operation on that data until a later time if a manual mode of operation is in use. In an automatic mode of operation, the system may automatically suspend operation on that data until the requested further information or test result is received. While waiting for the test result or further information to be received, the data being processed reenter the data acquisition phase 130. This allows the system to process other data ("events") in the meantime. Resumption of processing of the data takes place at the point where processing was paused or suspended. When the diagnostic phase is completed, the system enters the interpretation phase 133. In this phase, the results of the diagnostics are logged to a printer, a disk file and/or the screen for use by the operator who may be required to take various corrective actions such as ordering repairs or replacement of defective components, contacting the telephone company, etc.
Each phase of the above process is controlled, invoked and scheduled by the Controller 112. The Controller 112 may be thought of as a supervisor of the operation of the system.
The "data" referred to above takes several forms. In general, the data start out as an "event", which as used in this example communications network diagnostic system, is a message either from the Network Manager 24 to the ENDS 10 or from ENDS 10 to the Network Manager 24. Three types of events are sent from the Network Manager 24 to the ENDS 10 via communication line 22. First, a configuration event contains information about the configuration of the network 5. Second, an alarm is a report of a network malfunction. Third, a result event contains the results of a test or query such as a network component test or a query of a database for information. A fourth type of event, a request event, is sent from the ENDS 10 to the Network Manager 24 via communication line 22. It contains a request for a network device to perform a diagnostic test or a request to retrieve information describing a network object's configuration from the data base manager 30.
Of greatest interest at this point is the "alarm" event which can be thought of simply as a report issued from a network object (or Network Manager) indicating that it has detected a possible malfunction or other error condition which needs attention. The data acquisition phase 130 deals with receipt of these alarms (and other events). The Event Manager 117 converts these alarms into a data structure referred to herein as a "record" with a unique name given by the Event Manager 117. During the data filter and analysis phase 131, these records (if not redundant, e.g. two network objects report detecting the same malfunction) are posted to the Bulletin Board 120 at which point they are converted to "goals". That is, it becomes a goal of the system to find the solution to the problem which resulted in the alarm corresponding to these goals. Hereafter, the terms "event", "goal", "alarm" and "record" may be used somewhat interchangeably to represent data structures corresponding to the same "event". The term "node" is used herein to describe flow-chart blocks and hypothesis tree nodes, but those skilled in the art will appreciate that the term "node" is sometimes used in the literature to describe that which is referred to as a "record" herein. Use of the term "record" is intended to minimize confusion and not as a technical limitation for the particular type of data structure used in implementation.
As shown in FIG. 4, goals may be considered to have any of five states in the preferred embodiment of the present invention: posted, active, suspended, paused or dormant. The two remaining states, start and finished, are shown for clarity of explanation. The transition from one state to the next is shown in the state flow diagram of FIG. 4. A start state 134 is defined as a condition where the system is awaiting receipt of an alarm event. When an event is received, it is held in queue at the dormant state 135 until the system can post the event. State 136 (posted) represents goals which are posted on the Bulletin Board 120, after filtering by the Alarm Filter 118, as a result of an alarm event, a test result event or resumption of a user paused event. In this state, when the goal has the highest priority of all goals on the Bulletin Board 120, it is selected for further processing. Once the posted goal has been selected for processing by the Inference Engine 122, the state changes to active state 137. In the event of suspension of processing by the Network Test Manager 124 (e.g. to perform a lengthy diagnostic test), the state changes to suspended state 139 until receipt of a test result event once again results in change to the posted state 136.
While in active state 137, the goal may be paused under certain circumstances by the Operator 128, placing the goal in paused state 140 until resumed by the user, at which point the goal is returned to the posted state 136. (It may be desirable to pause the goal rather than begin a lengthy interruptive test which would disrupt communications.) In active state 137, when diagnostics is completed, processing moves into a finished state 144 with respect to the goal of interest. The finished state 144 of FIG. 4 corresponds roughly to the interpretation phase 133 of FIG. 3. The diagnostic phase 132 of FIG. 3 corresponds to the active, posted, suspended and paused states of FIG. 4. The dormant and posted states 135 and 136 of FIG. 4 correspond roughly to the data acquisition phase 130 and the data filter and analysis phase 131 of FIG. 3.
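By way of illustration only, the goal states and transitions described above for FIG. 4 may be sketched in Python as a small transition table. The trigger names ("event_received", "post", "select" and so on) are hypothetical labels chosen for this sketch and do not appear in the drawings. The discarding of redundant alarms by the Alarm Filter 118 is not modeled.

```python
from enum import Enum


class GoalState(Enum):
    START = 134      # awaiting receipt of an alarm event
    DORMANT = 135    # held in queue until the event can be posted
    POSTED = 136     # on the Bulletin Board, waiting to be selected
    ACTIVE = 137     # being processed by the Inference Engine
    SUSPENDED = 139  # waiting for a test result (Network Test Manager)
    PAUSED = 140     # paused by the Operator
    FINISHED = 144   # diagnostics completed


# (current state, trigger) -> next state. Trigger names are hypothetical.
TRANSITIONS = {
    (GoalState.START, "event_received"): GoalState.DORMANT,
    (GoalState.DORMANT, "post"): GoalState.POSTED,
    (GoalState.POSTED, "select"): GoalState.ACTIVE,
    (GoalState.ACTIVE, "suspend"): GoalState.SUSPENDED,
    (GoalState.SUSPENDED, "test_result"): GoalState.POSTED,
    (GoalState.ACTIVE, "pause"): GoalState.PAUSED,
    (GoalState.PAUSED, "resume"): GoalState.POSTED,
    (GoalState.ACTIVE, "diagnosis_complete"): GoalState.FINISHED,
}


def next_state(state, trigger):
    """Return the state reached from `state` on `trigger`, or raise ValueError."""
    try:
        return TRANSITIONS[(state, trigger)]
    except KeyError:
        raise ValueError(f"no transition from {state.name} on {trigger!r}") from None
```

Note that both the suspended and the paused states return to the posted state rather than directly to the active state, so a resumed goal must again be selected by priority before processing continues.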
The system may be implemented or operated (for example by selection from a menu) in any of three modes during System Operation according to the preferred embodiment. The modes are Automatic, Manual and Semi-Automatic. To understand the rationale for these modes, let us digress briefly to a discussion of the data communication environment.
In the data communication and network management environment, it must be remembered that many alarms are often of negligible importance due to transient phenomena. Further, it should be noted that alarms are often produced due to degradation of communication, as for example in the case of a marginal transmission line which is affected by changes in weather which make it impossible to transmit at the highest data rates over such lines. In this example, a 19.2 Kbps modem might be forced to reduce its data rate to 16.8 Kbps in order to cope with the poor transmission line without introducing transmission errors. Such a rate change is normally prompted by an increased error rate or retransmission rate at the higher data rate and often causes an alarm to be sent to the Network Manager 24. In order to diagnose this problem, it might be necessary to perform interruptive tests, i.e. tests which interrupt communication such as loop-back tests.
Consider the case of a financial institution operating during normal business hours. If the above modem is serving all of the tellers in the financial institution, the lowering of the data rate to 16.8 Kbps may go unnoticed, depending upon the work load at the time. Worst case, such a data rate change will slow down response time of the tellers' terminals to some degree. If an interruptive diagnostic test is run, the modem must be taken completely out of service for a period of time varying from several minutes to much longer in order to diagnose the problem. If this were done, the tellers would be completely unable to process transactions and customer lines would back up until service was restored.
Obviously, in the above scenario poor weather is slowing down communication, but little can be done about it without disrupting service. Since the disruption of service for diagnostics would be more damaging to the day to day transaction of business than simply living with the decreased data throughput for a while, it is desirable not to implement interruptive diagnostic tests at this time. It is, however, possible that there are non-interruptive tests which might pinpoint the problem to a transmission line which needs service. Such tests might be readily run without disrupting business and might lead to correction of the problem.
By allowing the three separate modes of operation, such situations can be dealt with, in a manner least disruptive to business, by the Expert Network Diagnostic System (ENDS) 10 of the present invention. In the manual mode, all tests or actions are individually under the control of the operator at all times so that all tests must be ordered by the Operator 128 before they can be performed. In the automatic mode, all tests are automatically performed by the Expert System 10 (ENDS) regardless of whether or not the test will result in an interruption. In the above environment, the automatic mode might be invoked during evenings and/or weekends when user data traffic is low or nonexistent. The tests can then be logged by the system for examination by the Operator 128 during working hours. In this manner, problems such as the above can be detected and corrective action taken at times when business will be minimally disrupted. The third mode, Semi-Automatic, may generally be defined as anything in between. In the preferred embodiment, the Semi-Automatic mode is designed so that non-interruptive tests are automatically performed while interruptive tests require consent or direction by the Operator 128. In this manner, the Operator 128 can make a judgment as to whether or not the problem is severe enough, given the circumstances (time of day, day of week, work load, etc.), to warrant a disruption of service for diagnostic testing.
In the manual mode or semi-automatic mode, if the test is interruptive, the user will be presented with a question or instruction to perform a particular test in the preferred embodiment, such as:
PERFORM A SELF TEST OF MODEM 39.
[blank in which the Operator enters the answer]
At this point, the Operator 128 can either pause the process if it is not an appropriate time to perform the test, or perform the test and enter the answer in the blank. If the Operator 128 pauses the test, he can later return to enter the answer and continue the diagnostic process.
To relate the various modes to FIG. 4, the suspended state is entered by the Network Test Manager 124 in the automatic mode. The paused state is entered by the Operator 128 in the manual mode. A hybrid of these is used in the Semi-Automatic mode depending upon whether or not the test is interruptive.
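The rule governing when a test requires the Operator 128 may be summarized, as a minimal sketch only, by the following Python function. The names Mode and requires_operator are hypothetical and are introduced here solely for illustration.

```python
from enum import Enum


class Mode(Enum):
    AUTOMATIC = "automatic"
    MANUAL = "manual"
    SEMI_AUTOMATIC = "semi-automatic"


def requires_operator(mode, interruptive):
    """Return True when a test must be ordered or approved by the Operator.

    Automatic: every test runs without the Operator, interruptive or not.
    Manual: every test must be ordered by the Operator.
    Semi-Automatic: only interruptive tests need the Operator's consent.
    """
    if mode is Mode.AUTOMATIC:
        return False
    if mode is Mode.MANUAL:
        return True
    return interruptive
```

For example, requires_operator(Mode.SEMI_AUTOMATIC, interruptive=True) is True, which corresponds to the case in which a loop-back test would take a modem out of service during business hours.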
Referring back to FIG. 2, a more detailed description of the interaction of the various functional blocks follows:
Expert information is entered into ENDS 10 by a Domain Expert 101 (someone with extensive experience in diagnosing problems with this network) in general. Because of the structure of the expert information used by the present invention, the services of a knowledge engineer are typically not needed or are minimal. The Domain Expert 101 uses the terminal 18 to enter the rules and procedures (expert knowledge) he has found effective in diagnosing the network 5. The User Interface 104 facilitates this data entry by providing an interactive user-friendly interface as will be described in more detail under the heading "User Interface". As the User Interface Module 104 receives the data, it stores it in a data structure called the Expert Information Structure 111.
Network structural information is entered into ENDS 10 by the Network Configuration Module 108. The Network Configuration Module 108 uses the Event Manager 117 (described below) to get configuration information from the Data Base Manager 30 within the Network Manager 24. The Network Configuration Module 108 stores a subset of the information in the Network Manager's Data Base Manager 30 in a data structure called the Network Structure Knowledge Base 110. Since the Network Manager 24 can be called upon to retrieve more detailed information if required, the Network Structure Knowledge Base 110 can be less detailed than the Network Manager's Data Base Manager 30. In alternate embodiments, it may be eliminated altogether in favor of the Network Manager's Data Base Manager 30 (e.g. in a hybrid network manager/Expert System embodiment). The Network Configuration Module 108 also uses the Data Base Manager 30 of Network Manager 24 to update the Network Structure Knowledge Base 110 while the ENDS 10 is performing diagnostic functions.
Once the Network Structure Knowledge Base 110 has been initialized, the Operator 128 can use the User Interface 104 to instruct ENDS 10 to perform network diagnostics. The Controller 112 schedules the various ENDS 10 modules. It first invokes the Event Manager 117 to handle communication of "events" between the Network Manager 24 and ENDS 10. All communication between the ENDS 10 and the Network 5 is handled as events.
Upon receiving one of the first three types of events described above (configuration, alarm or result) the Event Manager 117 decodes the event to determine the type of event and attributes such as the sending object, identification number and time received by the Network Manager 24. The Event Manager 117 places these attributes in data structures called records as previously described. Next, the Event Manager adds records with network configuration information to a Configuration Queue 113 associated with Event Manager 117. Records with alarm information are added to an Alarm Queue 114 associated with Event Manager 117. Records with test result information are added to a Response Queue 116 associated with Event Manager 117.
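The decoding and queuing of incoming events described above may be sketched as follows. This is an illustrative Python sketch under stated assumptions and is not the disclosed implementation: the dictionary keys ("kind", "sender", "id", "time"), the Record fields and the "record-N" naming scheme are all hypothetical.

```python
from collections import deque
from dataclasses import dataclass
from itertools import count


@dataclass
class Record:
    name: str           # unique name given by the Event Manager
    kind: str           # "configuration", "alarm" or "result"
    sender: str         # the sending network object
    event_id: int       # identification number
    received_at: float  # time received by the Network Manager


class EventManager:
    def __init__(self):
        self.configuration_queue = deque()  # Configuration Queue 113
        self.alarm_queue = deque()          # Alarm Queue 114
        self.response_queue = deque()       # Response Queue 116
        self._by_kind = {
            "configuration": self.configuration_queue,
            "alarm": self.alarm_queue,
            "result": self.response_queue,
        }
        self._serial = count(1)

    def receive(self, event):
        """Decode an incoming event, build a record and queue it by type."""
        kind = event["kind"]
        if kind not in self._by_kind:
            raise ValueError(f"unexpected incoming event type: {kind!r}")
        record = Record(
            name=f"record-{next(self._serial)}",
            kind=kind,
            sender=event["sender"],
            event_id=event["id"],
            received_at=event["time"],
        )
        self._by_kind[kind].append(record)
        return record
```

A request event is rejected by receive() in this sketch because requests travel in the opposite direction, from the ENDS 10 to the Network Manager 24.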
Upon receiving a request for a diagnostic test from the Network Test Manager 124, the Event Manager 117 encodes the test request into a request event understandable by the Network Manager 24 and sends the event via communication line 22 to the Network Manager 24.
The Controller 112 invokes the Alarm Filter 118. The Alarm Filter 118 takes alarm records from the Alarm Queue 114 and determines whether or not the alarms are redundant. If so, the Alarm Filter 118 deletes the redundant alarms. The Alarm Filter adds records corresponding to non-redundant alarms to a global data structure called the Bulletin Board 120.
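The filtering step may be illustrated by the following Python sketch. The test for redundancy is not specified at this point in the description, so the sketch assumes a caller-supplied function, fault_key, which maps a record to an identifier of the underlying malfunction; two records with the same key are treated as redundant. The function names and the dictionary-based records are hypothetical. In the preferred embodiment the Alarm Filter 118 operates in conjunction with the Network Structure Knowledge Base 110, which a practical fault_key would presumably consult.

```python
from collections import deque


def filter_alarms(alarm_queue, bulletin_board, fault_key):
    """Move non-redundant alarm records from the queue to the bulletin board.

    A record is treated as redundant when its fault key matches that of a
    goal already posted or of a record seen earlier in the same pass.
    Returns the number of records posted.
    """
    seen = {fault_key(goal) for goal in bulletin_board}
    posted = 0
    while alarm_queue:
        record = alarm_queue.popleft()
        key = fault_key(record)
        if key in seen:
            continue  # redundant alarm: delete it
        seen.add(key)
        bulletin_board.append(record)
        posted += 1
    return posted
```

For example, if modems 46 and 50 both report a failure of transmission line 48, only the first of the two records reaches the Bulletin Board 120.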
The Controller 112 then selects the next goal on the Bulletin Board 120 to be processed. Controller 112 makes this determination based on three factors: status (whether the goal is ready to be processed), priority number and the amount of time since the alarm was received by the ENDS 10.
The Domain Expert 101 may determine whether the time factor should operate such that the more recently received goals take priority or whether the least recently received goals take priority. The priority number is assigned by the Domain Expert 101 according to his experience. For example, a lost power alarm from a central multiplexer such as 74 would likely have a higher priority than a high error rate alarm from a point to point connection modem such as 50. Similarly, priority numbers can relate to the physical location of the device. In general, devices located closer to the central site are often more likely to be of higher importance than those at remote sites. In FIG. 1, for example, an alarm from modem 58 should be given a higher priority number, in general, than a similar alarm from modem 62 since a failure at modem 58 would disrupt communications to three terminals whereas a failure at modem 62 would be more likely to only affect terminal 64. Of course, critical paths of communication, as in military environments, can be assigned higher priority than less critical communications links. Those skilled in the art will appreciate that numerous criteria can be used to establish priority numbers including the experience of the Domain Expert 101, business considerations, security considerations or system policies.
In a similar manner, certain alarms may be best handled as most recent having higher priority while other alarms may be best handled as least recent having higher priority. For example, an alarm relating to error rate may be due to a transient phenomenon and older goals may well be discarded altogether or at least given lower priority than more recently received goals. On the other hand, alarms relating to a line failure may be best handled in the order received. Such decisions are preferably left up to the Domain Expert 101.
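The three selection factors (status, priority number and time of arrival) can be sketched as below. This is an illustrative simplification, not the Controller's actual logic: it assumes that a larger priority number is more urgent, that time of arrival only breaks ties between equal priorities, and that a single `newest_first` flag applies to all goals, whereas the text leaves the time policy to the Domain Expert per alarm. The names `Goal` and `select_next_goal` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Goal:
    name: str
    status: str         # "posted" means ready to be processed
    priority: int       # assumption: a larger number is more urgent
    received_at: float  # arrival time of the alarm, in seconds


def select_next_goal(goals, newest_first=False):
    """Return the ready goal with the highest priority, or None.

    Ties on priority are broken by arrival time: the most recent goal
    wins when newest_first is True, the least recent one otherwise.
    """
    ready = [g for g in goals if g.status == "posted"]
    if not ready:
        return None

    def rank(goal):
        time_key = -goal.received_at if newest_first else goal.received_at
        return (-goal.priority, time_key)

    return min(ready, key=rank)


goals = [
    Goal("high error rate", "posted", 2, 10.0),
    Goal("lost power", "suspended", 9, 5.0),   # not ready, so skipped
    Goal("line failure A", "posted", 5, 20.0),
    Goal("line failure B", "posted", 5, 30.0),
]
oldest = select_next_goal(goals)
newest = select_next_goal(goals, newest_first=True)
```

With these sample goals, `oldest` is "line failure A" and `newest` is "line failure B"; the suspended goal is never selected despite its higher priority number.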
The Controller 112 invokes the Inference Engine 122 to process the selected goal in the Bulletin Board 120. The Inference Engine 122 processes the goal by using information from the associated Alarm Record, the Expert Information Structure 111 and the Network Structural Knowledge Base 110 to determine the malfunction which caused the alarm associated with the goal and to remedy the malfunction. If the Inference Engine 122 determines that it needs more information to process the goal, it requests it through the Network Test Manager 124 described below.
The Inference Engine 122 uses the information in the alarm record and the Network Structural Knowledge Base 110 to determine which Expert Knowledge Source applies to the alarm. This may be done using a look-up table maintained in the ENDS 10. As the Inference Engine 122 applies the Expert Knowledge Source, it constructs a tree below the goal node on the Bulletin Board to keep track of what it has done so far in a process which will be referred to as "instantiation" and described later in more detail under the heading "Inference Engine". In so doing, if it determines that it needs more information to complete the reasoning, it can suspend the reasoning and later resume where it left off. If it determines that some action should be taken either to determine or remedy the problem, it sends a request for information to the Network Test Manager 124. The Inference Engine 122 relinquishes control upon determining that it needs more information to diagnose or remedy the problem, or upon exhausting the expert knowledge that applies to the alarm.
The Network Test Manager 124 is called from the Inference Engine 122 with a request for information. The invention takes different action depending upon whether it is operating in manual, semi-automatic or automatic mode. If the invention is operating in a manual mode, the Network Test Manager 124 prints a query to the terminal 18 and returns the response of the user 128. If the invention is in an automatic mode, the module sends the request to the Event Manager 117, which in turn sends the request to the Network Manager 24. The Network Manager 24 then obtains the result and forwards the result to the Response Queue 116 where it can be retrieved by the Event Manager. In the semi-automatic mode the choice of which of the above actions to take depends upon the nature of the test required as described previously in the preferred embodiment.
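The mode-dependent behavior can be sketched as a small dispatch function. This is a hedged illustration: the function name `request_information`, the callback parameters, and in particular the `can_automate` predicate (standing in for "the nature of the test required") are assumptions made for the example.

```python
def request_information(test, mode, ask_operator, send_to_network,
                        can_automate=lambda test: False):
    """Obtain a test result according to the operating mode.

    ask_operator(test)    -- hypothetical callback: query the terminal
    send_to_network(test) -- hypothetical callback: route the request
                             toward the Network Manager
    can_automate(test)    -- assumed predicate for semi-automatic mode
    """
    if mode == "manual":
        return ask_operator(test)
    if mode == "automatic":
        return send_to_network(test)
    if mode == "semi-automatic":
        if can_automate(test):
            return send_to_network(test)
        return ask_operator(test)
    raise ValueError(f"unknown mode: {mode!r}")


asked, sent = [], []


def ask_operator(test):
    asked.append(test)
    return "operator answer"


def send_to_network(test):
    sent.append(test)
    return "network result"


result = request_information(
    "loop-back test", "semi-automatic", ask_operator, send_to_network,
    can_automate=lambda test: test == "loop-back test",
)
```

Passing the two actions as callbacks keeps the sketch independent of how the terminal and the Event Manager are actually reached.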
Finally, the Controller 112 invokes the Network Configuration Module 108 to update the Network Structural Knowledge Base 110 if it has been changed since the last time this module ran, as explained above.
User Interface
1. High Level Menu and Overview
Turning now to FIG. 5, a flow-chart illustrates the high-level operation of the User Interface Module 104. As discussed previously, the module has two distinct purposes: Expert Knowledge Acquisition and System Operation. FIGS. 6 through 16 show the operation of the Expert Knowledge Acquisition functions. FIGS. 17 through 22 show the operation of the System Operation functions. The bottom blocks 156 through 186 of FIG. 5 should be considered procedure labels which are carried over to FIGS. 6-22. In operation, the procedures are shown as menu selections which may be selected by using a pointing device such as a mouse. Other methods of human interface may also be used including direct entry of commands corresponding to the various procedures, as will be appreciated by those skilled in the art. Each of the menu selections (procedures) is discussed briefly below and in more detail immediately following.
ADD NODE 156: Used in the Knowledge Acquisition process by the Domain Expert 101 to add a knowledge node to the Expert Information Structure 111.
MODIFY NODE INFORMATION 158: Used in the Knowledge Acquisition process by the Domain Expert 101 to change the information in an already established knowledge node.
MODIFY NODE TYPE 160: Used in the Knowledge Acquisition process by the Domain Expert 101 to change the knowledge node type.
DELETE NODE 162: Used in the Knowledge Acquisition process by the Domain Expert 101 to remove an already established knowledge node.
COPY NODE 164: Used in the Knowledge Acquisition process by the Domain Expert 101 to produce a new knowledge node which is a copy of an existing knowledge node in order to simplify adding similar knowledge nodes to the Expert Information Structure 111.
DISPLAY KNOWLEDGE SOURCE 166: Used in the Knowledge Acquisition process by the Domain Expert 101 to produce a graphic display of a currently existing (or currently being built) Knowledge Source.
SHOW KNOWLEDGE SOURCE 168: Used in the Knowledge Acquisition process by the Domain Expert 101 to produce a text display of a currently existing (or currently being built) Knowledge Source.
LOAD KNOWLEDGE SOURCE 170: Used in the Knowledge Acquisition process by the Domain Expert 101 to retrieve a saved Knowledge Source from disk storage and load it into working memory.
UPDATE ALARM MAP 171: Used by the Domain Expert 101 in the Knowledge Acquisition process to update a map (table stored in 109) relating alarms to knowledge sources.
SAVE KNOWLEDGE SOURCE 172: Used in the Knowledge Acquisition process by the Domain Expert 101 to save a Knowledge Source to a disk file.
CLEAR KNOWLEDGE SOURCE 174: Used in the Knowledge Acquisition process by the Domain Expert 101 to remove a Knowledge Source from working memory.
CHANGE TEST MODE 175: Used by the System Operator 128 in the System Operation process to select automatic, semi-automatic or manual operation of the Expert System 10.
RUN ONE 177: Used by the System Operator 128 in the System Operation process to invoke the Expert System 10 for a single goal only.
RUN 178: Used by the System Operator 128 in the System Operation process to invoke the Expert System 10.
BULLETIN BOARD STATUS 182: Used by the System Operator 128 in the System Operation process to write the status of each goal on the Bulletin Board 120 to a screen window.
EVENT/ALARM STATUS 183: Used by the System Operator 128 in the System Operation process to write the status of each alarm and event on the Bulletin Board 120 to a screen window. Similar in operation to BULLETIN BOARD STATUS 182 except that different information is displayed in a screen window, and is therefore not discussed in detail.
RESUME 184: Used by the System Operator 128 in the System Operation process to continue processing a paused goal.
EXIT 186: Used by the System Operator 128 in the System Operation process or the Domain Expert 101 in the Knowledge Acquisition process to exit the Expert System operation.
The Domain Expert 101 uses the knowledge acquisition functions to enter, modify and display information in the Expert Information Structure 111. The Expert Information Structure 111 is explained in detail below, but essentially it comprises a number of Expert Knowledge Sources, one associated with each possible type of alarm and tracked in a table. Each knowledge source in turn includes a number of data structures called "knowledge nodes" or "nodes". Each type of knowledge node has a set of attributes associated with it to store characteristics of the node knowledge such as its name, its type, and the nodes to which it points. The Domain Expert 101 defines a knowledge node by assigning values to its attributes.
The program flow for User Interface Module 104 begins at start block 150. Block 152 determines which function the Domain Expert 101 wants to invoke by reading the selection the Domain Expert 101 makes from a menu, preferably using a pointing device. Block 154 corresponds to a "case" command and selects the next block based on the selection of the Domain Expert 101: Add Node 156, Modify Node Information 158, Modify Node Type 160, Delete Node 162, Copy Node 164, Display Knowledge Source 166, Show Knowledge Source 168, Load Knowledge Source 170, Update Alarm Map 171, Save Knowledge Source 172, Clear Knowledge Source 174, Change Test Mode 175, Run One 177, Run 178, Bulletin Board Status 182, Event/Alarm Status 183, Resume 184 and Exit 186. Based on the Domain Expert's selection, program control flows to one of the procedures described by the flow-charts of FIGS. 6 through 22.
2. Expert Knowledge Acquisition
The process of acquiring the Expert Knowledge is described in detail in conjunction with the flow-charts of FIGS. 6-16.
The flow-chart in FIG. 6 illustrates User Interface 104 operation after selection of the Add Node procedure 156. Block 190 gets the name of the knowledge node that the Domain Expert 101 wants to add. It does so by, for example, allowing the user to type a name or to select one of the nodes on the list of nodes which are referred to by other nodes but not yet defined (the list of undefined nodes) using the terminal's cursor or a pointing device such as a mouse. Block 190 would preferably only accept typed names if there were no existing defined nodes by that name using conventional error trapping techniques. Next, 192 gets the type of node the Domain Expert 101 wants to add. One way it could do so is by displaying a list of the node types and allowing him to select one. Other techniques will occur to those skilled in the art. Block 194 calls a function to get the remaining attributes (those other than name and type) of the node being added. The function prompts the user for the remaining attributes of a node of the selected type which the user enters by conventional means.
In the preferred embodiment, this is done by presenting the Domain Expert 101 with a template of node attributes to be filled in. Since different node types require different sets of attributes, different templates may be presented for different node types. Using Object-Oriented Programming, such individual templates may inherit appropriate attributes from a generic template.
To help ensure the integrity of the knowledge source, two stacks are maintained which may be displayed to the Domain Expert 101 whenever Add Node 156 is selected. The first stack shows the node names for nodes which already exist. The second stack shows the node names for nodes which are pointed to by existing (defined) nodes but which have not yet been defined themselves. In this manner, the Domain Expert 101 always has easy reference to nodes which must be defined.
When Add Node procedure 156 is selected, the Domain Expert 101 is presented at 192 with a list of node types from which to select. Valid node types for HYPOTHESIS TREE nodes in the present embodiment are: KS node, AND node and OR node. KS nodes (knowledge source nodes) are OR nodes which are the first node of a knowledge source and serve to identify the knowledge source. AND nodes are satisfied when all children of the node are satisfied while OR nodes are satisfied when any one of the node's children is satisfied, as would be expected applying conventional rules of logic. Other types of nodes may be defined in other embodiments.
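The AND/OR semantics of the hypothesis tree nodes can be sketched as a recursive evaluation. The sketch is illustrative only: the class `HypothesisNode`, the function `is_satisfied`, and the `satisfied` flag on leaves (a stand-in for whatever outcome the attached flow-chart would produce) are assumptions, not structures named in the text.

```python
from dataclasses import dataclass, field


@dataclass
class HypothesisNode:
    name: str
    node_type: str                    # "KS", "AND" or "OR"
    children: list = field(default_factory=list)
    satisfied: bool = False           # consulted only for leaf nodes


def is_satisfied(node):
    """Evaluate a hypothesis tree using conventional AND/OR logic."""
    if not node.children:
        return node.satisfied
    results = (is_satisfied(child) for child in node.children)
    if node.node_type == "AND":
        return all(results)
    if node.node_type in ("OR", "KS"):  # a KS node behaves as an OR node
        return any(results)
    raise ValueError(f"unknown node type: {node.node_type!r}")


# Hypothetical tree: one knowledge source with two candidate causes.
root = HypothesisNode("alarm", "KS", [
    HypothesisNode("cause-1", "OR", satisfied=False),
    HypothesisNode("cause-2", "AND", [
        HypothesisNode("symptom-a", "OR", satisfied=True),
        HypothesisNode("symptom-b", "OR", satisfied=True),
    ]),
])
```

Here `is_satisfied(root)` is true because both children of the AND node `cause-2` are satisfied, which in turn satisfies the KS (OR) root.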
For flow chart type nodes, valid node types in the present embodiment are: TEST node, CONCLUDE node, CALL node, CONFIRM node, FACT node, COUNT node, and DELAY node. Other types of nodes may be defined in other embodiments.
TEST nodes are nodes which require a test to be performed by the user or network manager 24 to obtain information for the Expert System 10. All nodes subsequent to a test node depend upon the outcome of the test node.
CONCLUDE nodes are nodes which generally terminate a flow-chart and provide a solution to the lowest level hypothesis for which the current flow-chart is a leaf.
CALL nodes are nodes which point to HYPOTHESIS TREE nodes in other branches of the HYPOTHESIS TREE and are useful to reduce need for redundant portions of HYPOTHESIS TREES. Thus, in addition to hypothesis tree nodes leading to flow-chart nodes in the present invention, flow-chart nodes may also lead to hypothesis tree nodes.
CONFIRM nodes are nodes which return conclusion values to the lowest level hypothesis node which leads to the flow-chart, that is, they provide the solution to the hypothesis.
FACT nodes are nodes which simply print or display a fact when instantiated.
COUNT nodes are nodes which increment (or decrement) a counter each time the COUNT node is instantiated.
DELAY nodes are nodes which impose a delay period selected by the Domain Expert 101 each time the node is instantiated.
The function of block 194 is illustrated by the flow-chart in FIG. 7 and is described below in more detail. Control then passes to block 196 which determines whether the name of the node is on the list (stack) of undefined nodes. If not, 197 returns control to the User Interface 104. Otherwise, 198 removes the name from the list of undefined nodes. Block 200 then updates all pointers to this node. The pointers are updated by finding all the defined nodes which have attributes pointing to this node and setting those attributes to the address of the current node. Finally, 202 transfers control back to step 152 of FIG. 5.
The flow-chart in FIG. 7 illustrates how the Knowledge Acquisition module 104 assigns attributes to a knowledge node. Block 212 determines whether there are more attributes of the current node type to assign. If not, 214 returns control to the function which called it. Otherwise, 216 gets the value of the next attribute from the Domain Expert 101 by, for example, displaying the name of the attribute and allowing the Domain Expert 101 to type in the value at terminal 18. Block 218 then determines whether the value is a pointer to another node. If not, control returns to 212. Otherwise, 224 determines whether the node pointed to (called "pointed-to") has been previously defined by checking the list of defined nodes. If so, 226 sets the attribute to the address of the "pointed-to" node and control goes to 212. Otherwise, 228 sets the value of the attribute to nil. Block 230 then determines whether the "pointed-to" node is on the list of undefined nodes. If so, control returns to 212. Otherwise, 232 adds "pointed-to" to the list of undefined nodes and control returns to 212. The process proceeds as described above until a no answer is received at step 212 after which control passes back to the procedure that called it.
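The bookkeeping of FIGS. 6 and 7 (defined nodes, undefined nodes, and the later patching of pointers) can be sketched as follows. This is an assumed data layout, not the one used by the system: nodes are plain dictionaries, and a `pending` table records which attribute of which node referred to a not-yet-defined name so that the pointer can be filled in later. The class and method names are hypothetical.

```python
class KnowledgeSource:
    def __init__(self):
        self.defined = {}  # node name -> node (a dict of attributes)
        self.pending = {}  # undefined name -> [(referring node, attribute)]

    @property
    def undefined(self):
        """Names referred to by defined nodes but not yet defined."""
        return list(self.pending)

    def add_node(self, name, node_type, attributes, pointer_attrs=()):
        """Define a node; pointer_attrs lists attributes holding node names."""
        if name in self.defined:
            raise ValueError(f"node {name!r} is already defined")
        node = {"name": name, "type": node_type}
        for attr, value in attributes.items():
            if attr not in pointer_attrs:
                node[attr] = value                 # ordinary attribute
            elif value in self.defined:
                node[attr] = self.defined[value]   # point at existing node
            else:
                node[attr] = None                  # nil until defined
                self.pending.setdefault(value, []).append((node, attr))
        self.defined[name] = node
        # If earlier nodes were waiting for this name, patch their pointers.
        for referrer, attr in self.pending.pop(name, []):
            referrer[attr] = node
        return node


ks = KnowledgeSource()
parent = ks.add_node("p-to-p-central-rlf", "OR",
                     {"child": "bad-phone-line"}, pointer_attrs=("child",))
undefined_before = ks.undefined
child = ks.add_node("bad-phone-line", "OR", {})
```

After the first call the name `bad-phone-line` is on the undefined list and the parent's pointer is nil; defining that node removes the name from the list and fills in the pointer.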
The flow-chart in FIG. 8 illustrates User Interface 104 operation after selection of Modify Node Information 158. First, 266 gets the name of the knowledge node the Domain Expert 101 wants to modify. It does so by, for example, displaying a list of the names of the defined nodes and allowing the Domain Expert 101 to select one. Control then passes to Block 267 where the type of node is determined by the previously defined or existing node type. Block 268 then retrieves the remaining attributes of the node by invoking the Get Node Attributes procedure 210 to allow the Domain Expert 101 to change any of the attributes except the node name and type. Ideally, the default values of the attributes are the original values. This process is illustrated by FIG. 7 and explained above. Finally, 270 returns control to the User Interface by transferring control back to step 152 of FIG. 5.
The flow-chart in FIG. 9 illustrates User Interface 104 operation after selection of Modify Node Type procedure 160. A separate procedure for modifying the knowledge node type is provided and not integrated with the Modify Node Information procedure 158 in the preferred embodiment. At step 284 the procedure gets from the Domain Expert 101 the name of the node whose type is to be changed. As before, for example, it might display a list of the names of defined nodes and allow the Domain Expert 101 to select one. Control then passes to step 288 which gets the new node type from the Domain Expert 101. It could do this by, for example, displaying a list of the node types and allowing the Domain Expert 101 to select one. Control then goes to step 294 which calls the Get Node Attributes procedure 210 to get the attributes (other than name and type) of the node from the Domain Expert 101. Finally, 296 transfers control to the main User Interface program at step 152.
The flow-chart in FIG. 10 illustrates User Interface 104 operation after selection of Delete Node procedure 162. First, 310 gets the name of the knowledge node to delete from the Domain Expert 101 by, for example, displaying the names of defined nodes and allowing him to select one. Second, 318 removes the name chosen from the list of defined knowledge nodes. A verification process may be included in this step to help assure that nodes are not accidentally deleted. Block 320 then determines whether any attributes of the other nodes point to the node. If not, 332 returns control to the User Interface 104. Otherwise, 324 deletes all such attributes to ensure that no nodes point to the deleted node. Finally, 328 transfers control to 152.
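The Delete Node steps can be sketched as below. The dictionary-based node layout and the function name `delete_node` are assumptions made for illustration; the sketch also omits the optional verification step mentioned above.

```python
def delete_node(defined, name):
    """Remove a node and every attribute that points to it.

    defined maps node names to node dicts; a pointer attribute holds
    the node dict it points to (assumed layout for this sketch).
    """
    target = defined.pop(name)
    for node in defined.values():
        dangling = [attr for attr, value in node.items() if value is target]
        for attr in dangling:
            del node[attr]  # ensure nothing points to the deleted node
    return target


bad_remote = {"name": "bad-remote", "type": "OR"}
central = {"name": "p-to-p-central-rlf", "type": "OR", "child": bad_remote}
defined = {central["name"]: central, bad_remote["name"]: bad_remote}
delete_node(defined, "bad-remote")
```

After the call, `central` no longer has a `child` attribute and `bad-remote` is gone from the table of defined nodes.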
The flow-chart in FIG. 11 illustrates User Interface 104 operation after selection of Copy Node procedure 164. First, block 340 gets from the Domain Expert 101 the name of the knowledge node to copy (called the source) by, for example, displaying a list of the defined nodes and allowing him to select one. Second, 346 gets from the Domain Expert 101 the name of the node to which the information will be copied (called the target). It might do this by requesting the Domain Expert 101 to type a name, and accepting the name if there are no existing defined nodes by that name. Third, 354 sets the values of the target attributes to the values of the corresponding source attributes. Fourth, 355 calls the Get Node Attributes function 210 to enable the Domain Expert 101 to change the attributes of the target (except node type). Block 356 then determines whether the name is on the list of undefined nodes. If not, 358 transfers control to step 152 of FIG. 5. Otherwise, 360 removes the name of the target from a stack list maintained by the system which keeps track of all undefined nodes (the undefined stack list). Block 362 then updates the pointers to the target. It does so by finding all the defined nodes which have attributes pointing to the target and setting those attributes to its address. Finally, 364 transfers control to step 152 of FIG. 5.
This Copy Node procedure 164 is designed to facilitate data entry by allowing easy entry of data for new nodes by in effect duplicating existing nodes. The attributes differing from the node being duplicated may then be modified since the procedure automatically calls the Get Node Attributes routine 210. Other such procedures for simplifying data entry when creating nodes will occur to those skilled in the art.
The User Interface provides two mechanisms for displaying the knowledge source: Display Knowledge Source procedure 166 and Show Knowledge Source procedure 168. Upon selection of Display Knowledge Source 166, the User Interface 104 presents a graphical representation of the Expert Information Structure 111, while selection of Show Knowledge Source 168 provides a text output of the defined nodes. In the Display Knowledge Source procedure 166, the system displays, for example, a diagram such as that in FIG. 12. The circles (shown as ovals in the figure) represent nodes. Text inside of or adjacent to the nodes is used to identify the node or operation being displayed. In the example of FIG. 12, the display shows a receive line failure (RLF) for the example network of FIG. 1 with the associated nodes. The diagram shows nodes as ovals with the node names written in the ovals. The nodes pointed to by a node are drawn below the node and connected to it with a line with a link name associated with it. In other embodiments, the node names may be used to represent the nodes with lines interconnecting the names (i.e. no circles representing nodes). Similarly, the display may flow left to right, etc., rather than top to bottom and information may be in symbolic or abbreviated form as required.
The circle 366 represents the alarm node associated with the knowledge base. The text in the circle is the name of the node. Node 366 points to nodes 368, 370 and 372 which represent three possible causes of the alarm. The lines connecting the nodes are indicated as OR links indicating that any of these nodes alone could result in the RLF Alarm 366.
For example, the node drawn as 366 is an OR node whose name, "RLF Alarm" (RLF stands for Remote Line Failure), is written inside it. Because the node is an OR node, its truth value is true if at least one of the nodes to which it points is true. It points to nodes named p-to-p-remote-rlf, p-to-p-central-rlf and multidrop-central-rlf. These are represented as 368, 370 and 372.
Each of the nodes 368, 370 and 372 may be further broken down into other OR or AND links as shown in connection with node 370. Node 370 includes four nodes below it which are shown as OR functions meaning that node 370 is true if any one of these four nodes is true, namely: bad phone line node 374, bad remote 376, bad central 378 or transient phenomenon 380. Each of these four nodes may point to other nodes or processes as illustrated by node 380 which points to a short process represented as a flow-chart. This flow-chart first determines if the central modem responds at 382 and if so, an automatic test of DCD (Data Carrier Detect) is performed at 384. If not, a manual test of DCD is indicated by step 386.
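The short flow-chart below node 380 can be sketched as a walk over TEST nodes. This is an assumed representation: nodes are dictionaries, each TEST node maps an outcome to the next node, and the two DCD tests are modelled as TEST nodes with no outgoing branches because the text does not say what follows them. The function name `run_flow_chart` is hypothetical.

```python
def run_flow_chart(node, answer):
    """Walk a flow chart and return the names of the nodes visited.

    answer(name) supplies the outcome of each TEST node, whether it
    comes from the operator or from the network manager.
    """
    visited = []
    while node is not None:
        visited.append(node["name"])
        if node["type"] == "TEST":
            node = node["branches"].get(answer(node["name"]))
        else:
            node = node.get("next")
    return visited


automatic_dcd = {"name": "automatic DCD test", "type": "TEST", "branches": {}}
manual_dcd = {"name": "manual DCD test", "type": "TEST", "branches": {}}
responds = {
    "name": "central modem responds?",
    "type": "TEST",
    "branches": {"yes": automatic_dcd, "no": manual_dcd},
}

path_yes = run_flow_chart(responds, lambda name: "yes")
path_no = run_flow_chart(responds, lambda name: "no")
```

A "yes" outcome at the first test leads to the automatic DCD test and a "no" outcome to the manual one, mirroring steps 382, 384 and 386.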
Display Knowledge Source procedure 166 may be implemented in a conventional manner using the standard drawing tools commercially available in the Expert System Programming Environment 16 such as those in Carnegie Group's Knowledge Craft™ product or using known CAD (computer aided drafting) techniques. The information for creating such a display is readily available from the knowledge base 109 which has information relating each of the nodes to each other as well as the attributes of each node. This display capability provides the user or expert an easily grasped representation of what procedure is indicated by a particular alarm.
The flow-chart in FIG. 13 illustrates User Interface 104 operation after selection of the Show Knowledge Source procedure 168. Block 386 prints the list of defined knowledge nodes of the currently loaded knowledge source to the terminal's display 18 (or to the printer 20). Block 388 then similarly prints the list of undefined nodes, that is, nodes which have been referred to but which have not yet been defined. Finally, 390 transfers control back to step 152 of FIG. 5.
The flow-chart in FIG. 14 illustrates User Interface 104 operation after selection of the Load Knowledge Source procedure 170. First, step 400 gets the name of the knowledge source to load. It might do so by, for example, displaying a list of knowledge sources and allowing the Domain Expert 101 to choose one. Next, 402 loads the selected knowledge source into memory. Finally, 404 transfers control back to 152 of FIG. 5.
The procedure Update Alarm Map 171 is used by the Domain Expert 101 to update a map or table relating alarm type to knowledge sources. The present invention may use several knowledge sources broken up as the Domain Expert 101 sees fit. In order to process an alarm, the alarm is first related to an appropriate knowledge source by reference to this table. When the Domain Expert 101 wishes to add a new alarm or knowledge source, he updates this table to properly relate the alarm with the appropriate knowledge source. The table may be updated using conventional table update methods.
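The alarm map can be pictured as a simple lookup table. In the sketch below a Python dictionary stands in for the table; the function names and the sample knowledge source name are illustrative assumptions.

```python
def update_alarm_map(alarm_map, alarm_type, knowledge_source_name):
    """Relate an alarm type to the knowledge source that handles it."""
    alarm_map[alarm_type] = knowledge_source_name


def find_knowledge_source(alarm_map, alarm_type):
    """Return the knowledge source for an alarm type, or None if unmapped."""
    return alarm_map.get(alarm_type)


alarm_map = {}
update_alarm_map(alarm_map, "RLF", "rlf-alarm")  # hypothetical entry
source = find_knowledge_source(alarm_map, "RLF")
```

An alarm type with no entry yields `None` in this sketch, which a caller would need to treat as "no applicable knowledge source".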
The flow-chart in FIG. 15 illustrates User Interface 104 operation after selection of the Save Knowledge Source procedure 172. First, block 410 gets the name of the knowledge source to save by allowing the Domain Expert 101 to type a file name. Next, 412 writes the knowledge source, list of defined nodes and list of undefined nodes to a file with that name. Finally, 414 returns control to the User Interface 104 at step 152.
The flow-chart in FIG. 16 illustrates User Interface 104 operation after selection of Clear Knowledge Source 174. Block 420 clears the current knowledge source from memory, and 422 transfers control to 152.
3. System Operation
FIGS. 17 through 22 illustrate the operation of the system operation functions of the User Interface 104. The Operator 128 uses the system operation functions to set the operation of the ENDS 10 to manual, semi-automatic or automatic mode, run ENDS, display the Bulletin Board 120, resume inferencing paused goals and exit ENDS 10.
The flow-chart in FIG. 17 illustrates User Interface 104 operation after selection of the Change Test Mode procedure 175. First, block 434 gets the mode of operation by allowing the Domain Expert 101 to select from manual, automatic or semi-automatic. The nature of the three modes has been explained previously. At step 436, a global variable called "mode" is updated to reflect the selection made in step 434. Finally, 438 returns control to the User Interface 104 at step 152 of FIG. 5.
FIG. 18 illustrates the operation of the User Interface 104 after selection of Run 178. Block 450 initializes the structures and variables used during operation of the system. Block 452 then invokes the Controller 112, which schedules the modules of the system as illustrated in FIG. 2. The operation of the Controller 112 is described in a later section.
FIG. 19 illustrates the operation of the User Interface 104 after selection of Run One 177. Block 456 initializes the structures and variables used during operation of the system. Block 458 then invokes the Controller to process the next goal only. Finally, 460 transfers control to step 152 of FIG. 5.
FIG. 20 illustrates the operation of the User Interface 104 after selection of the Display Bulletin Board Status procedure 182. First, 470 clears a display window set up on the screen for the Bulletin Board Display in a conventional manner. Next, 472 writes the name of each goal on the Bulletin Board 120 along with its status and priority. Goal status and priority are attributes associated with the goal which are used to determine the order of processing the particular goal as explained previously. Block 474 then transfers control to step 152 of FIG. 5.
FIG. 21 illustrates the operation of the User Interface 104 after selection of the Resume procedure 184. This procedure is used to resume operation on a goal which has been paused. A goal is assigned one of several status states (as was described in more detail in connection with FIG. 4) including "posted" indicating that the goal is ready to run, "active" when the goal is actually being processed, "paused" when the goal has been paused by the user in manual or semi-automatic mode, and "suspended" when the goal has been running but has been temporarily stopped by the system in automatic or semi-automatic mode as when, for example, further information is required to complete processing a goal. A typical example is when a test, such as a loop-back test, must be performed in order to proceed with processing the goal tree. While this test is being performed, the goal is placed in the suspended status. While the goal is suspended, the ENDS 10 retrieves the next goal from the Bulletin Board 120 for processing so that the system does not have to operate in a completely sequential manner. A goal may of course also be assigned a status of "paused" in manual mode as described in conjunction with FIG. 4. Operation of the Resume procedure 184 resumes processing of the paused goal.
When the Resume process 184 is invoked, block 490 first gets the name of the goal to resume. It does so by, for example, listing all paused goals and allowing the user to select one. Second, 492 changes the status of the selected goal to "posted", to indicate that it is ready to run. When the Inference Engine 122 is available and the `resumed` goal is next in line for processing, processing will resume where it left off. Note that if a higher priority goal is on the Bulletin Board 120, it will be processed first even though the `resumed` goal was interrupted. This is because the goal is resumed by placing it back in the normal queue for processing. A marker is associated with the `resumed` goal at the time it is "paused" so that the place where processing is to resume is readily determined by the Inference Engine 122. Finally, 494 transfers control to step 152 of FIG. 5.
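The pause and resume bookkeeping can be sketched as follows. The status names come from the text; everything else is an assumption for illustration, including the `Goal` class, the representation of the marker as an integer step index, and the function names.

```python
from dataclasses import dataclass


@dataclass
class Goal:
    name: str
    status: str = "posted"  # "posted", "active", "paused" or "suspended"
    marker: int = 0         # hypothetical: index of the next step to run


def pause(goal, next_step):
    """Pause a goal and record where processing should resume."""
    goal.status = "paused"
    goal.marker = next_step


def resume(goals, name):
    """Return a paused goal to the normal queue by marking it "posted"."""
    for goal in goals:
        if goal.name == name:
            if goal.status != "paused":
                raise ValueError(f"goal {name!r} is not paused")
            goal.status = "posted"  # the marker is kept for later use
            return goal
    raise KeyError(name)


goals = [Goal("rlf-modem-58"), Goal("rlf-modem-62")]
pause(goals[0], next_step=3)
resumed = resume(goals, "rlf-modem-58")
```

Resuming only changes the status; the goal still waits its turn, so a higher priority goal on the Bulletin Board would be processed before it.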
FIG. 22 illustrates the operation of the User Interface 104 after selection of Exit 186. Block 510 writes all the data on the Bulletin Board 120 to a log file. The name of the file may be a default file name or may be individually selected by the user at the time of exiting. Block 512 then halts operation. To begin operation again, the user must invoke the Expert System 10 again.
4. Sample Session (Manual Mode)
In order to get a better idea of the operation of the system, the following Table 1 is a sample session conducted in the Manual Mode of operation. Operator input is shown in underlined normal text while output from the Expert System 10 is shown in bold.
TABLE 1
______________________________________
Sample Session
______________________________________
An RLF alarm has been received for CENTRAL modem on a MULTIDROP line.
This alarm may be caused by:
  A Transient Condition,
  A mis-strapped Central unit,
  A defective diagnostics board.
Testing for a Transient alarm condition.
Are all devices on this line currently active? ##STR2##
Clear this alarm and monitor the system for 2 minutes.
Did the Alarm occur again? ##STR3##
The RLF Alarm is still present. Continuing diagnostics!
*Testing for an improperly strapped CENTRAL unit.
Does the CENTRAL unit have hardware straps? ##STR4##
Begin checking CENTRAL unit's DCD strapping.
Is the CENTRAL unit strapped for constant DCD? ##STR5##
A CENTRAL unit on a MULTIDROP line should always be strapped for SWITCHED DCD.
This CENTRAL unit is improperly strapped. Correct strapping.
______________________________________
Expert Information Structure
Because of the complexity of data communication network diagnostics, it may be easier to understand the Expert Information Structure 111 with an example from a more familiar subject. FIG. 23 therefore illustrates a block diagram of a possible use of the structure for an automobile diagnostic system. This example system includes three hypothetical diagnostic machines, each of which is connected to an automobile which is the diagnostic target for this example. Diagnostic Machine A 514 is connected to automobile A 516 via connection 517. Diagnostic Machine B 518 is connected to automobile B 520 via connection 522. Diagnostic Machine C 524 is connected to automobile C 526 via connection 528. Each of the three diagnostic machines is then connected to the Expert Auto Diagnostic System 530 via connections 532, 534 and 536 respectively which may, for example, be wireless radio links in this hypothetical case. Each diagnostic machine monitors the operation of the car it is connected to and sends an alarm to the Expert Auto Diagnostic System 530 upon discovering a problem. This hypothetical network has similarities to the network diagnostic environment in that there are certain tests which should not be invoked automatically except when certain conditions prevail. For example, it might be dangerous to invoke certain brake tests or transmission tests at highway speeds without the driver's consent. A set of criteria could be developed for this example which would dictate the ability of the diagnostic system to automatically initiate tests. These hypothetical diagnostic machines are assumed to have the capability of performing tests on the car and correcting certain problems subject to such constraints.
FIG. 24 shows an Expert Information Structure for the Expert Auto Diagnostic System 530. The information in the example is for instruction as to the operation of the expert information structure only. It should be noted that this is an overly simplified contrived example and is not necessarily an accurate diagnostic procedure for troubleshooting an automobile radiator. It is nonetheless an easier environment for many to grasp than that of a complex data communications environment and will thus be used herein for instructional purposes.
The User Interface organizes the nodes defined by the Domain Expert into a data structure called an Expert Information Structure, such as that pictured in FIG. 24. The structure combines the simplicity and efficiency of flow-chart based knowledge representation with the hierarchical organization of hypothesis tree-based knowledge representation.
The hypothesis tree is a hierarchical structure of hypotheses, each of which is associated with a symptom or possible failure. The structure is processed by using a process of elimination among hypotheses. The Domain Expert creates such an Expert Information Structure based upon his expert knowledge of the workings of the system to be diagnosed. The Domain Expert may start the process, for example, by isolating each different type of malfunction (alarm) which must be analyzed. Each of these alarms, or certain groups of such alarms, may lend itself to creation of an individual knowledge source. In FIG. 24, four such types of malfunction are shown explicitly (542, 544, 546, 547) and others are suggested implicitly by the arrows leading from node 540. The Domain Expert may select any number of knowledge sources to construct so that the problem is broken down into a manageable size. He then constructs a hypothesis tree for each type of malfunction and proceeds to divide each hypothesis into further hypotheses (e.g. 548, 549, 550) until a leaf in the hypothesis tree is reached. At the leaf, there may be questions to be answered or test procedures to be followed. The hypothesis tree then converts to a flow-chart format (e.g. 562 down, 551 down) in order to obtain the needed information. Such flow-chart operations may call other flow-charts or other hypothesis tree nodes in the same (or another) hypothesis tree in order to satisfy the hypothesis. This hybrid of the hypothesis tree and the flow-chart is referred to herein as the Expert Information Structure or the Structured Flow Graph. Each knowledge source so constructed is related to the particular type of alarm through a knowledge source table.
At the top of the tree of FIG. 24 is a root node 540 which is pointed to by no other knowledge nodes. The root node 540 points to nodes such as 542, 544, 546 and 547 representing all possible alarms. The latter nodes represent the high level hypotheses. These are confirmed or rejected by first decomposing them into "sub-hypotheses", which are represented by the knowledge nodes pointed to by each alarm. Each of these nodes in turn may point to nodes representing the sub-hypotheses which could have caused it. The pattern can continue until the hypotheses can be most easily confirmed or eliminated by the answer to a question asked of the user, the response to a test sent to the Network Test Manager or the application of deterministic, procedure-oriented testing processes which may include gathering information from the user or a device analogous to the Network Manager. Such processes are represented by flow-charts.
In other words, the root node 540 represents the most general malfunction: that there has been a malfunction. Leaves represent the more specific malfunctions. The system uses the structure for reasoning as follows. Upon receiving an alarm, the system processes the sub-tree having the node associated with the alarm as its root. The goal of the system is then to determine the nature of the problem and then to resolve the problem. Each alarm node points to the possible causes of the associated alarm. The system determines whether each cause occurred by determining whether the nodes that the cause node points to have occurred. This pattern continues through all branches and leaves of the tree. Whether a malfunction/resolution at the leaf level occurred is determined by carrying out the procedures in the flow-chart to which the leaf points.
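The reasoning pattern just described can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation of the embodiment: the `Node` class, the `occurred` function and the stand-in leaf procedures are hypothetical names, and the sketch assumes that an interior node is true when at least one of the nodes it points to is true.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Node:
    """Hypothetical knowledge node: either points to sub-hypotheses or to a flow-chart."""
    name: str
    points_to: List[str] = field(default_factory=list)
    flow_chart: Optional[Callable[[], bool]] = None  # stand-in for a leaf's flow-chart


def occurred(name: str, tree: Dict[str, Node]) -> bool:
    """Decide whether the hypothesis `name` occurred by walking its sub-tree."""
    node = tree[name]
    if node.points_to:
        # Interior node: assumed true if any node it points to occurred.
        return any(occurred(child, tree) for child in node.points_to)
    if node.flow_chart is not None:
        # Leaf node: carry out the procedures of the flow-chart it points to.
        return node.flow_chart()
    return False


# Sub-tree rooted at the alarm node for the radiator example (leaf results are made up).
tree = {
    "overheats": Node("overheats", ["radiator-leaks", "coolant-low", "bad-thermostat"]),
    "radiator-leaks": Node("radiator-leaks", flow_chart=lambda: False),
    "coolant-low": Node("coolant-low", flow_chart=lambda: True),
    "bad-thermostat": Node("bad-thermostat", flow_chart=lambda: False),
}

result = occurred("overheats", tree)
```

Because `any` stops at the first true child, the sketch stops exploring once one cause has been confirmed; whether the actual system continues through the remaining hypotheses is not assumed here.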
In the example of FIG. 24, the Expert Information Structure has one root node 540 which points to knowledge nodes which represent every possible alarm (called alarm nodes). This node is automatically defined by the system when the domain expert defines the alarm nodes. The alarm nodes in the example are Won't Start 542, Stalls 544, Overheats 546 and Power Loss 547. Each alarm node in turn points to nodes representing the malfunctions which could have caused the alarm. In the example, Overheats 546 points to nodes representing all of the possible causes of an engine overheating. This example presumes that only three such possibilities exist as shown in FIG. 24. Radiator Leaks 548 represents a leaking radiator, Coolant Low 549 indicates that the radiator is low on coolant and Bad Thermostat 550 represents a malfunctioning thermostat.
Any one of these problems could cause a car to overheat. A Domain Expert therefore would use the User Interface to define Overheats 546 as an OR node, which means that it occurred if at least one of the nodes it points to occurred. OR nodes have the following attributes: IS-A, NODE-ID, DUE-TO-OR-NODE, DUE-TO-CONDITIONS, YES-NODE, PRE-DESCRIPTION, QUERY, HAS-TEST, CONCLUSION-IF-TRUE, CONCLUSION-IF-FALSE, FORGET, and HELP.
The attributes YES-NODE, QUERY, HAS-TEST, FORGET and HELP have no purpose in an alarm node of the preferred embodiment and will be explained below. For all node types, the attribute IS-A contains the node type. All knowledge node types for the present embodiment have been previously discussed. For OR nodes, this will be "OR node". For all node types, NODE-ID contains the name of the node. For Overheats node 546, this would be "Overheats". DUE-TO-OR-NODE contains the names of the nodes from which the truth value of this node 546 is determined. In the example, Overheats 546 will be true if one of the nodes radiator-leaks 548, coolant-low 549, or bad-thermostat 550 occurred (in this example, we assume that these are the only possible causes of overheating). DUE-TO-CONDITIONS contains the node names in DUE-TO-OR-NODE, each followed by either "yes" or "no". The value after each node name indicates whether to consider that node to be true if the associated problem occurred ("yes") or did not occur ("no"). In the example, the attribute would contain "(radiator-leaks yes) (coolant-low yes) (bad-thermostat yes)". Therefore, the node 546 will be true if any one of the radiator-leaks, coolant-low or bad-thermostat nodes is true.
For all knowledge node types, PRE-DESCRIPTION contains the message to be printed upon reaching the node. PRE-DESCRIPTION enables the user to see what ENDS is doing, and assists the Domain Expert in debugging. The PRE-DESCRIPTION in Overheats 546 could contain the message "Car overheated. Either the radiator leaks, the coolant is low or the thermostat is bad."
For all knowledge node types, CONCLUSION-IF-TRUE 562 contains the message to be printed upon determining that the node is true and CONCLUSION-IF-FALSE 566 contains the message to be printed upon determining that the node is false. One purpose of CONCLUSION-IF-TRUE and CONCLUSION-IF-FALSE is to enable the user to track the progress of ENDS and to assist the Domain Expert in debugging. Overheats 546 probably would contain no message in CONCLUSION-IF-TRUE or CONCLUSION-IF-FALSE because the truth value of an alarm node is known.
For all knowledge node types, FORGET controls whether the node 546 will be processed if its truth value has already been determined. For example, if 546 were pointed to by more than one node, FORGET would determine whether ENDS would process 546 a second time to determine its value. The value of FORGET in Overheats 546 would be inconsequential because the node 546 is pointed to only once, by the root node 540.
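The effect of FORGET can be pictured as a small cache of truth values. The following Python sketch is illustrative only: the `TruthCache` class, its `process` method and the node name "has-hardware-straps" are hypothetical, and the sketch assumes, consistent with the later discussion of test nodes, that FORGET="no" means a known value is reused while FORGET="yes" means the node is processed again each time it is reached.

```python
from typing import Callable, Dict


class TruthCache:
    """Hypothetical store of truth values already determined during an analysis."""

    def __init__(self) -> None:
        self.known: Dict[str, bool] = {}
        self.runs = 0  # how many times a node's procedure was actually carried out

    def process(self, node_id: str, forget: str, procedure: Callable[[], bool]) -> bool:
        # FORGET="no": a value determined earlier is reused without re-processing.
        if forget == "no" and node_id in self.known:
            return self.known[node_id]
        # FORGET="yes", or no value known yet: process the node and remember the result.
        self.runs += 1
        self.known[node_id] = procedure()
        return self.known[node_id]


cache = TruthCache()

# Static information: reached twice, processed once.
cache.process("has-hardware-straps", "no", lambda: True)
cache.process("has-hardware-straps", "no", lambda: True)
static_runs = cache.runs

# Volatile information: reached twice, processed twice.
cache.process("is-radiator-leaking", "yes", lambda: False)
cache.process("is-radiator-leaking", "yes", lambda: False)
volatile_runs = cache.runs - static_runs
```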
Table 2 below summarizes the attributes of the Overheats node 546:
TABLE 2
______________________________________
ATTRIBUTES OF "OVERHEATS"
______________________________________
IS-A                 OR node
NODE-ID              overheats
DUE-TO-OR-NODE       (radiator-leaks, coolant-low, bad-thermostat)
DUE-TO-CONDITIONS    (radiator-leaks yes) (coolant-low yes) (bad-thermostat yes)
YES-NODE
PRE-DESCRIPTION      "Car overheated. Either the radiator leaks, the coolant is low or the thermostat is bad."
QUERY
HAS-TEST
CONCLUSION IF TRUE
CONCLUSION IF FALSE
FORGET
HELP
______________________________________
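As an illustration of how the DUE-TO-CONDITIONS attribute could drive the truth value of an OR node, the attributes of Table 2 can be written as a Python dictionary and evaluated against known truth values of the nodes it names. This is a sketch under stated assumptions: the dictionary layout and the function `or_node_is_true` are hypothetical and are not taken from the embodiment.

```python
from typing import Dict, List, Tuple

# Attributes of "overheats" from Table 2; empty attributes are omitted.
overheats = {
    "IS-A": "OR node",
    "NODE-ID": "overheats",
    "DUE-TO-OR-NODE": ["radiator-leaks", "coolant-low", "bad-thermostat"],
    "DUE-TO-CONDITIONS": [
        ("radiator-leaks", "yes"),
        ("coolant-low", "yes"),
        ("bad-thermostat", "yes"),
    ],
    "PRE-DESCRIPTION": "Car overheated. Either the radiator leaks, "
                       "the coolant is low or the thermostat is bad.",
}


def or_node_is_true(node: dict, truth: Dict[str, bool]) -> bool:
    """Return True if at least one DUE-TO-CONDITIONS entry is satisfied.

    An entry (name, "yes") is satisfied when `name` occurred; an entry
    (name, "no") is satisfied when `name` did not occur.
    """
    conditions: List[Tuple[str, str]] = node["DUE-TO-CONDITIONS"]
    return any(truth[name] == (expected == "yes") for name, expected in conditions)


low_coolant = {"radiator-leaks": False, "coolant-low": True, "bad-thermostat": False}
nothing_wrong = {"radiator-leaks": False, "coolant-low": False, "bad-thermostat": False}

is_overheating_explained = or_node_is_true(overheats, low_coolant)
is_unexplained = or_node_is_true(overheats, nothing_wrong)
```

With all three conditions set to "yes", the node behaves as a plain logical OR of its causes; a "no" entry would instead be satisfied by the absence of the named problem.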
The first knowledge node pointed to by Overheats 546 is Radiator-leaks 548. Because it points to no other knowledge tree node, it is a leaf node. Because all leaf nodes are OR nodes in this embodiment, the Domain Expert would use the User Interface to define Radiator-leaks as an OR node. He would set the attributes as follows. DUE-TO-OR-NODE and DUE-TO-CONDITIONS would be empty because the node 548 points to no other nodes. YES-NODE is the first node of the flow-chart to which a leaf node might point. Radiator-leaks 548 points to a flow-chart with procedures to determine whether the radiator leaks. If so, the procedures repair the leak and determine whether the radiator needs coolant. If so, the procedures add coolant. YES-NODE therefore contains Is-radiator-leaking, the first node in that flow-chart. PRE-DESCRIPTION might contain "Current hypothesis: radiator leaks". The attributes QUERY and HAS-TEST apply only to leaf nodes which do not point to flow-charts and will be explained below. CONCLUSION-IF-TRUE could contain "Radiator leak repaired". CONCLUSION-IF-FALSE would contain "Radiator was not leaking". The value of FORGET here would be inconsequential, because the node 548 is only called once.
The HELP attribute contains nothing in this case since the user is not prompted for a test result. If present, the HELP attribute can be used in a number of different ways as desired by the system designer. For example, the HELP attribute can be used as a pointer to, or the file name of, a help file which contains context-sensitive information. This help file may contain, for example, text or graphical assistance to the user, or may place the user within a network map showing the network surrounding the location of an alarm, so that the operator can get a better idea of what type of problem is being encountered and the ramifications of such a problem, allowing the operator to make better informed decisions regarding corrective actions. Those skilled in the art will appreciate that many implementations are possible for use of the HELP attribute.
The attributes of node 548 (radiator leaks) are summarized in Table 3 below.
TABLE 3
______________________________________
ATTRIBUTES OF "RADIATOR-LEAKS"
______________________________________
IS-A                 OR node
NODE-ID              radiator-leaks
DUE-TO-OR-NODE
DUE-TO-CONDITIONS
YES-NODE             is-radiator-leaking
PRE-DESCRIPTION      "Current hypothesis: radiator leaks."
QUERY
HAS-TEST
CONCLUSION IF TRUE   "radiator leak repaired"
CONCLUSION IF FALSE  "radiator was not leaking"
FORGET
HELP
______________________________________
Radiator-leaks points to the flow-chart beginning with knowledge node Is-radiator-leaking 551. Is-radiator-leaking is a TEST node because the subsequent flow-chart instruction depends on the answer to a question sent to the user or the Diagnostic Machine. TEST nodes have attributes IS-A, NODE-ID, YES-NODE, NO-NODE, HAS-TEST, PRE-DESCRIPTION, QUERY, CONCLUSION-IF-TRUE, CONCLUSION-IF-FALSE, FORGET and HELP. YES-NODE contains the name of the node to branch to if the answer to the question is yes. For Is-radiator-leaking, YES-NODE is Find-leak. NO-NODE contains the name of the node to branch to if the answer to the question is no. For Is-radiator-leaking, NO-NODE is Confirm-no-leak. The FORGET attribute is used when the node may be pointed to during more than one analysis. If the information obtained in the test is static, there is generally no need to re-test, and FORGET="no". If the information is more dynamic or volatile, then it should be re-tested whenever the test node is encountered, and FORGET="yes". Table 4 below shows the attributes of Is-radiator-leaking:
TABLE 4
______________________________________
ATTRIBUTES OF "IS-RADIATOR-LEAKING"
______________________________________
IS-A                 test-node
NODE-ID              is-radiator-leaking
YES-NODE             find-leak
NO-NODE              confirm-no-leak
QUERY                "is radiator leaking?"
HAS-TEST
PRE-DESCRIPTION      "check radiator!"
CONCLUSION-IF-TRUE   "radiator is leaking"
CONCLUSION-IF-FALSE  "radiator is ok"
FORGET               yes
HELP                 "while car is running, look for dripping coolant"
______________________________________
HAS-TEST is the name of the test to send to the Diagnostic Machine if the system is operating in the automatic mode. Automatic mode and manual mode are explained below (see Inference Engine). For Is-radiator-leaking, HAS-TEST would be a test to physically check the radiator for leaks.
QUERY contains the question to print to the user if the system is operating in manual mode, explained below. The answer to the question will be the equivalent to the response to the HAS-TEST test.
CONCLUSION-IF-TRUE and CONCLUSION-IF-FALSE in Is-radiator-leaking would contain appropriate messages to the user. Because Is-radiator-leaking is pointed to only by Radiator-leaks, FORGET would be inconsequential.
HELP contains the message to print if the user requests help while this node 551 is being processed. Help messages are therefore used only in nodes which have a QUERY. For Is-radiator-leaking, HELP could contain "While car is running, look for dripping coolant".
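The handling of a TEST node described above might be sketched as follows in Python. The sketch is illustrative and rests on assumptions: the function `process_test_node`, the callbacks `run_test` and `ask_user`, and the test name "check-radiator-for-leaks" are hypothetical (Table 4 leaves HAS-TEST empty), and the answer is assumed to arrive as a simple true or false value.

```python
from typing import Callable

# Attributes of "is-radiator-leaking" from Table 4; the HAS-TEST value is a
# hypothetical test name supplied only so the automatic branch can be shown.
is_radiator_leaking = {
    "IS-A": "test-node",
    "NODE-ID": "is-radiator-leaking",
    "YES-NODE": "find-leak",
    "NO-NODE": "confirm-no-leak",
    "QUERY": "is radiator leaking?",
    "HAS-TEST": "check-radiator-for-leaks",
    "PRE-DESCRIPTION": "check radiator!",
    "CONCLUSION-IF-TRUE": "radiator is leaking",
    "CONCLUSION-IF-FALSE": "radiator is ok",
    "FORGET": "yes",
    "HELP": "while car is running, look for dripping coolant",
}


def process_test_node(
    node: dict,
    mode: str,
    run_test: Callable[[str], bool],
    ask_user: Callable[[str], bool],
) -> str:
    """Process one TEST node and return the name of the node to branch to."""
    print(node["PRE-DESCRIPTION"])
    if mode == "automatic":
        # Automatic mode: send the HAS-TEST test to the Diagnostic Machine.
        answer = run_test(node["HAS-TEST"])
    else:
        # Manual mode: print the QUERY to the user instead.
        answer = ask_user(node["QUERY"])
    if answer:
        print(node["CONCLUSION-IF-TRUE"])
        return node["YES-NODE"]
    print(node["CONCLUSION-IF-FALSE"])
    return node["NO-NODE"]


# Manual mode with a user who answers "no"; automatic mode with a test that reports a leak.
manual_next = process_test_node(
    is_radiator_leaking, "manual", run_test=lambda test: True, ask_user=lambda q: False
)
automatic_next = process_test_node(
    is_radiator_leaking, "automatic", run_test=lambda test: True, ask_user=lambda q: False
)
```

Passing the test runner and the user prompt in as callbacks keeps the branching logic identical in both modes, which mirrors the statement that the answer to the QUERY is equivalent to the response to the HAS-TEST test.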
If the user or diagnostic computer returned "false" in response to being asked whether the radiator leaked, control would go to Confirm-no-leak 552, a CONFIRM node which returns the flow of control to Radiator-leaks 548 with a value of "false". CONFIRM nodes contain the following attributes: IS-A, NODE-ID, BRANCH-TO-AND-CONFIRM, NEXT-HYPOTHESIS, NODE-HAVING-HYPOTHESIS, and PRE-DESCRIPTION. BRANCH-TO-AND-CONFIRM contains the name of the node to return control and a truth value to. Normally this is the leaf node which called the flow-chart of which the confirm node is a member. For Confirm-no-leak, BRANCH-TO-AND-CONFIRM contains "Radiator-leaks no". This