Home
Patent Search
IMT Blog
REGISTER
|
SIGN IN
United States Patent
5924094
Sutter
July 13, 1999
Title
Independent distributed database system
Abstract
An independent distributed database system comprising a plurality of sites wherein all users at all sites work off-line with local data. All application transactions are against the local database only, and every site stores "all and only" the data it needs. On-line transactions occur only in the background, including a periodical "synch" between sites that transmits any changes to data of interest to that site. If the background operations are interrupted or the network is temporarily unavailable, the user does not see new changes made at other sites until the data link is available again, but is otherwise unaffected. It is a feature that no site acts as a "server" for any other site. Some sites may store more data or have more users than others, but all sites are logically peers.
Inventors:
Sutter; Herbert P.
(Oakville,
CA
)
Assignee:
Current Network Technologies Corporation
(Mississauga,
CA
)
Appl. No.:
742024
Filed:
November 1, 1996
Current U.S. Class:
707/10
707/101
707/102
707/2
707/201
707/3
707/4
707/5
707/8
707/9
707/1
Current International Class:
G06F 17/30 (20060101)
Field of Search:
707/3,8,10,2,1,4,5,7,9,101,102,103,201 364/284.1,284.2,284.4,242.94 395/200.59,182.02,200.31,726,728 711/129 370/400,408
U.S. Patent Documents
5608874
March 1997
Ogawa
5649185
July 1997
Antognini
5664189
September 1997
Wilcox
5678041
October 1997
Baker
5682537
October 1997
Davies
5687363
November 1997
Oulid-Aissa
5721909
February 1998
Oulid-Aissa
Foreign Patent Documents
0 458 623 A2
Nov., 1991
EP
0 714 966 A2
May., 1996
EP
Other References
"Updating Loosely-Coupled SQL/DS Databases", IBM Technical Disclosure Bulletin vol. 33, No. 6B, Nov. 1, 1990, pp. 395-396. .
Bouguettaya, A. et al: "Co-Database Approach to Database Interoperability", IEICE Transactions on Information and Systems, vol. E78-D, No.11, Nov. 1, 1995, pp. 1388-1395. .
McHugh, J. et al: "Multilevel Security Issues in Distributed Database Management Systems", Computers & Security International Journal Devoted to the Study of Technical and Financial Aspects of Computer Security, vol. 7, No.4, Aug. 1, 1988, pp. 387-396. .
Yu, P.S. et al: "Dynamic Transaction Routing in Distributed Database Systems", IEEE Transactions on Software Engineering, New York, NY, US, vol.14, No.9, Sep. 1988, pp. 1307-1318..~
Primary Examiner:
Black; Thomas G.
Assistant Examiner:
Mizrahi; Diane D.
Attorney, Agent or Firm:
Ridout & Maybee
Claims
What is claimed is:
1. A distributed relational database system for a computer network, said system comprising:
a plurality of sites;
each of said sites including processing means for storing and retrieving information locally and independent of said other sites, and wherein each of said sites is the logical peer of said other sites;
said sites having means for connecting to said network and communicating with other sites connected to the network;
said processing means including means for transferring selected information stored locally by connecting to said network and transferring said selected information to other sites connected to the network;
wherein said database system comprises a plurality of activities and each activitiectivities comprises selected sites belonging to an activity group;
wherein said sites comprise spine sites and non-spine sites, said spine sites exhibiting high availability to the network, and said non-spine sites exhibiting low availability to the network;
means for selectively manning said snine sites comprising a spanning tree having nodes corresponding to said spine sites; and
means for removing a spine site from said spanning tree in response to a site leaving said activity group.
2. A distributed relational database system for a computer network, said system comprising:
a plurality of sites;
each of said sites including processing means for storing and retrieving information locally and independent of said other sites, and wherein each of said sites is the logical peer of said other sites;
said sites having means for connecting to said network and communicating with other sites connected to the network;
said processing means including means for transfering selected information stored locally by connecting to said network and transferring said selected information to other sites connected to the network;
wherein said database system comprises a plurality of activities and each of said activities comprises selected sites belonging to an activity group;
wherein information is stored in tables, and said tables comprise columns and rows, and said tables are grouped into record fragments, each of said record fragments including one or more columns in a row;
wherein said processing means includes a local clock and means for generating time-stamps for each of said fragments stored locally at said site, and said time-stamp providing an age for the corresponding fragment; and
said means for transferring selected information comprises replication means for replicating selected fragments at other sites, and said selected fragments comprising most recent fragments as determined from said time-stamps.
3. The distributed relational database system as claimed in claim 2, wherein said replication operation comprises a background data synchronization operation between sites connected to said network.
4. The distributed relational database system as claimed in claim 2, wherein said time-stamp comprises a date and time field, and said date and time field is relative to said local clock at said site where said information unit is stored.
5. A distributed relational database system for a computer network, said system comprising:
a plurality of sites;
each of said sites including processing means for storing and retrieving information locally and independent of said other sites, and wherein each of said sites is the logical peer of said other sites;
said sites having means for connecting to said network and communicating with other sites connected to the network;
said processing means including means for transferring selected information stored locally by connecting to said network and transferring said selected information to other sites connected to the network;
wherein said database system comprises a plurality of activities and each of said activities comprises selected sites belonging to an activity group;
wherein information is stored in tables, and said tables comprise columns and rows, and said tables are grouped into record fragments, each of said record fragments including one or more columns in a row;
wherein said processing means includes a local clock and means for generating time-stamps for each of said fragments stored locally at said site, and said time-stamp providing an age for the corresponding fragment;
wherein said sites comprise spine sites and non-spine sites, said spine sites exhibiting high availability to the network, and said non-spine sites exhibiting low availability to the network; and
replicator means for replicating changes to selected fragments between said spine and non-spine sites, said replicator means comprising means at each non-spite site for transmitting selected fragments to one of said spine sites, and said spine sites having means for sharing said selected fragments, and said spine sites including means for forwarding said shared data fragments to other non-spine sites which link with said spine sites.
6. The distributed relational database system as claimed in claim 5, wherein said selected fragments comprise fragments having the most recent changes as determined from said time-stamps.
7. A distributed relational database system for a computer network, said system comprising:
a plurality of sites;
each of said sites including processing means for storing and retrieving information locally and independent of said other sites, and wherein each of said sites is the logical peer of said other sites;
said sites having means for connecting to said network and communicating with other sites connected to the network;
said processing means including means for transferring selected information stored locally by connecting to said network and transferring said selected information to other sites connected to the network;
wherein said database system comprises a plurality of activities and each of said activities comprises selected sites belonging to an activity group;
wherein said activity group is defined by said sites collaborating on an activity;
wherein information is stored in tables, and said tables comprise a plurality of columns and rows, and said tables are grouped into record fragments, each of said record fragments including one or more columns in a row;
means for securing an application for activity group; and
wherein said means for securing includes a hierarchical trust structure comprising a trusted root, an organization certification authority and an application certification authority, said trusted root having means for generating license certificates for said organization certification authority, and said organization certification authority having means for generating license certificates for said application certification authority, and said application certification authority having means for generating license certificates for site and users belonging to said application network or activity group.
8. The distributed relational database system as claimed in claim 4, wherein said means for generating license certificates for said organization certification authority comprises a signing key.
9. The distributed relational database system as claimed in claim 7, wherein said means for generating license certificates for sites and users belonging to said application network comprises a signing key for user certificates and a signing key for command certificates.
10. The distributed relational database system as claimed in claim 7, wherein said means for generating license certificates for sites and users includes means for generating a site certificate for a new site added to said network.
11. The distributed relational database system as claimed in claim 10, wherein said means for generating a site certificate comprises an application-site certification signing key.
12. A distributed relational database system for a computer network, said system comprising:
a plurality of sites;
each of said sites including processing means for storing and retrieving information locally and independent of said other sites and wherein each of said sites is the logical peer of said other sites;
said sites having means for connecting to said network and communicating with other sites connected to the network;
said processing means including means for transferring selected information stored locally by connecting to said network and transferring said selected information to other sites connected to the network;
wherein said database system comprises a plurality of activities and each of said activities comprises selected sites belonging to an activity group;
wherein said activity group is defined by said sites collaborating on an activity;
wherein information is stored in tables, and said tables comprise a plurality of columns and rows, and said tables are grouped into record fragments, each of said record fragments including one or more columns in a row;
means for securing an application for activity group including means for validating the integrity of a fragment; and
wherein said means for validating comprises an encrypted field, and said site having means for generating said encrypted field, and said encrypted field being generated at the site at which the fragment was last changed.
13. The distributed relational database system as claimed in claim 12, wherein said encrypted field is derived from a fragHash value for said fragment.
14. The distributed relational database system as claimed in claim 12, wherein said means for securing includes means for encrypting a field in a record.
15. The distributed relational database system as claimed in claim 14, wherein said encrypted field is derived from a uniqueHash value for the field in said record.
16. The distributed relational database system as claimed in claim 12, wherein said means for securing includes means for securing a local database.
17. The distributed relational database system as claimed in claim 16, wherein said means for securing the local database comprises design-time security tables.
18. The distributed relational database system as claimed in claim 16, wherein said means for securing the local database comprises run-time permissions tables.
19. The distributed relational database system as claimed in claim 18, including means for applying permissions defined in said run-time permissions tables in groups to selected users or sites.
20. The distributed relational database system as claimed in claim 16, further including means for determining a minimal set of cryptographic security keys for a selected user to work with said application.
21. A distributed relational database system for a computer network, said system comprising:
a plurality of sites;
each of said sites including processing means for storing and retrieving information locally and independent of said other sites, and wherein each of said sites is the logical peer of said other sites;
said sites having means for connecting to said network and communicating with other sites connected to the network;
said processing means including means for transferring selected information stored locally by connecting to said network and transferring said selected information to other sites connected to the network;
wherein said database system comprises a plurality of activities and each of said activities comprises selected sites belonging to an activity group;
wherein said activity group is defined by said sites collaborating on an activity;
wherein information is stored in tables, and said tables comprise a plurality of columns and rows, and said tables are grouped into record fragments, each of said record fragments including one or more columns in a row;
means for creating distributed records for said fragments and providing each of said distribution records with a unique identity.
22. The distributed relational database system as claimed in claim 21, wherein said unique identity comprises a unique record identifier and a fragment number denoting the columns of the records that the fragment represents.
23. The distributed relational database system as claimed in claim 22, wherein said means for creating distributed records includes means for allocating said unique identifiers.
24. In a distributed relational database system, a method for determining a reference time between sites belonging to said system and being coupled by a computer network, said sites having local processing means and time generators, said method comprising the steps of:
(a) sending a first message from an initiator site to a receiver site at a start time;
(b) determining an arrival time when said first message is received at said receiver site;
(c) said receiver site sending a second message to said initiator site in response to receipt of said first message;
(d) determining a reply time when said second message is received at said initiator site;
(e) said initiator site determining a reference time from the midpoint of the interval between said start and
(f) said initiator site determining a reference time from the midpoint of the interval between said start and reply times, and said receiver site using said arrival time as its reference time.
25. In a distributed relational database system, a method for determining a reference time between sites belonging to said system and being coupled by a computer network, said sites having local processing means and clocks, said method comprising the steps of:
(a) sending a first message from an initiator site to a receiver site at a time t1;
(b) said receiver site determining a time t2 when said first message is received;
(c) said receiver site sending a second message at time t3 to said initiator site in response to receipt of said first message;
(d) said initiator site determining a time t4 when said second message is received;
(e) after said second message is received, said initiator site sending a third message at time t5 to said receiver site;
(f) said receiver site determining a time t6 when said third message is received;
(g) said initiator site determining a first time value by calculating a midpoint for the interval between said time t1 and said time t4, and generating a first time difference by comparing said first time value with said time t2 when said first message was received by said receiver site;
(h) said receiver site determining a second time value by calculating a midpoint for the interval between said time t3 and said time t6, and said receiver site generating a second time difference by comparing said second time value with said time t4 when said second message was received by said initiator site;
(i) averaging said first and second time differences to produce an average time difference, wherein said initiator site uses a reference time relative to its local clock, and said receiver site uses said average time difference to calculate a corresponding reference time relative to its local clock.
26. In a distributed relational database system comprising sites coupled by a computer network and the sites having local processing means and clocks, a method for checking the clocks at the sites, said method comprising the steps of:
(a) identifying a designated time keeper site from among said sites;
(b) determining a time difference value between the clock of said designated time keeper site and the clock at the other site;
(c) generating a time-stamp at said other site by off-setting the time of the local clock at said other site with said time difference value.
27. The method for checking clocks as claimed in claim 26, further including the steps of: periodically obtaining a time reading from said designated time keeper and storing said time reading, and comparing said stored time reading with a current time reading from said designated time keeper.
28. A method for securing information in a database, the information being stored in tables having columns and rows, and the information is grouped into record fragments and each of said fragments comprises one or more columns in a row, said method comprising the steps of:
(a) generating a digest of the contents of the fragment;
(c) encrypting said stamp data value to produce an encrypted stamp data value, wherein said encrypting step uses an encryption key modified by information in said fragment.
29. A security structure in a distributed relational database system having a plurality of sites connected to a computer network and having means for communicating over the computer network, said security structure comprising:
(a) a trusted root, an organization certification authority, and an application certification authority;
(b) said trusted root having means for generating license certificates for validating said organization certification authority;
(c) said organization certification authority having means for generating license certificates for validating said application authority; and
(d) said application certification authority having means for generating license certificates for selected sites wherein said selected sites belong to an application network and said selected sites use said license certificates for validating each other.
30. The security structure as claimed in claim 29, wherein said application certification authority includes means for generating license certificates for validating users at said sites.
31. The security structure as claimed in claim 30, wherein said application certification authority includes means for generating license certificates for validating releases of software at said sites.
Description
FIELD OF THE INVENTION
The present invention relates to distributed databases, and more particularly to an independent distributed relational database system operating over a local area network (LAN) or a wide area network (WAN).
BACKGROUND OF THE INVENTION
Databases comprise one of the most widely used applications found in computing today. A database is a collection of related information about a subject organized in a useful manner that provides a base for procedures such as retrieving information, drawing conclusions and making decisions. A distributed database is a variation in which information is distributed or spread over a number of sites which are connected through a communication network.
A key problem in current database design is providing equal database access to all users whether they are local or remote. For example, to provide equal access to sales agents with their portable computers, to executives working from home, to work groups at satellite offices, to business partners, and to suppliers, presents a challenge to existing database design. Advantageously, each user should be able to use and change selected information from their computer, with the same performance and functionality that they would enjoy at a workstation located at head office with the server.
While the prior art includes numerous database management systems, none of the existing systems provide "completely equal access". Known systems which allow off-site users to work with in-office information systems require the remote users to access an office LAN or central database server through expensive, slow, and often insecure dial-up lines, WAN links, and remote-access products. The major problems associated with the prior art approaches can be classified under Performance, Scalability, Reliability, Availability, Autonomy, and Security.
Performance. The remote user experiences inferior performance because the user is forced to access data at a remote location using slow modem, or WAN, connections. Furthermore, the actual data is retransmitted every time it is accessed, thereby requiring fast and/or expensive connections in order to achieve acceptable performance.
Scalability. The central server must be able to support all local and remote users. As users are added, the central server eventually becomes the bottleneck. Known systems are typically limited to about 1000 concurrent users.
Reliability. The central server must be regularly backed up. If a problem occurs, work done since the last backup is lost and all the dependent users must re-enter their recent work. Connection faults are also a common problem for remote users of known systems. When there is a connection fault, the user is interrupted until the connection is re-established.
From the foregoing, it will be appreciated that reliance on a central site or service is undesirable because that site could become a bottleneck as well as a point of failure.
Availability. In known systems, all remote users depend on a central server. If the central server is down, then all users are down and cannot work with the database until the server recovers. Windows of acceptable down time are measured in seconds or minutes and servers are typically required to deliver better than 98% availability during working hours.
Autonomy. The remote users are partly or fully dependent on the server. Remote users will not always have an on-line data connection, e.g. modem or WAN, but, a remote user can only work with the database when on-line and connected. For example, sales agents on the road, or executives in airplanes, cannot use the database without very slow, expensive and unreliable cellular or satellite datalinks to keep them on-line to the central server.
Security. Remote access links in conventional systems are often not encrypted. Even when the links are protected, the organization loses control of any data sent to the remote machine. The remote user, who may not be an employee, but a supplier or customer, can use the information however they want.
Currently there are two main approaches to sharing relational databases: traditional distributed databases, and traditional replication systems.
Traditional distributed databases distribute the servers only, so that each site stores some subset (none, all, or some) of the rows and columns in each table. Clients access the distributed database by connecting normally to their local server. When a client makes a request which involves data stored at other servers, the network handles the request appropriately and returns the expected result so that to the client it appears as though the request was handled locally. Multi-site integrity is controlled through complex two-phase commit and equivalent protocols.
The main problem with this approach is that a database transaction must be performed on-line and involve every server which stores information involved in the transaction. This is costly in performance, since the speed of network connections between distributed servers affects the speed of the client's transaction. It is also fragile, since all involved servers must be available for the transaction to succeed. If any server(s) are not available, then the client receives either an incomplete result or the client's transaction fails completely which results are both undesirable.
With respect to replication, traditional replication systems replicate data primarily between servers but may also be used to replicate data to client machines. The main problems with known replication systems are that they are not peer systems, are complex to administer and maintain, and have integrity problems because of replication granularity. They are not peer systems because they distinguish between "master" and "replica" databases and cannot support fully equal operation at all sites, local and remote. They are complex to administer and maintain because distribution rules are typically configured using row and column selection for every table at every site. For example, a simple change operation (e.g. "Site 3 now needs Customer #531's information") typically requires extensive and error-prone changes to row and column selections in multiple database tables for that site.
Further, existing replication systems have integrity problems because they typically use record-level or field-level replication granularity. Consider a Customer table with fields Address, City, State, ZipCode, PaymentTerms, and CreditRating. With record-level granularity, the fields of an entire record are replicated together, which gives false collisions that are tedious for administrators to review; for example, changing a customer's Address at one site and the CreditRating at another will result in a collision even though the two fields are unrelated (i.e., the person changing the Address knows that the customer moved, while the person changing the CreditRating knows that the customer failed to pay promptly). With field-level granularity, each field in a record is replicated separately, which solves the false collisions of record-level replication but causes integrity problems by not reporting collisions when two related fields are changed at different sites; for example, changing a customer's Address at one site and the ZipCode at another should result in a collision (i.e., the person changing the Address because he knows that the customer has moved must also know enough to change the ZipCode).
The IDDB database according to the present invention overcomes the disadvantages associated with the prior art. The present invention provides a database architecture in which all users at all sites work off-line with local data, that-is, all application transactions are against the local database only, and every site locally stores "all and only" the data it needs. This means that application transactions are not network-dependent and therefore do not suffer speed or availability problems when the network or remote sites are down or loaded. The on-line transactions only occur in the background, including a periodic synchronization between sites that transmit any changes to data that is of interest to the site. If the background operations are interrupted or the network is temporarily unavailable, the user does not see new changes made at other sites until the datalink is again established, but otherwise the user remains unaffected. According to the present invention, no site acts as a "server" for any other site, however, some sites may store more data or have more users than others, but all sites are logically peers.
BRIEF SUMMARY OF THE INVENTION
The present invention provides an architecture for an independent distributed database or IDDB. In the IDDB, all sites, i.e. nodes, are peers and no site acts as a server for another. This means that unlike conventional database replication systems, the distributed database according to the present invention does not distinguish between "master" and "slave" sites, or "primary" and "secondary" sites, or "service" and "replica" sites. With the IDDB, any subset of sites continue to operate normally without the need for a master site.
Each site stores "all and only" the data it needs. It is a feature of the present invention that users work off-line with local data, and all application transactions are against the local database. Sites sharing the same data synchronize their changes periodically in the background and changes made at one site become visible to all the other interested sites. It is a feature of the IDDB that there are no on-line or distributed application transactions because all application transactions are local. There are network transactions for performing replication and housekeeping functions, but they operate in the background and are not visible to the application, or the user.
In respect of the shortcomings associated with the prior art architectures as described above, the database architecture according to the present invention provides a significant improvement in these areas.
Performance. According to the invention, all users utilize local databases to which they have high-speed (i.e. network or same machine) access. There is no dependency on remote datalinks for any part of normal operation. According to another aspect, the background sync transactions are faster because only changed data is transmitted, and then only once to each affected site. This feature greatly reduces the bandwidth requirements and thereby allows the use of slow (and inexpensive) modem links for most business applications.
Scalability. According to the present invention, there is no central server requirement. Thus, no site acts as a server for any other site and as a result no site becomes a bottleneck to user expansion (as commonly experienced with the central server architecture of known systems). As a result, the communications load and hardware requirements at each site are independent of the size of the network. For example, if a sales agent using his notebook is working with 200 customers, the communications load is defined by the changes made to those 200 customers and the local database will store only those 200 customers. If next year there are ten times as many sales agents, then each sales agent will still be storing about 200 customers in their local, i.e. notebook, computer, and the agent is still working with his 200 customers and the communications and local database loads for the agent remain unchanged regardless of the total size of the network or the total number of sites and users. The IDDB according to the present invention runs an application as easily at 10 sites as it does at 10,000.
Reliability. It is feature of the present invention that redundancy is built into the network, thereby reducing or eliminating the need for backups. If a site is destroyed, the IDDB application is reinstalled with a blank database and connected to the network. Once re-attached to the network, the application receives an initial download and recovers all of its information from the other sites connected in the network to achieve normal operation. The only data that would be lost are the changes made at the site since the last sync operation, however, no users at the other sites are affected or need to re-enter data. The IDDB provides full reliability because no site depends on another site for its operation.
Availability. According to the invention, if one site is down, no other site is affected because no site depends on another and all work at a site is done off-line by default. If all other sites in the network are down for a week or a month, and users at the remaining sites continue working, the users will eventually notice that their changes are not being seen by anyone else and that no one else's changes are appearing to them. As a result, windows of acceptable down time can be measured in days or weeks, not seconds or minutes as in prior art systems. According to this aspect of the invention, the IDDB provides improved availability primarily because it always fully replicates all data.
Autonomy. According to the present invention, the sites are fully independent of each other and also independent of the communications link. For example, sales agents who are on the road, and executives who are travelling in airplanes, can continue working as usual regardless of whether they are currently connected to a modem or a network link. It is a feature of the IDDB that all data needed for an application(s) is actually stored at each local site. This means that users on the IDDB are able to work with data without knowing exactly where else, i.e. at other sites in the network, the data is also stored.
Furthermore, the IDDB exhibits fragmentation independence, that is, sites in the IDDB operate as though the database is not fragmented at all, because for each site its local copy of a table is the whole table. Each site, however, will by definition have some fragment of the database, defined by the information its users need.
Security. It is a feature of the IDDB database according to the present invention that all communication links are encrypted. All data stored locally, even on an untrusted machine run by a potentially untrusted user, is secured so that it can be accessed only through a legitimate application running on the system.
According to another aspect of the invention, the IDDB features a network architecture which comprises one or more application networks. An application network is defined as the set of all sites running a given IDDB application. The application network is a virtual network running on top of a physical network connection. It is a feature of the invention that a given site may run several different IDDB applications at the same time.
The network architecture preferably comprises a network structure that allows all sites to communicate efficiently and effectively. In particular, the network structure preferably has the capability to distinguish between stable sites and transient sites in order to minimize dependencies on transient sites. A stable site is defined as a site which features high availability and forms a long-term component of the application network, for example a site or node located within the organization that owns or operates the application. A transient site, on the other hand, comprises a site which is either intermittently available or a short-term participant in the application network, for example a computer belonging to a mobile user or users outside of the parent organization.
The network structure according to the present invention also features fault detection and repair mechanisms, including automatic network reconfiguration.
The network structure also comprises suitable sub-networks for each activity group. An activity group is defined as the group of sites presently collaborating on a given activity, i.e. storing a copy of that activity's data (or some sub-set thereof). Preferably, the network structure provides the capability to manage dependencies on transient sites which are participating and provides effective automatic error recovery and reconfiguration.
The independent database according to the present invention also features the capability to replicate updates, so that any change made in an activity at a site becomes visible to all sites belonging to the activity group in that application network. According to this aspect of the invention, updates are propagated. To do this efficiently, two sites must be able to agree on the "age" of each piece of data in the database, so that newer versions correctly update older ones without introducing unnecessary updates when both sites already have the same version of the data. Accordingly, the present invention includes mechanisms to allow fragment age agreement and accommodation of relative clock drift between sites, and the means for providing consistent local time stamping when there are several, and possibly inconsistent, local clocks at the same site.
Based on the activity as the unit of collaboration, the replication rules according to the present invention feature ease of implementation and administration. A simple change (e.g., "Site 3 now needs Customer #531's information") requires a simple command only ("attach to Customer #531"), and the IDDBMS automatically includes all related information in related tables. Using the fragment as the unit of replication, fields with a common update responsibility are replicated as a unit; changes to unrelated fields (e.g., Address and CreditRating) never result in false collisions requiring tedious administration, and changes to related fields (e.g., Address and ZipCode) are always correctly identified as collisions.
The IDDB according to the present invention also features a novel independent distributed database management system (IDDBMS). According to this aspect of the invention, a database comprises a collection of activities that can be collaborated on by various users at various sites and services that users and sites can selectively use. The IDDBMS according to the present invention provides a mechanism whereby a site, working off-line from all others, can create a new record and therefore a new key. The new keys are generated off-line in such a manner that the generated key is guaranteed to be unique across the entire database. In addition, the IDDBMS includes means for correctly handling record deletion and record modification across the entire database.
In another aspect, the database management system (IDDBMS) according to the present invention includes means for replicating modified data. The means for replicating modified data comprises a fine-grained replication process based on record fragments. A record fragment according to the present invention is a piece of an individual record, and comprises a subset of columns in a record.
In another aspect, the IDDBMS according to the present invention includes means for determining whether a fragment has been damaged and means for recovering a damaged fragment.
In yet another aspect, the IDDBMS according to the present invention includes means for securing the information transmitted across the application networks. Since each site may be part of several application networks (i.e. if the user has installed multiple IDDB applications), the security of each application must be isolated so that each application provider can separately handle the user's permissions, password change requirements, and other security details for the application regardless of the user's access privileges to other applications running at the same site. In particular, a user having privileges in one application must not have the capability to use this authority to gain greater access to the database of another application. According to this aspect of the invention, the IDDBMS includes means for ensuring that the application's database can be read and written only through a legitimate application program and by legitimate users. In particular, the IDDBMS prevents a user from bypassing the application and inspecting or changing the physical contents of the local database file.
Another feature of the IDDBMS according to the present invention is the elimination of the need for distributed query processing. In a traditional distributed database, query optimization is critical for the performance of the system. In the present invention, query processing is simplified because transactions do not depend on the availability of other sites in the system, i.e. all database transactions are local. Furthermore, the need for a distributed transaction manager is also eliminated.
In yet another aspect, the IDDB according to the present invention provides a means for operating inherently incompatible commercially available Database Management Systems (DBMS). According to this aspect of the invention, the IDDB utilizes a DBMS-independent channel, for example, ODBC (Open Database Connectivity), for accessing the database product, and the IDDB separates the distribution and security controls from the physical database. This feature allows existing database management systems (DBMS's), such as, ORACLE.TM., INGRES.TM., SYBASE.TM., PARADOX.TM. and ACCESS.TM. products, to be used together transparently at different sites on the same application network in the IDDB.
In a first aspect, the present invention provides a distributed relational database system for a computer network, system comprising: a plurality of sites; each of sites including processing means for storing and retrieving information locally and independent of other sites, and wherein each of sites is the logical peer of the other sites; the sites having means for connecting to the network and communicating with other sites connected to the network; and the processing means including means for transferring selected information stored locally by connecting to the network and transferring the selected information to other sites connected to the network.
In a second aspect, there is provided a security structure for a distributed relational database system having a plurality of sites connected to a computer network and having means for communicating over the computer network, the security structure comprising: (a) a trusted root, an organization certification authority, and an application certification authority; (b) the trusted root having means for generating license certificates for validating the organization certification authority; (c) the organization certification authority having means for generating license certificates for validating the application authority; and (d) the application certification authority having means for generating license certificates for selected sites wherein the selected sites belong to an application network and the selected sites use said license certificates for validating each other.
In a third aspect, the present invention provides a method for determining a reference time between sites belonging to a distributed relational database system and being coupled by a computer network, the sites having local processing means and time generators, the method comprising the steps of: (a) sending a first message from an initiator site to a receiver site at a start time; (b) determining an arrival time when the first message is received at the receiver site; (c) said receiver site sending a second message to the initiator site in response to receipt of the first message; (d) determining a reply time when the second message is received at the initiator site; (e) the initiator site determining a reference time from the midpoint of the interval between the start and reply times, and the receiver site using the arrival time as its reference time.
In another aspect, the present invention provides a method for determining a reference time between sites in a distributed relational database system, the sites being coupled by a computer network and having local processing means and clocks, the method comprising the steps of: (a) sending a first message from an initiator site to a receiver site at a time t1; (b) the receiver site determining a time t2 when the first message is received; (c) the receiver site sending a second message at time t3
to the initiator site in response to receipt of the first message; (d) initiator site determining a time t4 when the second message is received; (e) after the second message is received, the initiator site sending a third message at time t5 to the receiver site; (f) said receiver site determining a time t6 when the third message is received; (g) the initiator site determining a first time value by calculating a midpoint for the interval between the time t1 and the time t4, and generating a first time difference by comparing the first time value with the time t2 when the first message was received by the receiver site; (h) the receiver site determining a second time value by calculating a midpoint for the interval between the time t3 and the time t6, and the receiver site generating a second time difference by comparing the second time value with the time t4 when the second message was received by the initiator site; (i) averaging the first and second time differences to produce an average time difference, wherein the initiator site uses a reference time relative to its local clock, and the receiver site uses the average time difference to calculate a corresponding reference time relative to its local clock.
In yet a further aspect, the present invention provides for a distributed relational database system comprising sites coupled by a computer network and the sites having local processing means and clocks, a method for checking the clocks at the sites, the method comprising the steps of: (a) identifying a designated time keeper site from among the sites; (b) determining a time difference value between the clock of the designated time keeper site and the clock at the other site; (c) generating a time-stamp at the other site by off-setting the time of the local clock at said other site with said time difference value.
BRIEF DESCRIPTION OF THE DRAWINGS
Reference will now be made, by way of example, to the accompanying drawings which show preferred embodiments of the present invention, and in which:
FIG. 1 shows in diagrammatic form a network topology for an independent distributed database (IDDB) according to the present invention;
FIG. 2(a) shows in diagrammatic form an example of an IDDB application network according to the present invention;
FIG. 2(b) shows the global activity group for the application network of FIG. 2(a);
FIG. 2(c) shows an activity group for a first customer in the application network of FIG. 2(a);
FIG. 2(d) shows an activity group for another customer in the application network of FIG. 2(a);
FIG. 3 shows in diagrammatic form another exemplary IDDB application network and activity groups according to the present invention;
FIG. 4 shows in block diagram form a software architecture for the IDDB of FIG. 1;
FIG. 5 shows in diagrammatic form update propagation in an activity group according to the present invention;
FIG. 6(a) shows a first example of a balanced spanning tree;
FIG. 6(b) shows a second example of a balanced spanning tree;
FIG. 7 shows mapping for the first example of FIG. 6(a);
FIG. 8 shows in block diagram a database structure for an IDDB application;
FIG. 9 shows the operations for identifying activity tables for the database structure of FIG. 8;
FIG. 10 shows the operations for assigning the tables to activities;
FIG. 11 shows an example of a non-rooted activity part for the IDDB of FIG. 8;
FIG. 12 shows exemplary design-time tables for the IDDB application of FIG. 8;
FIG. 13 shows exemplary runtime permission tables for the IDDB of FIG. 8;
FIG. 14 shows exemplary network tables for the IDDB of FIG. 8;
FIG. 15 shows exemplary local and support tables for the IDDB of FIG. 8;
FIG. 16 shows a trust structure according to the present invention;
FIG. 17 shows a method for stamp field generation according to the present invention;
FIG. 18 shows a stamp field optimized for replication;
FIG. 19 shows a subset of design-time tables;
FIG. 20 shows a subset of runtime permission tables;
FIG. 21 shows a method for field encryption according to the present invention;
FIG. 22 shows a dsecLogon according to the present invention.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
The present invention provides an Independent Distributed Database system and a relational database management system which will also be referred to as an IDDB and IDDBMS, respectively, in the following description.
According to the invention, all sites in the IDDB, i.e. nodes, are peers and no site acts as a server for another. Each site stores "all and only" the data it needs. It is a feature of the present invention that users always work off-line with local data, and all application transactions are against a local database. Sites sharing the same data synchronize their changes periodically in the background and changes made at one site become visible to all the other interested sites. It is a feature of the IDDB database that there are no on-line or distributed application transactions, as all application transactions are local. There are network transactions, but they operate fully in the background and are not visible to the application, i.e. the user.
Reference is first made to FIG. 1, which shows in diagrammatic form an Independent Distributed Database or IDDB indicated generally by reference 1. The IDDB 1 comprises an application running as a virtual network which is defined by sites running a given IDDB application on a physical communication network. As shown in FIG. 1, the IDDB application 1 comprises a Local Area Network or LAN 2, a Wide Area Network or WAN 4, and a number of remote computers 6. The LAN 2 comprises a server 8
and workstations 10, indicated individually as 10a, . . . 10n. The LAN 2 is coupled to the WAN 4 through a gateway 12, and the WAN 4 comprises workstations 14, shown individually as 14a, 14b, 14c. The IDDB application network 1 also includes a series of workstations 6a, 6b, 6c which access the LAN 2 through dial-up access, for example, a modem connection. Each computer or workstation corresponds to a site in the IDDB application network 1.
From FIG. 1, it will be understood that the IDDB 1 is formed from the set of sites (e.g. server 8, workstations 10, 14 and remote or mobile computers 6) which run a given IDDB application. In this sense, the IDDB application network 1 is a virtual network which runs on a physical communication transport, i.e. LAN, WAN or Internet) and is defined by the members or sites running a particular application. According to the invention, a site can run more than one application, and therefore more than one IDDB application network will exist over the same physical network. For example, in FIG. 1, the remote or mobile computers 6a, 6b, 6c and the server 8 belong to an IDDB application network for the sales department, and the computers 10a,
10b and server 8 belong to an IDDB application network for accounts receivable.
The IDDB application network 1 may also be viewed as a clique, meaning that every site or node is assumed to be able to initiate or receive a connection link to or from every other site or node as required. This means that the underlying physical networking system handles all routing and supplies a logical direct link to every other node. If the IDDB application runs on a common network, such as the Internet, then all sites will have a connection to the common network and every site is accessible by another site using the appropriate IP address. As depicted in FIG. 1, the IDDB application network 1 does not run on a common physical network, but rather comprises the LAN 2, the WAN 4 and remote dial-in sites 6.
The network topology for an IDDB application network 1 according to the present invention is considered further with the examples shown in FIGS. 2 and 3.
Reference is made to FIG. 2(a) which shows an example of an IDDB application network denoted generally by 20. The IDDB application network 20 comprises a head office 22, branch offices 24 and individual office workstations 26 or mobile users 28. Head office 22 typically comprises a large computer, e.g. mainframe, running a DB2.TM. database, or the like, which contains the entire database for the organization (although it will be understood that a central database is not required by the IDDB). Branch offices 24 are shown individually as 24a, 24b, 24c, . . . and typically comprise a UNIX-based machine running an Oracle.TM. database or the like. The individual workstations 26 comprise desktop or laptop computers with a Windows.TM. platform and run Paradox.TM. database or the like.
No lines between the sites are shown in FIG. 2(a) because according to the invention all sites, i.e. head office 22, branch offices 24, individual workstations 26 and mobile users 28, are peers and communicate with each other regardless of the presence or absence of other sites in the network 20. For example, mobile user 28b is free to communicate with branch office 24b regardless of whether head office 22 is connected. The structuring and controlling of links between the sites will be described below. As will also described in more detail below, the IDDB application network 20 comprises activity groups.
FIG. 2(b) shows a global activity group 30. By definition, the global activity group 30 is a special activity group which includes all the sites and all the information (including system housekeeping information which is not contained in an explicit activity). In particular, the global activity contains user and site database information, global user permissions, translations of text (e.g. static text tags, menu text . . . ) appearing in the application's user interface and reports.
The information in a global activity is used at all sites, and therefore the global activity group 30 includes all the sites, i.e. head office 22, the branch offices 24, the individual work stations 26 and mobile users 28, in the IDDB network 20
as depicted in FIG. 2(b).
FIGS. 2(c) and 2(d) show individual activity groups. The sites belonging to the activity groups may be the same, overlapping or distinct. For example, if the IDDB application comprises a customer database, then FIG. 2(c) represents an activity group 32 for customer no. 1003875, and FIG. 2(d) represents an activity group 34 for customer no. 1019845. As can be seen, mobile users 28d, 28e belong to both activity groups 32 and 34, i.e. sites 28d and 28e have both customers 1003875 and 1019845. The remaining sites, e.g. 22, 24c, 26d, 28c, are non-overlapping. It is noted that no branch offices 24 or head office 22 are included in the activity group of the information for customer 1019845. In practical terms, this means that head office 22 and the branch offices 24 are not interested in the information of customer 1019845.
It will also be understood that some or all of the sites in the application network 20 may also appear in different application networks (not shown), so that the sites participate independently in each application network.
Reference is next made to FIG. 3 which shows an exemplary IDDB application network 36 for a construction company. In FIG. 3, all the sites are shown with the same symbol as they are peers regardless of the individual physical database size at the site. The IDDB network 36 comprises three activity groups 36a, 36b, 36c. In the context of a construction company, the first activity group 36a comprises the sites participating in "Project A"; the second group 36b comprises the sites in Project B; and the third group 36c includes the sites involved with Project C.
The IDDB 1 distinguishes between two types of sites: (1) stable sites, and (2) transient sites. Stable sites are sites which are long-term members of the application network and are assumed to have consistently high availability. Stable sites are typically machines within, and controlled by, the organization that owns the application, for example, the mainframe 22 and workstations 26 at head office, and the mini-computer and workstations 26 at the branch offices 24. Transient sites, on the other hand, are machines which are either short-term participants in the application network or have only intermittent availability. Transient sites typically comprise machines belonging to home users or mobile users who may not always be connected to the communication network, or sites belonging to their organizations. In the context of a spanning tree, the stable sites are termed "spine sites" and the transient sites are termed as "non-spine sites".
By distinguishing between stable and transient sites, the IDDB 1 attempts to avoid dependencies on sites with uncertain availability, preferably, a site never attempts to contact a transient site, and communications with a transient site are initiated by the transient machine when it attempts to contact a stable site. The algorithms and procedures for distinguishing between stable and transient sites are described below.
In order to replicate all changes made to an activity to all the other sites in an activity group, the database management system for the IDDB (i.e. IDDBMS) includes a network clock and procedures which enable two sites to agree on the age of changed data, so that changes replicate correctly, but unchanged information is not transmitted redundantly. The network clock procedures include procedures for relative clocks and reference time agreement, drift resistant clocks, and checked clocks as will be described in detail below.
Reference is next made to FIG. 4 which shows the software architecture for an IDDB site 20 according to the present invention. The IDDB site 20 comprises the IDDBMS and a number of processes which are depicted as blocks in FIG. 4. The IDDB site
20 includes a registry server 22, a replication engine 24 (DRE.EXE), an administrator application 26 (DA.EXE), and one or more user applications 28, shown individually as 28a, 28b, 28c . . . As shown in FIG. 4, the processes run on separate machines which are coupled to a local area network 21, alternatively, the processes run on a single machine at the site 20. The replication engine 24, the administrator application 26, and the user application 28 are run-time processes, and according to the invention, all communications between any of these run-time components must be secured, i.e. authenticated and encrypted. As shown in FIG. 4, the user application 28 interfaces to the network 21 through an unextended security library 30 (DSEC.DLL), and the administrator application 26 and replication engine 24 interface to the network 21 through an extended security library 32 (DSECX.LIB). For the arrangement shown in FIG. 4, the IDDB site includes one application database 34. The application database 34 resides on an application database server 36 which is accessed through the network 21. It will however be appreciated that the IDDB can run more than one application database.
As shown in FIG. 4, the IDDBMS (i.e. database management system for the IDDB) comprises the registry server 22 and registry database 38, the administrator application 26, the replication engine 24 and user interface 40, and the security libraries
30,32. The user application 28 and application database 34 while present are not part of the IDDBMS.
Referring to FIG. 4, the registry server 22 comprises a run-time component which stores a registry database 38 for the IDDBMS. The IDDBMS registry database 38 tracks the IDDB applications installed at the site 20 and each application's customized security, distribution, and other rules. The registry database 38 may be installed on any machine where it is accessible by the replication engine 24, for example, including the same machine on which the replication engine 24 runs. There is one registry database installed at each site.
Referring to FIG. 4, the replication engine 24 is another run-time process which manages the site-to-site distribution controls for the IDDB applications installed at the site 20. The replication engine 24 is a daemon process that uses a replication engine user interface (DREUI.EXE) 40 to provide the user interface for operator control at the site 20. As shown in FIG. 4, the replication engine includes an interface to one or more other networks 42 for connecting the other sites (not shown) running the application(s). The replication engine 24 uses the registry database 38 (in the registry server 22) to determine the installed applications and their rules. There will be at most one running replication engine 24 at each site 20
regardless of the number of IDDB applications that may be installed at the site 20.
Referring to FIG. 4, the replication engine user interface 40 is a run-time component which provides a user interface for operator control of the replication engine 24 at the site 20. As shown in FIG. 4, the replication engine user interface 40
runs on a separate machine coupled to the LAN 21. The replication engine user interface 40 may be configured in one of three ways. The replication engine user interface 40 runs on the same machine as the replication engine 24. This is typical for a site with a single machine. Secondly, the replication engine user interface 40 runs on another machine located on the same LAN 21 as the replication engine 24. This scenario is shown in FIG. 4 and is typical for office sites where an operator prefers to control the replication engine 24 from a desktop computer rather than walking to the machine running the replication engine 24, this is particularly useful when several different replication engines are operated. Thirdly, the replication engine user interface 40 runs remotely. Such an arrangement is preferred by organizations which have central administrative control over selected remote sites and they do not want to pass control to local site operators.
Referring to FIG. 4, the administrator application 26 comprises a general purpose (i.e. generic) tool which is used to administer the IDDB applications 28 running at the site 20, while keeping each application's security system separate and distinct. The administrator application 26 provides control over user and activity permissions, site operations, and other facilities required by the IDDB applications 28. Using tools (i.e. utilities) provided by the IDDBMS, an application designer will specify the sets of permission types for each activity, and the administrator application 26 allows the application administrators at run-time to combine the basic permission types into permission groups to which users and sites can be assigned (as described in more detail below). According to this aspect of the invention, administration functions are included in a generic tool which is available to application developers. Thus, there may be several instances of the application administrator 26
running at the same time on different machines, or administering the same or different applications.
Referring to FIG. 4, the security library 32 provides a client application developer with controlled access to a subset of the security engine, and in particular to encryption/decryption and time-stamping functions that the end application uses to read from and write to the application database 34. According this aspect of the invention, all end-user applications share the same security library 30. Preferably, the security library 30 is further encapsulated by a high level class library, for example, in a Delphi.TM. product, the TSecure table component is preferred instead of the usual TTable component which is used for all database access.
The security library 32 for the replication engine 24 comprises an extended version of the security library 30 and is statically linked into the administrator application 26 and the replication engine user interface 40. The extended security library 32 differs from the application security library 30 in that it allows full access to the entire application database 34, including records of user permissions and administrative functions normally denied to client applications. For example, client applications are never permitted to encrypt using keys to tables which store user permissions, and therefore are prevented from bypassing the administrator application 26 to generate their own legitimate data in those tables in a way that will be accepted by the replication engine 24. Accordingly, if there are several instances of the administrator application 26 running on the same machine at the same time, each is loaded with its own copy of the security library 32. It is possible to publish the security library 32 as a DLL, but as will be appreciated by those skilled in the art this creates a security risk and attackers might be able to use the security library 32 to create malicious applications, which cannot be created using the unextended security library 30. (The operation of both the extended and the unextended security libraries 30,32 is restricted to the permissions owned by the user whose user.sub.-- id and password are supplied on start-up.)
Referring to FIG. 4, the user application 28 interfaces to the IDDBMS and presents its own custom graphical user interface and reporting functions. The user application 28 allows a user to access the application database 34 running on the database server 36. The user application 28 runs on top of the IDDBMS and in accordance with the invention, all database transactions performed by the user application 28 are against the local application database 34 only. The (unextended) security library 30 provides the user application 28 with database encryption, time-stamping, and other services, as will be described in more detail below. While each application 28 is registered once in the registry database 38, there may be several copies of an application (e.g. 28a, 28b and 28c) running at the same time, and at the same or different machines.
Referring still to FIG. 4, the application database server 36 stores the local (i.e. site) copy of the application database 34. The application database 34 may be running on the same machine as the client application 28, or on a separate server machine 36 (as depicted in FIG. 4), or on a cluster of server machines (not shown). Such choices will depend on the preferences of the designer for the application 28.
The software for the server 36 is typically supplied by the DBMS vendor, and may comprise a simple local database driver, for example a desktop database, or include a complex full server engine implementation, i.e. for client/server databases running on the server or server cluster 36. Because each application 28 may use a different DBMS, there may be several servers 36 running at the site 20.
Alternatively, the application 28 may use more than one application database 34, with each of the application databases 34 individually distributed through the replication engine 24 and stored on a different server machine or server cluster. It will, however, be understood that there is always exactly one copy of the application database 34 installed at each site 20. If application database 34 is implemented as a logical database, i.e. several physical databases form the database 34, then the local database 34 will include all of its parts.
In a practical system, the IDDB site 20 may take several forms.
For example, the IDDB site may comprise a sales agent's notebook computer or a home user's stand-alone machine. Where the site comprises a single machine, the registry server 22, the replication engine 24, the application database 34, and all instances of the administrator application 26 and the user applications 28 run on the same physical machine.
A medium-sized IDDE site, on the other hand, may comprise a client/server installation at a regional office with several hundred individual users, each of which runs the application 28 and possibly the administrator application 26 from his or her workstation. The application database 34 is stored on a server or server cluster 36, and the replication engine 24 and the registry database run 38 on a communication server connected to the Internet or appropriate networks. With such a system, if the local server 36 is down, then all local users are down. (However, users at other sites are unaffected by the failure of the local server 36). In a larger office, it may be preferable to partition the machines into several IDDB sites. For example, a thousand users are partitioned into ten groups of 100 users, and each group has a copy of the local database 34 installed on its own server or server cluster 36. Each group will also have its own replication engine 24 to replicate frequently with the other nine sites in the same building. Such a configuration exhibits improved fault tolerance, since failure of a server 36 will only affect 1/10th of the users in the office.
Activity Groups
It is a feature of the present invention that each site in the distributed database stores "all and only" the data it needs. To allow users to easily find and choose the data they need to work with from the IDDB database, the database is characterized as a set of "activities" on which users can collaborate, and the activity becomes the unit of collaboration. The following description describes a structure for relationships between sites collaborating on common activities, i.e. activity groups, and in particular how to propagate changes efficiently between the members of the activity group.
According to this aspect of the invention, any site with adequate permissions may attach to an activity. Every activity is attached to by one or more spine sites and zero or more non-spine sites. A spine site comprises a stable site which is defined as a long-term member of the application network and is assumed to have consistently high availability. A non-spine site comprises a transient site which is defined as a short-term participant in the application network or a site having only intermittent availability. Transient sites are typically machines belonging to home or mobile users which may not always be connected to the communications network.
Preferably, the distinction between spine and non-spine sites is invisible to the user, thereby allowing the user to participate fully in the activity regardless of the type of site, i.e. stable or transient.
At a system level, a distinction is made between spine sites and non-spine sites in order to manage availability and propagation of updates across the application network. In the IDDB, non-spine sites never link directly with each other in order to avoid creating dependencies on transient sites with uncertain availability. Only spine sites are assumed to have consistently high availability, and thus a non-spine site reports directly to a spine site, though not necessarily always the same spine site. The spine sites themselves are linked as needed using a spanning tree.
Reference is made to FIG. 5 which shows how an activity is propagated through an activity group in three stages. In FIG. 5, the activity group is denoted generally by 50 and comprises spine sites 52 and non-spine sites 54.
The non-spine sites 54 report changes to the spine sites 52 through a set of links 56. The spine sites 52 report to other spine sites through a set of links 58, and the spine sites 52 report changes to the non-spine sites 54 through a set of links 60. Each link between a pair of sites takes the form of a "database sync" operation. During the database sync operation, the sites determine what record fragments they have and then transmit only updated, i.e. more recent, fragments to each other.
The update propagation procedure involves the following operations. In the first stage, each non-spine site 54 transmits (i.e. link 56) the changed record fragments to one of the spine sites 52. The non-spine site 54 may report to the same spine site 52.
The second stage in the update propagation involves the spine sites 52 sharing all record fragment changes among themselves (i.e. depicted as link 58 in FIG. 5). The changes are shared using a spanning tree established for the spine sites 52. A spanning tree as will be understood by those skilled in the art may take the following form: first, all leaf nodes or `children` in the tree "push" up the changes to their `parents` (i.e. the link 58 is a normal synchronization session); then when the parents have seen all the reports from their children, or else timed out, the parents "push" up the changes to their parents; and the process is repeated until the root is reached. When the root is reached (i.e. the root sees all the reports from its children, or else is timed out) the root "pushes" the changes to its immediate children. As soon as each child receives the changes from the root, the children "push" them to their immediate children, and the process is repeated until the leaves are reached. Generation of a spanning is described in more detail below.
The operation of the second stage of the update propagation may be refined. If during the upward wave (i.e. all links between two adjacent levels in the tree) a child is unable to contact its parent, the child attempts to contact other nodes until a working node is located. Every site contacted unexpectedly by a non-child during the upward wave remembers that node, and the site includes the child in its downward wave to ensure that the child will be notified of all updates.
The third stage in the update propagation procedure involves each of the non-spine sites 54 linking again with the spine site 52 as depicted by link 60 in FIG. 5. Because the spine sites 52 are stable sites (i.e. highly available), the onus is on the non-spite sites 54 to link with the spine sites 52.
It will be appreciated that because all changes made at the non-spine sites 54 are normally scheduled to be made visible to the entire spine group (i.e. through links 56, 58) before the non-spine sites 54 link again with the spine sites 52 (i.e. link 60), all sites will see all updates made before the propagation process began at all sites, i.e. spine and non-spine alike, except for updates made at sites which were unavailable during the propagation.
Generating the Spanning Tree
In the context of the present invention, the spanning tree preferably satisfies the following three requirements: (1) minimum height; (2) weighting by bandwidth; and (3) weighting by availability. When propagating changes it is preferable to have as few "waves" of links as possible. With respect to bandwidth, it is preferable for each node to have as much bandwidth as both of its children combined, so that the parent's bandwidth limits do not slow down the propagation algorithm. With respect to weighting, the most available nodes should be higher in the spanning tree. To meet these requirements, the IDDB uses a balanced binary tree as the spanning tree.
An algorithm for generating a balanced binary spanning tree is shown below.
parm .omega.=weight of site, s .epsilon. S.sub.a (higher is better)
begin sort the list of sites in descending order by weight (* The first node in the list is the root. Each i.sup.th node is the parent of the 2i.sup.th and 2i+1th nodes *)
end
where S.sub.a is the set of all sites attached to activity a .epsilon. A; also known as a's activity group A is the set of all of activities
Two assumptions are made for the spanning tree algorithm. First, the network is a clique, i.e. any node can link with any other node. Second, each node has a single bandwidth value representative of the typical relative communication channel speed between this node and any other node.
The spanning tree generated by the algorithm above has the following features. First, the spanning tree is a minimum-height balanced binary tree. Second, every non-root node has weight no greater than its parent's weight. More generally, every non-root node has a weight no greater than any node in the next higher level. Third, every leaf node has a weight less than or equal to the weight of any internal node. (Note that the leaf nodes need not all be on the bottom level.) Fourth, where node a and node b are on the same level of the tree and the weight of a is greater than the weight of b, the parent of a will have at least as great as the weight of b's parent. The higher-weight nodes at each level are always children of the higher-weight nodes in the next higher level.
In another aspect, the spanning tree algorithm may be optimized as follows:
let k be the smallest integer satisfying N.ltoreq.2k-1 (where N is the number of nodes),
if N.ltoreq.2k-1 (i.e., the bottom level of the tree is incomplete) insert an empty space in the list in front of the last node,
if there is still room at the right of the bottom level, insert another empty space in front of the next-to-last node,
the last step is repeated until there is no more room in the bottom level or the 2.sup.k-1 th mode (i.e. the left-most node in the bottom level) is reached.
Advantageously, the optimization of the spanning tree algorithm preserves the four features (described above) and better distributes the loading between the bottom two levels of the spanning tree.
Reference is next made to FIG. 6(a) which shows an example of a complete balanced spanning tree 70 generated by the algorithm. The spanning tree 70 comprises fifteen spine sites denoted by 72. The sites 72 have similar availability, and their relative bandwidths are 3, 8, 3, 22, 1, 10, 3, 4, 27, 2, 6, 14, 7, 5, 2. Using bandwidth as the only relevant weighting factor, the sites 72 are ordered by weight, 27, 22, 14, 10, 8, 7, 6, 5, 4, 3, 3, 3, 2, 2, 1. Next, the first node is taken as the root and each i'th node is taken as the parent of the 2i'th and 2i+1th nodes, to generate the balanced spanning tree 70 as shown in FIG. 6(a).
Reference is next made to FIG. 6(b) which shows an example of an incomplete spanning tree 74 generated by the algorithm. The spanning tree 74 comprises 25 spine sites denoted generally by 76. The spine sites 76 all have similar availability, and their relative bandwidths are given as 3, 8, 3, 22, 1, 10, 3, 4, 27, 2, 6, 14, 7, 5, 2, 12, 7, 5, 19, 2, 1, 4, 11, 1, 3. Using bandwidth as the weighting factor, the spine sites 76 are ordered as 27, 22, 19, 14, 12, 11, 10, 8, 7, 7, 6, 5, 5, 4, 4,
3, 3, 3, 3, 2, 2, 2, 1, 1, 1. Applying the algorithm and inserting a space in the ordered list where possible in front of the last nodes produces the spanning tree 74 as shown in FIG. 6(b).
It will be understood that the assumption that each node has a single bandwidth value representing the communications channel speed with any other node will not always be a good approximation. For example, in the case of many individual workstations connected to the same WAN, or to the Internet via service providers, the communication speed of site will vary with the communication of other sites due to several factors.
First, external network bottlenecks will affect the speed of the communications channels. A node may be forced to connect with some nodes through a busy remote bottleneck while being able to avoid those bottlenecks when connecting with other nodes. For example, a node in New York usually has faster connections with other nodes in the United States than with nodes in Europe because there is typically lower available bandwidth in cross-Atlantic channels (i.e. undersea cable and satellite links) than in intra-continental channels.
Secondly, external network proximity will affect communication speeds. External network proximity is related to external network bottlenecks, in that, the entire Internet beyond the local network neighborhood or even the local service provider is viewed as a bottleneck. For example, a node in New York usually has faster connections with another node using the same service provider than with other nodes because bandwidth is generally determined only by the capacity of the communication hardware for the two nodes and the capacity of the router at the Internet Service Provider (ISP). Thus, neither node is affected by the speed of the ISP's own connection(s) with the rest of the Internet or beyond.
Thirdly, multiple local physical network connections will also affect the response of the communications channels. A node may have several physical network connections with different bandwidths, and therefore connections to some other nodes (available through the faster network channels) will always have superior performance than connections to others (available only through the slower channels). For example, a node in New York always has faster channels to another node on the same physical LAN than it has with nodes linked via modem connections to the Internet.
Since there can be substantial variation in the channel speed between pairs of sites, each site may store its bandwidth with other sites. Given the communications bandwidths between all pairs of sites in S'.sub.a (where S'.sub.a =S' .andgate.S.sub.a the set of all spine sites attached to an activity group) taken as weights of the links (edges) in the graph of the network, it is possible to use any general-purpose spanning-tree generation algorithm to produce a spanning tree. In the context of the present invention, it is preferable to minimize the diameter of the spanning tree rather than maximizing the use of all available highest-bandwidth links.
Reconfiguring the Spanning Tree
Whenever a spine site is added to the activity group, the new spine site is initially assigned an "entering" state. The entering state is propagated through the spine sites of the activity group, and the new spine site is treated as a non-spine site until all existing spine sites have reported that they have seen the new spine site, i.e. by propagation to the spine sites in the activity group. Once reported, the new spine site is then promoted to a normal state and is treated as a spine site. This procedure is preferred to prevent propagation problems so that all spine sites always agree on the set of spine sites in the activity group.
Similarly, whenever a spine site is to be removed from the activity group, the departing spine site is assigned a "leaving" state. The leaving state for the spine site is propagated through the spine sites of the activity group and the departing spine site continues to be treated as a spine site until all existing spine sites have reported (by propagation, to all the spine sites in the activity group) that they have seen the removal request. The departing site is then assigned a "left the group" state and is no longer treated as part of the activity group.
When spine sites are added to or removed from the activity group, the spanning tree must be reconfigured. The spanning tree is reconfigured using a similar election, or equivalent algorithm to ensure consistency and accuracy even when not all sites are on-line simultaneously in order to agree on the change in real time. Preferably, the reconfiguration does not result in the activity group splitting into separate sub-graphs (as is possible with some simple reconfiguration algorithms).
Repairing the Spanning Tree
In addition to adding and removing spine sites from the activity group, there are two situations that call for the reconfiguration of the spanning tree for an activity group. The first comprises a node failure and the second is a node status change. The node status change is handled as a node reconfiguration described above.
The failure of node is handled as follows. When a working node attempts to contact a failed node, the failed node will not answer. The working node periodically retries contacting the failed node, and upon reaching a timeout or a threshold number of failed links, a "reconfigure" operation as described is initiated. If the failed node never returns, the remaining activity group network is intact and recovery is complete. If the node failure is temporary, the node is still reconfigured out of the spanning tree after the timeout. When the node returns, it will initiate a link with some other node during a propagation attempt. If the returning node was not the root, then it will attempt to link with its former parent during the upward wave of links. If the returning node was the root, it will attempt to link with its former children during the downward wave of links. Regardless of which node it contacts, the returning node will see the calling site as in a "left the group" state, thereby informing the returning node that it has been reconfigured out of the spanning tree (i.e. activity group). The contacted node then initiates a normal "add node" reconfiguration to reattach the returning node as though it had never been in the tree.
The bandwidth values utilized by the spanning tree generation algorithm are provided manually or automatically. The bandwidth values may be entered by the administrator at each site based on known up-time and communications hardware numbers. While such an approach is workable, it relies excessively on human intervention, and automatic generation of the bandwidth values is preferred. According to this aspect, the system includes a bandwidth measurement procedure for calculating the bandwidth and availability rankings of each node, or in the general case, of each communications channel based on actual traffic flow. Advantageously, this feature allows the bandwidth values to be dynamically adjusted to reflect ongoing changes in loading and communications equipment. Preferably, the bandwidth measurement procedure accounts for all IDDBMS traffic, by measuring more than one session to ensure that the measurement is not limited to a low traffic period.
Next, the procedure for mapping non-spine sites to spine sites according to the present invention is described.
According to this aspect of the invention, the IDDB system includes a method for a non-spine site to independently select a spine site so that the total mapping is evenly distributed among the spine sites. Because all sites in an activity group, locally store the set of sites belonging to the activity group, this information is available to a non-spine site when selecting a spine site for attachment.
The operation of the procedure for mapping non-spine sites according to the present invention is described with reference to the following pseudo code listing.
Non-Spine-To-Spine Mapping
__________________________________________________________________________ parm .omega..sub.s = weight of site S.epsilon. S.sub.a (higher is better) var s :list of nodes init S'.sub.a ; (*List of all spine sites in a's activity group.*) n :list of nodes init S.sub.a -S'.sub.a ; (*List of all non-spine sites in a's activity group*). map.sub.1 :integer init 0; (*The i'th non-spine node maps to the map.sub.i th spine node*) pos.sub.i :integer init 0; (*The ith node is the pos.sub.i th to be mapped to map.sub.i.*) try, offset :integer; Initial Mapping: begin sort s in descending order by weight sort n in descending order by weight for i = 0 to .vertline.n.vertline.- 1 do begin map.sub.i = i mod.vertline.s.vertline.; 1 #STR1## end end For ith Node, When Linking At Runtime: begin if map.sub.i is available then select it; else begin (*Try the other spine sites successively, but try them in different order than the other non-spines reporting to this failed spine site (as far as possible) by trying every pos.sub.i th site (mod.vertline.s.vertline.) starting from map.sub.i.*) try = 1; offset = 0; while try <.vertline.s.vertline. and no node has been selected do begin if (map.sub.i - try(pos.sub.i)) mod.vertline.s.vertline. = map.sub.i then offset - offset + 1; if the (map.sub.i - try(pos.sub.i)+offset) mod .vertline.s.vertline.th node is available then select it; try = try + 1; end end (*Continue regular link with selected node.*) end __________________________________________________________________________
The algorithm for mapping non-spine sites as shown above assumes that any non-spine site is free to link with any spine site. The algorithm is run locally at each non-spine site and results in a balanced load distribution for non-spine to spine site links.
Advantageously, the algorithm for mapping non-spine sites-to-spine sites generates a mapping with the following features. First, the number of non-spine sites reporting to each spine site is balanced. Secondly, the higher the weight (i.e. bandwidth) of a spine site, the higher the total weight of the non-spine sites mapped to it. Thirdly, as a result of the failure rollover provision, the non-spine nodes mapped to an unavailable spine node are distributed fairly among the available nodes. In the case of a massive failure, e.g. where a large number of spine sites are simultaneously unavailable, the algorithm still provides a fair distribution. By choosing -try(pos.sub.i) (rather than +try(pos.sub.i) in the central distribution expression (see algorithm above), the higher-weight non-spine sites in a failed group will tend to try the higher-weight spine sites first, while the lower-weight non-spine sites in the group will try the lower-weight spine sites first.
Reference is next made to FIG. 7 which shows an example of non-spine site to spine site mapping generated by the algorithm described above. The mapping is denoted generally by reference 80 and comprises spine sites 82 (shown individually as 82a,
82b, 82c, . . . ) and non-spine sites 84 (shown individually as 84a, 84b . . . ). In the example for FIG. 7, the spine sites 82 have the following bandwidths 22, 8, 19, 3 and 16, and the non-spine sites 84 have the bandwidths 5, 9, 3, 16, 2, 1, 13, 4,
2, 8, 3, 21, 7, 4, 3, 17, 1, 3. Following the ordering steps, the spine sites 82 and non-spine sites 84 are arranged as follows:
spine sites:
22 19 16 8 3
non-spine sites:
21 17 16 13 9 8 7 5 4 4 3 3 3 3 2 2 1 1
Following the mapping step, the mapping 80 is produced as shown in FIG. 7. From FIG. 7, it can be seen that the spanning tree 80 is balanced, and the higher the weight of a spine site, the higher the total weight of its mapped non-spine sites. For example, spine site 82a has the highest weight (i.e. 22), and accordingly, the total weight of the non-spine sites 84a-84d is the highest (i.e. 34). If spine site 82d (weight-8) is not available, then the algorithm tries to connect the non-spine site 84e (weight-13) to the weight-16 site 82c, the weight-19 site 82b, the weight-22 site 82a, or weight-3 site 82e and in that order. The algorithm then tries to connect the weight-4 site 84f to the weight-19 spine site 82b, to the weight-3 site 82e, to the weight-16 site 82c, or to the weight-22 site 82a and in that order. The algorithm then tries to connect the weight-3 non-spine site 84g first to the weight-22 spine site 82a, the weight-16 site 82c, the weight-3 site 82e, or the weight-19 site
82b, and in that order.
According to this aspect of the invention, whenever a non-spine site is added or removed from the activity group, the only effect is to change the non-spine to spine site mapping. It is however to be understood that when any site is added or removed from the activity group, the non-spine nodes or sites will not be aware of the change until after their next link. This means that there will be a window of one or more links for each non-spine site where it will choose a spine site based on old information, thereby resulting in a loading which is not optimally distributed. The effect will however be temporary and is self corrected as the nodes become aware of the changes to the activity group.
Relative Clocks and Reference Time Agreement in the IDDB
As described above the IDDB 1 comprises a database which is distributed or spread over the sites belonging to the application network, and each site works independently on its own data. To propagate changes made to the database, the IDDB includes a procedure for updating changes to fragments at different sites in the activity group. In the context of the present invention, a fragment is a piece of an individual record and comprises a subset of columns in a record. The underlying principle for this aspect of the present invention is that the "most recent fragment survives".
The implementation of the "most recent fragment survives" procedure depends on two sites being able to agree on the age of a fragment. According to this aspect of the invention, it is not necessary that two sites agree on the actual date and time the fragment was changed, rather the two sites may safely share data as long as they can reliably agree on the age of the fragment. This is accomplished by storing a time-stamp with the fragment at the site, and according to this aspect of the invention every fragment has exactly one time-stamp.
The time-stamp comprises an actual date-and-time time-stamp relative to the clock of the local site clock recording when the fragment was last modified. Thus, a fragment will usually have different time-stamps at different sites, but as the site clocks move forward, the age of the fragment increases naturally and all sites are able to agree on the fragment's age at any given time. In other words, each fragment will have a relative age determined by its actual time-stamp at any site relative to the system clock of that site.
For example, site A (e.g. site 14b in FIG. 1) has a fragment f.sub.A with a time stamp t.sup.f.sub.A. The local time for site A is t.sub.A and the local time a site B (e.g. site 10a in FIG. 1) is t.sub.B. If the time of the fragment t.sup.f.sub.A is 1:00 (hours:minutes), the local time at site A is t.sub.A =3:00 and the local time at site B is t.sub.B =3:10, then the time-stamp for the fragment at site B is t.sup.f.sub.B is 1:10. While sites A and B do not agree on the actual time the fragment was last modified, the age of the fragment is two hours at both sites, i.e. t.sub.A -t.sup.f.sub.A =2:00 and t.sub.B -t.sup.f.sub.B =2:00. Without any changes to the local clocks, the sites A, B at time t+1:00 will agree that the fragment f is 3:00 hours old, at time t+6:00 the sites will agree that the fragment f is 7:00 hours old, and so on.
Because the local system clocks at the sites may be unreliable, e.g. easily reset by the user or not tamper-resistant, the clocks are vulnerable to drift and various degrees of deliberate tampering. In another aspect, the present invention includes a procedure for drift-resistant clocks as described below.
According to this aspect, all sites agree on the age of any fragment the sites share in common, within a tolerance of at most .delta., according to the invariant given by the following expression:
Fragment Age Agreement Invariant ##STR2## where .delta. is a constant; S is the set of all sites; F.sub.s is the set of fragments at site s; a.sup.f.sub.s is the age of fragment f at site s.
where .delta. is related to the maximum acceptable amount of clock drift between any two sites. Typically, the acceptable clock drift .delta. will have a value of 1.1 hours so that if one site changes to or from daylight savings time before another site, the site will still be able to communicate.
According to the invention, the relative clock procedure establishes a time invariant when a record is created, by requiring that all fragment time-stamps for a newly created record be set to the current time of the local system clock. Once created at one site, the fragments may be propagated to other sites and the invariant is preserved.
There are two principal situations where sites need to compare or transmit time-stamps for fragments. The first situation involves determining what fragments are new by comparing their ages. The second situation concerns determining the fragment time-stamps when actually transmitting the fragments. In this aspect, in every on-line conversation, the two sites first agree on a reference time for the start of the current conversation. The sites then compare and transmit all fragment time-stamps, not as actual time-stamps, but as fragment ages expressed as offsets from the agreed reference time. Each site then stores the reference time in terms of its local clock. The two sites must also control relative clock drift between them over time, so that fragments at one site do not age faster than at another site. It will be appreciated that such a result would cause inconsistency and integrity problems. To control relative clock drift, the sites remember historical statistics about the relative time differences between their local clocks and use this information to decide whether fragments may be safely transmitted while preserving the invariant. (A procedure for drift-resistant clocks is described in more detail below.)
The operation of the procedure for a two-message protocol to establish agreement on a reference time between sites is described with reference to the following pseudo code listing.
Reference Time Agreement (Two-Message Version)
__________________________________________________________________________ var t.sub.start, t.sub.end :time; (*Local timing for echo message's send/return*) t.sub.ref :time; (*Result: agreed reference time (in terms of local clock)*) For the initiator: For the receiver: begin begin t.sub.start = current.sub.-- time; send (token.sub.request); (*M1*) receive (token.sub.request); t.sub.ret = current.sub.-- time; send (token.sub.reply); (*M2*) receive (token.sub.reply); t.sub.and = current.sub.-- time; 2 #STR3## end end __________________________________________________________________________
At the beginning of a conversation between two sites, i.e. an initiator and a receiver, the sites agree on a reference time using the two-message protocol. According to the protocol, the initiator site S1 sends one message, M1, and the receiver site S2 sends one reply, M2. The receiver site S1 uses the arrival time of the message M1 as its reference time. The sender site S2 calculates the time interval from the time message M1 was sent t.sub.start to the time message M2 was received tend, and the midpoint of this time interval becomes the reference time for the sending site S2.
The total transmit time should be less than twice a maximum allowed skew .sigma..sub.max. If the total transmit time exceeds the maximum allowed skew .sigma..sub.max, the two-message procedure is repeated until the tolerance is achieved or until a timeout is reached. In a practical system, a maximum skew .sigma..sub.max of 1 second is typical to obtain better-than-one-second timing accuracy, which means that the timing pings must have round trip times under 2 seconds. For networks with slower response times, a slightly higher value for the maximum skew .sigma..sub.max is chosen.
According to this aspect of the invention, there is also provided a procedure utilizing a three-message protocol for reference time agreement. The three-message reference time agreement procedure is described with reference to the following pseudo code listing.
Reference Time Agreement (N-Message Averaged Version, N=3)
__________________________________________________________________________ var t.sub..DELTA.ir :time; (*Local estimate of difference in initiator and receiver clocks*) t.sub.start, t.sub.end :time; (*Local timing for echo message's send/return*) t.sub.remote :time; (*Remote time stamp for receipt of echo message*) t.sub.remote.sbsb.--.sub..DELTA.ir :time: (*Remote estimate of difference in initiator and receiver clocks*) t.sub.ref :time; (*Result: agreed refernce time (in terms of local clock)*) For the initiator: For the receiver: begin begin t.sub.start = current.sub.-- time; send (token.sub.request); (*M1*) receive (token.sub.request); ,t.sub.start = current.sub.-- time; receive (t.sub.remote); send (t.sub.start); (*M2*) t.sub.end = current.sub.-- time; 3 #STR4## t.sub.ref = t.sub.end ; send (t.sub.ref,t.sub..DELTA.ir); (*M3*) receive (t.sub.remote,t.sub.remote.sbsb.--.sub..DELTA.ir); . t.sub.end = current.sub.-- time; 4 #STR5## if t.sub..DELTA.ir <t.sub.remote.sbsb.--.sub..DELTA.ir begin swap(t.sub..DELTA.ir,t.sub.remote.sbsb.--.sub..DELTA .ir); end 5 #STR6## end end __________________________________________________________________________
According to the three-message protocol, the initiator site S1 sends a message M1, the receiver site S2 replies with message M2, and the initiator site S1 then replies with message M3. The initiator site S1 calculates the midpoint between the time when message M1 was sent and the time message M2 was received. This value is compared with the local time at the receiver site S2 when the message M1 was received in order to estimate the relative difference between the clocks of the two sites. The receiver site S2 repeats these operations with the messages M2 and M3. The two estimates are then averaged, and the initiator site S1 chooses a reference time relative to its local clock, and the receiver site S2 uses the average difference to calculate the corresponding reference time relative to its own local clock, and thus, both local reference times will reflect the same real time.
As described, the three-message reference time procedure utilizes an extra message in order to achieve better reliability on inconsistent communication lines. Both sites S1 and S2 estimate the difference in the initiator and receiver clocks t.sub..DELTA.ir and then use the average to determine local reference times. Using the three-message protocol, the system is less sensitive to transient communication latencies than in the first two-message protocol.
It will be appreciated that the three-message protocol described above can be extended to any number of messages, where each extra message completes another pair that can be used to obtain an additional comparison point, all of which are used in the final average of estimated differences between the local clocks a the two sites S1 and S2.
It is noted that the N-message procedure is preferable to the two-message procedure for two important reasons. First, the greater the number of messages used, the less the probability of reaching the worst-case skew. Secondly, the two-message procedure (described above) allows only one site, i.e. the initiator site S1, enough information to estimate the actual clock difference t.sub.ir between the two sites, i.e. only the initiator S1 sees and times a full round-trip message pair.
As in the case for the two-message procedure, the total transmit time for any message pair should be less than twice the allowed maximum skew .sigma..sub.max. If not, the message pair is not considered as a data point and the procedure continues until the desired number of valid message pairs is reached or until a timeout threshold is reached.
Once the two sites S1 and S2 agree on a reference time, the sites proceed with transmitting all fragment time-stamps as delta's or offsets from the reference time. For example, consider sites A and B which have agreed on their reference times t.sub.A =3:00 (local time at site A) and t.sub.B =3:10 (local time at site B), and the local time stamp of the fragment t.sup.f.sub.A is 1:00. The age of the fragment a.sup.f.sub.A is then 2:00 (hours:minutes), i.e. t.sub.A -t.sup.f.sub.A. If the fragment f does not exist at site B but should be replicated at B, then the fragment f is transmitted to site B along with its age a.sup.f.sub.A (as known to site A). The fragment f is stored at site B and the age of fragment at site B is a.sup.f.sub.B =2:00 (i.e. t.sub.B -t.sup.f.sub.B) and the invariant is preserved. If the fragment f exists at site B and has the same age as at site A (i.e. a.sup.f.sub.A =a.sup.f.sub.B), then no further no adjustments need be done and the invariant is preserved. If the fragment f exists at site B but is older than the fragment f at site A (i.e. a.sup.f.sub.A <a.sup.f.sub.B), then the newer fragment f at site A is transmitted to site B along with its age a.sup.f.sub.A =2:00. The newer fragment f is stored at site B and the invariant is preserved. The other situation occurs where the fragment f at site B is the newest (i.e. a.sup.f.sub.A >a.sup.f.sub.B), and in this case, the fragment is transmitted to site A with its age a.sup.f.sub.B (as known to site B), and stored at site A, and the invariant is preserved.
Drift-Resistant Clocks
In another aspect, the present invention provides a method for correcting accumulated clock drift which may occur between sites in the network.
The effects of accumulated clock drift are first considered by way of the following example.
Sites A and B which share a fragment f. Initially, the clocks at both sites A and B are accurate. On June 1, a user at site A updates fragment f and the update needs to be reported to site B. Site A links to site B and updates fragment f at site B, and disconnects. At this point, the time-stamps for fragment f between the sites A and B are consistent, i.e. both may show slightly different actual time-stamps, but both will agree on the age of the fragment.
If the clock at site B is altered, e.g. the clock at site B is turned back six days. On June 5, a user at site A updates fragment f, and on June 7, site A links to site B. In the absence of any mechanism to check the desynchronization between the times at site A and site B, the replica of fragment f at site B will appear to be more recent (i.e. one day old) whereas the fragment f at site A which will appear to be older (i.e. two days old). This occurs even though the fragment f at site B is actually an old update from site A. The older version of the fragment f will overwrite the newer version of the fragment and consequently the changes to the fragment f at site A will be lost. (A similar result occurs when a site's clock is moved forward.)
A partial solution to this problem involves having each site store the last delta clock difference between the site and any site to which it links. When the sites link again, the sites compare clocks and also compare the delta to the delta stored from the last link. If the two delta's are off by more than a set amount (e.g. 1.1 hours, to allow for daylight savings time differences), or if the two delta's differ by more than the elapsed time since the last successful link, then the link is rejected. It will be appreciated that this approach puts an upper bound on the clock changes that can be permitted and detected between two consecutive links. If only the delta from the last link is compared, then multiple changes in the same direction from link to link will allow two sites' clocks to drift with respect to each other over time. For example, this can happen if one clock runs slightly faster or slower than another. It can also open a window opportunity for attackers.
According this aspect of the invention, the problem of clock drift between sites in the network is addressed by extending each site's knowledge of its link histories. Instead of storing the delta from the last link with each other site, each site i stores the maximum and minimum historical delta's t.sub..DELTA.ijmax and t.sub..DELTA.ijmin, determined in all past links to each other site j. According to this aspect of the invention, problems resulting from gradual drift over time are eliminated by preserving the invariant according to a procedure as illustrated with reference to the following pseudo code listing.
Clock Drift Control Procedure
__________________________________________________________________________ var t.sub..DELTA.ir.sbsb.max, t.sub..DELTA.ir.sbsb.min :time; (*Max and min deltas to remote site r, stored across successive links*) For the first link between i and r. begin (*Calculate this session's t.sub..DELTA.ir using a reference time agreement protocol see above*) t.sub..DELTA.ir.sbsb.max = t.sub..DELTA.ir ; t.sub..DELTA.ir.sbsb.min = t.sub..DELTA.ir ; end For each subsequent link: begin (*Calculate this session's t.sub..DELTA.ir using a reference time agreement protocol see above*) (*This t.sub..DELTA.ir must not cause t.sub..DELTA.ir.sbsb.max - t.sub..DELTA.ir.sbsb.min to exceed .delta. - .sigma..sub.max *) if(t.sub..DELTA.ir.sbsb.max - (.delta. - .sigma..sub.max) > t.sub..DELTA.ir or t.sub..DELTA.ir.sbsb.min + (.delta. - .sigma..sub.max ) < t.sub..DELTA.ir) reject link; else if (t.sub..DELTA..sbsb.ir > t.sub..DELTA.ir.sbsb.max) t.sub..DELTA.ir.sbsb.max = t.sub..DELTA.ir ; else if (t.sub..DELTA.ir < t.sub..DELTA.ir.sbsb.min ) t.sub..DELTA.ir.sbsb.min = t.sub..DELTA.ir ; __________________________________________________________________________
It will be understood that the value for .delta. should be chosen as low as possible in the importance of catching gradual clock drifts. In applications where daylight savings time changes are not important, or could produce errors, choosing a lower value for .delta., e.g. ten minutes, will further reduce exposure to tampering attempts.
If the clocks at two sites are altered by the same amount and in the same direction, the delta remains the same and the two sites are still be able to communicate. This is acceptable because it means that the two sites will still agree on the relative ages of all fragments they hold in common and therefore they may still safely share data.
There may, however, be situations where the clock of one site is tampered with between communication sessions in such a way that the clock drift control procedure shown above cannot detect the change. For instance, a user might change the system clock forward, or backward, make changes to the database resulting in fragments whose apparent ages are less, or more, than their true ages and real time, and then reset the system clock to correct the time. When the site connects to other sites, its clock will appear to be acceptable according to the above protocol but its fragments will have ages that are incorrect.
To handle such situations, the replication engine includes a procedure to log an audit trail whenever a future-dated time-stamp is encountered. The processing of the fragment with logged time-stamp is then up to the application designer. In one scenario, the fragment is deemed corrupt and rejected. The fragment is overwritten by any other version of the same fragment obtained from another site.
In another scenario, the fragment is treated as current. The fragment's time-stamp is set to the current time, stored in the local database at the site, and then the fragment is replicated normally by the replication engine. Such an implementation allows a legitimate user to change a record and set the time-stamp to the future so that the change remains in the database and cannot be undone by other users, but future changes are still able to override and propagate normally. In a similar fashion, the fragment is back-dated by a legitimate user to appear to be done in the far past. Because such a change appears to be older, the fragment is replaced by either the original version of the fragment that was changed or by newer versions of the fragment changed at other sites.
Checked Clocks
As described above, the previous procedures maintain system integrity even in the presence of local clock tampering at one or more sites. In another aspect, the present invention includes a checked clock procedure to better manage local site clocks.
The checked clock procedure has two principal features. First, the checked clock procedure prevents many tampering attempts and accidental clock errors in the first instance. Secondly, the procedure ensures consistent clock time-stamps within a site. Because each site may have separate workstations, there may be several mutually inconsistent clocks. Therefore, if several workstations update local fragments at the same real time, the fragments should still have consistent time-stamps, and therefore ages, regardless of the clock setting of the local workstation through which the update was performed.
The checked clock procedure first identifies one machine and one process at each site which has fairly high availability and can act as a designated time keeper. For example, the replication engine 24 (FIG. 4) is suitable. Each workstation checks its clock and defers to the replication engine's checked clock, so that on start-up, each application uses a reference time agreement protocol to determine the time t.sub..DELTA.wr, the difference between the clock of the workstation and the reference clock of the replication engine for example. Whenever a fragment is modified, its time-stamp is set to the workstation's local clock offset by the time t.sub..DELTA.wr. According to this procedure, two fragments updated at the same time and at the same site by two different workstations will differ in age by at most 2 .sigma..sub.max.
The next step in the checked clock procedure involves protecting against clock changes after the replication engine and the workstations have been started. On many platforms, there are calls to get the time since the system was started and the results will be consistent even if the user changes the system clock. If available, these calls can be easily used instead of the usual clock interrogation functions to generate correct time-stamp data. No further checking is necessary except to handle the case where the tick count wraps back to zero.
In a system where there is no tamper proof way to measure the passage of time since the initial reference agreement, for example tick counts, the checked clock procedure relies on the system clock as follows. Each machine's checked clock, including the replication engine's clock, sets up a system timer that will invoke a call-back function at regular intervals, for example every 60 seconds. On each timer event, the checked clock procedure compares the current system time with the stored system time from the last event. The difference should b