United States Patent Application20030217033
Kind CodeA1
Sandler, Zigmund ; et al.November 20, 2003

Database system and methods
Abstract
In general, in one aspect, the invention relates to a method for transaction processing. The method includes specifying metadata and storing the metadata. An index is created in response to the stored metadata. The method also includes receiving a transaction, generating an index log of changes to the index in response to the received transaction, and modifying the first index in response to the generated index log.

Inventors:Sandler; Zigmund (Springfield, NJ), Seroff; Vladimir  (New York, NY), Riecke; Jon G.  (Maplewood, NJ), Kolodzieski; Scott J.  (Millington, NJ)
Correspondence Name and Address:HIGH STREET TOWER 125 HIGH STREET
TESTA, HURWITZ & THIBEAULT, LLP
BOSTON
MA
02110
US
Series Code:150763
Filed:May 17, 2002
U.S. Current Class:707/1
U.S. Class at Publication:707/1
Intern'l Class:G06F 007/00

Claims


What is claimed is:
1. A method for transaction processing, comprising the steps of: specifying metadata describing database elements and relationships between the database elements; storing the metadata; creating a first index in response to the stored metadata; receiving a transaction; generating an index log of changes to the first index in response to the received transaction; and modifying the first index in response to the generated index log.

2. The method of claim 1, further comprising specifying database elements required for a query.

3. The method of claim 1, wherein the specifying step further comprises specifying data dependencies between database elements.

4. The method of claim 1 wherein the specifying step is performed by a user interface subsystem.

5. The method of claim 1 wherein the receiving step, the generating step, and the modifying step are performed by a transaction subsystem.

6. The method of claim 1 wherein the transaction is an insert.

7. The method of claim 1 wherein the transaction is an update.

8. The method of claim 1 wherein the transaction is a remove.

9. The method of claim 8, further comprising: determining, based on the metadata and the log, additional database changes required by the transaction.

10. The method of claim 1 wherein the step of modifying the first index is performed when the transactions are complete.

11. A system for transaction processing, comprising: a database for storing metadata; a transaction processing subsystem for creating a first index in response to the stored metadata, receiving a transaction, generating an index log of changes to the first index required by the received transaction, and modifying the first index based on the generated index log.

12. The system of claim 11, wherein the transaction processing subsystem comprises: a transaction manager for managing the state of the transaction processing subsystem; an adapter manager for receiving information on transaction sources from the transaction manager; an adapter listener in communication with the adapter manager for receiving transaction data from a queued transaction data source; a resource manager in communication with the adapter listener for receiving data from the adapter listener and for logging collected data; a minimum recalculation engine for recalculating portions of database tables in response to the logged collected data and the metadata; an internal check point manager for committing logged collected data to a database table; and an external checkpoint manager for publishing changes to target databases.

13. The system of claim 12 further comprising a second resource manager in communication with a second adapter listener for receiving transaction data from a second queued transaction data source and for logging collected data.

14. The system of claim 12 further comprising a second minimum recalculation engine for recalculating portions of database tables in response to the logged collected data and the metadata.

15. The system of claim 14 wherein the minimum recalculation engine performs calculations only on data that has changed.

16. The system of claim 14 wherein the minimum recalculation engine performs table operations on indicies that represent results of operations.

17. A database system, comprising: a database storing data and storing metadata describing database elements and relationships between the database elements; a transaction processing subsystem for processing transactions and updating the database; and a user interface for querying the database and continuously providing updated query results as transactions are processed by the transaction processing subsystem.

18. The system of claim 17 wherein the system further comprises a second database storing data, and the transaction processing system updates the first database and the second database.

19. The system of claim 17 wherein the user interface comprises a web browser in communication with a web server, wherein the web server provides the results to the web browser for display to a user.

20. A database system, comprising: a vectorized database storing table data fields linearly as a block of contiguous data; a database storing metadata describing database elements and relationships between database elements; a user interface for querying the database and providing updated query results; and a batch processing subsystem for processing batch queries.

21. The method of claim 20 wherein the batch processing subsystem uses a memory model selected from the set of vertical partitioning, horizontal partitioning, and a blend of vertical partitioning and horizontal partitioning.

22. The system of claim 20, further comprising a transaction processing subsystem.

23. The system of claim 20, wherein the user interface further comprises a web browser in communication with a web server.

24. The method of claim 20, wherein the vector database operates on data as vectors.

25. A method for external checkpointing, comprising the steps of: initially communicating a data table and a log comprising entries of data table transactions to a subscriber; and communicating additional log entries to the subscriber when they are received.

26. The method of claim 25, further comprising the steps of determining that the number of log entries is above a predetermined threshold, applying the log entries to the data table, and communicating the updated data table to the subscriber.

27. The method of claim 25 wherein the subscriber comprises an OLAP server.

28. An external checkpointing subsystem, comprising a transmitter for initially communicating a data table and a log comprising entries of data table transactions to a subscriber; and for communicating additional log entries to the subscriber when they are received.

29. The method of claim 28 wherein the subscriber comprises an OLAP server.

30. A method for fault-recoverable, non-blocking checkpointing of table data, comprising the steps of: storing a first copy of a data table and a second copy of the data table; receiving a log comprising entries of data table transactions; applying the log entries to the first copy of the table; swapping the first copy of the table and the second copy of the table; and applying the log entries to the second copy of the table.

31. The method of claim 30, wherein the first copy of the data table and the second copy of the data table are stored on disk.

32. The method of claim 31, wherein the swapping step comprises renaming the first copy of the table and renaming the second copy of the table.

33. The method of claim 30, wherein the logs comprise inserts, edits, and deletes to the data table.

34. The method of claim 30, wherein the step of applying the logs to the first copy of the table comprises modifying the copy of the table in response to the log entries.

35. The method of claim 30, further comprising archiving the logs once they are applied to the second copy of the table.

36. The method of claim 30, wherein the step of receiving a log comprises receiving a log of data table transactions comprising entries already applied to the first copy of the table before an interruption and entries not applied to the first table before the interruption, and wherein the step of applying the log to the to the first copy of the table comprises applying the unapplied log entries to the first copy of the table.

37. The method of claim 30, wherein and wherein the step of applying the log entries to the second copy of the table comprises applying unapplied log entries to the second copy of the table.

38. A checkpointing system for fault-recoverable, non-blocking checkpointing of table data, comprising: a data store for storing a first copy of a data table and a second copy of the data table; a receiver for receiving logs of data table transactions; an first updater for applying the logs to the first copy of the table; a swapper for swapping the first copy of the table and the second copy of the table; and a second updater for applying the logs to the second copy of the table.

Description



FIELD OF THE INVENTION

[0001] The invention relates to computer-based data processing, more particularly, to database management systems.

BACKGROUND OF THE INVENTION

[0002] Data in database management systems are typically stored in the form of records, or tuples, each composed of a fixed number of fields, also referred to as attributes. The fields of a record contain the data associated with that record. Frequently, database records are presented logically in the form of a table, with records as the rows of the table, and attributes as the columns. Systems typically store records in memory and/or on disk or other media as a linked list, with data for each record stored together. However, the data for adjacent records or even adjacent values of the same field are not necessarily stored in any particular proximity or order.

[0003] The manner in which the data is stored presents inherent limitations on the performance of database systems. For example, typically, on-line transaction processing (OLTP) is performed using one database system or computer, and on-line analytical processing (OLAP) is performed on another computer. Data typically is offloaded from a transaction processing system to a data warehouse, and the data in the data warehouse is used for analytical processing. There frequently are significant time delays associated with the transfer of data from one system to another. Analytical processing frequently takes a significant amount of time. Typically, the analytical processing on a data warehouse is performed on an entire table or set of tables, even when only a small portion of the table(s) changed. This also can be very inefficient in time and resources.

SUMMARY OF THE INVENTION

[0004] In view of the foregoing, there is a need for systems and methods to store and manipulate data so as to avoid the inefficiencies of the prior art. Embodiments of the present invention, for example, perform transaction processing and traditional analytical processing without use of a separate data warehouse, and further, are capable of providing real-time analytical processing of transaction data.

[0005] In general, in one aspect, a declarative, vectorized, metadata-enabled, relational query language supports heterogeneous and homogeneous data sources. The language allows for specification of vertical and horizontal vectorized functions, and uses a construct (for example, a "mapping" function) that can be used to simply specify complex rules and relationships such as those specified in a directed acyclic graph. Using another language such as SQL, a database user would typically need to specify query terms such as FROM, WHERE, GROUP BY, ORDER BY, and HAVING, to design such a query, but here the user only specifies relationships between table columns. User's query language code is processed and converted into metadata that describes tables and relationships between the tables. That metadata is then used throughout the system to enable features that previously were not possible, as part of batch processing, transaction processing, OLAP and SQL query processing, internal and external checkpointing, and so on.

[0006] In general, in one aspect, the invention relates to a method for transaction processing. The method includes specifying metadata and storing the metadata. An index is created in response to the stored metadata. The method also includes receiving a transaction, generating an index log of changes to the index in response to the received transaction, and modifying the first index in response to the generated index log.

[0007] The metadata can include a description of database elements and relationships between the database elements. The metadata can also include a description of database elements required for a query. The metadata can also include a functional description of the relationships between database elements, and can include a description of data dependencies between database elements.

[0008] The step of specifying metadata can be performed through use of a user interface operated by a user. In one embodiment, the user interface is a graphical user interface that allows for graphical display and modification of the metadata.

[0009] In one embodiment, the receiving step, the generating step, and the modifying step are performed by a transaction subsystem. The transaction can be, for example, an insert, update (i.e. modify), or remove (i.e. delete) operation to a data table. Based on the metadata and the log, additional database changes required by the transaction may be determined.

[0010] In some embodiments, he step of modifying the first index is performed when the transactions are complete.

[0011] Embodiments of the method can be implemented by a database system for transaction processing. The system includes metadata, a database for storing the metadata, and a transaction processing subsystem or module for creating a first index in response to the stored metadata, receiving a transaction, generating an index log of changes to the first index required by the received transaction, and modifying the first index based on the generated index log.

[0012] In one embodiment, the system includes a transaction manager for managing the state of the transaction processing subsystem; an adapter manager for receiving information on transaction sources from the transaction manager; an adapter listener in communication with the adapter manager for receiving transaction data from a queued transaction data source; a resource manager in communication with the adapter listener for receiving data from an adapter listener and for logging collected data; a minimum recalculation engine for recalculating portions of database tables in response to the logged collected data and the metadata; an internal check point manager for committing logged collected data to a database table; and an external checkpoint manager for publishing changes to target databases. In various embodiments there can be one or more of each of the elements. For example, the system can include a second resource manager in communication with a second adapter listener for receiving transaction data from a second queued transaction data source and for logging collected data.

[0013] The minimum recalculation engine recalculates portions of database tables in response to the logged collected data and the metadata. Typically, the minimum recalculation engine performs calculations only on data that has changed. In preferred embodiments, the minimum recalculation engine performs table operations on indicies that represent results of operations.

[0014] In general, in another aspect, the invention relates to a database system. The database system includes a database storing data and metadata describing database elements and relationships between the database elements; a transaction processing subsystem for processing transactions and updating the database; and a user interface for querying the database and continuously providing updated query results as transactions are processed by the transaction processing subsystem.

[0015] There can be one or more of each of these elements; the system can include a second database storing data, and the transaction processing system can updates the first database and the second database. The user interface can include a web browser in communication with a web server, wherein the web server provides the results to the web browser for display to a user.

[0016] In general, in another aspect, the invention relates to a database system. The system includes a vectorized database storing table data fields linearly as a block of contiguous data; a database storing metadata describing database elements and relationships between database elements; a user interface for querying the database and providing updated query results; and a batch processing subsystem for processing batch queries.

[0017] In some embodiments, the batch processing subsystem uses a memory model selected from the set of vertical partitioning, horizontal partitioning, and a blend of vertical partitioning and horizontal partitioning. In some embodiments, the system also includes a transaction processing subsystem, such as that described above. In some embodiments, the system also includes a user interface as described above. The user interface can include a web server in communication with a web browser or applet.

[0018] In general, in another aspect, the invention relates to a method for external checkpointing. The method includes initially communicating a data table and a log comprising entries of data table transactions to a subscriber; and communicating additional log entries to the subscriber when they are received. The method includes determining that the number of log entries is above a predetermined threshold, applying the log entries to the data table, and communicating the updated data table to the subscriber. In sone embodiments, the subscriber is an OLAP server or other database system. Embodiments of the method can be implemented in a checkpointing subsystem that includes a transmitter for initially communicating a data table and a log comprising entries of data table transactions to a subscriber; and for communicating additional log entries to the subscriber when they are received.

[0019] In general, in another aspect, the invention relates to a method for fault-recoverable, non-blocking checkpointing of table data. A first copy of a data table and a second copy of the data table are stored. A log comprising entries of data table transactions is received. The log entries are applied to the first copy of the table. When the application of the log entries to the first copy of the table is complete, the log entries are applied to the second copy of the data table. The first copy of the table and the second copy of the table are swapped, and the log entries are applied to the second copy of the table.

[0020] In one embodiment, the first copy of the data table and the second copy of the data table are stored on disk. The swapping step then includes renaming the first copy of the table and renaming the second copy of the table. The copies can also be stored in some combination of memory and on disk. In some embodiments, the logs include entries such as inserts, edits, and deletes to the data table. In some embodiments, the step of applying the logs to the first copy of the table comprises modifying the copy of the table in response to the log entries. The logs can be archived once they are applied to the second copy of the table.

[0021] This method is also applicable to fault recovery. In such a case, the step of receiving a log comprises receiving a log of data table transactions comprising entries already applied to the first copy of the table before an interruption and entries not applied to the first table before the interruption. The step of applying the log to the to the first copy of the table includes applying the unapplied log entries to the first copy of the table, or the step of applying the log entries to the second copy of the table includes applying unapplied log entries to the second copy of the table, as appropriate.

[0022] Embodiments of such a method can be implemented in a checkpointing subsystem configured for fault-recoverable, non-blocking checkpointing of table data. Such a subsystem can include a data store for storing a first copy of a data table and a second copy of the data table, and a receiver for receiving logs of data table transactions. The subsystem can also include a first updater for applying the logs to the first copy of the table; a swapper for swapping the first copy of the table and the second copy of the table; and a second updater for applying the logs to the second copy of the table. The first and second update can be the same or different updaters.

BRIEF DESCRIPTION OF THE DRAWINGS

[0023] The invention is identified with particularity in the claims. The advantages of the present invention may be better understood by referring to the following description and the accompanying drawings, in which:

[0024] FIG. 1 illustrates an embodiment of a database system in accord with the present invention;

[0025] FIG. 2 depicts an embodiment of the database module 100 of FIG. 1;

[0026] FIG. 3 shows an embodiment of the user interface module 104 of FIG. 1;

[0027] FIG. 3A illustrates another embodiment of the user interface module 104 of FIG. 1;

[0028] FIG. 4 depicts an embodiment of the data generator module 108 of FIG. 1;

[0029] FIG. 5 shows an embodiment of the external interface module 124 of FIG. 1;

[0030] FIG. 6 illustrates an embodiment of the OLAP services module 112 of FIG. 1;

[0031] FIG. 7 depicts an embodiment of the batch processor module 120 of FIG. 1;

[0032] FIG. 8 shows an embodiment of the transaction processor module 116
of FIG. 1;

[0033] FIG. 9 is a flowchart illustrating an embodiment of a method for specifying a data model and a set of computational rules accompanying the model in accord with the present invention;

[0034] FIG. 10 is a flowchart depicting an embodiment of a method for processing a g-table in accord with the present invention;

[0035] FIG. 11 is a flowchart showing an embodiment of a method for computing the expressions underlying a combined relation specified by a user in accord with the present invention;

[0036] FIG. 12 is a flowchart illustrating an embodiment of an iterative algorithm for transforming source relations into an expression using the fuse, link, and loop operators;

[0037] FIGS. 13A-B show an example of the state information that is maintained by transaction manager 802 of FIG. 8;

[0038] FIG. 14 is an example showing dependencies between tables;

[0039] FIGS. 15A-C show an example of the handling of transactions in a fuse operation in accord with minimal recalculation methods of the present invention;

[0040] FIGS. 16A-C show an example of the handling of transactions in a link operation in accord with minimal recalculation methods of the present invention;

[0041] FIGS. 17A-C show an example of the handling of transactions in a loop operation in accord with minimal recalculation methods of the present invention;

[0042] FIGS. 18A-B show an example of the handling of transactions in an aggregation operation in accord with minimal recalculation methods of the present invention;

[0043] FIG. 19 is a flowchart of a "process" function of a minimal recalculation engine in accord with the present invention;

[0044] FIGS. 20A-B are a flowchart of a "recalc" function of a minimal recalculation engine in accord with the present invention;

[0045] FIG. 21 is a flowchart of a "fuse" function of a minimal recalculation engine in accord with the present invention;

[0046] FIGS. 22A-B are a flowchart of a "link" function of a minimal recalculation engine in accord with the present invention;

[0047] FIG. 23 is a flowchart of a "loop" function of a minimal recalculation engine in accord with the present invention;

[0048] FIG. 24 is a flowchart showing the application of field rules in a parallelized manner, in accord with the present invention; and

[0049] FIG. 25 shows fields of a table and backup fields of a table, used in a checkpointing operation in accord with the present invention.

[0050] In the drawings, like reference characters generally refer to the same parts throughout the different views. The drawings are not necessarily to scale, emphasis instead being placed upon illustrating the principles of the invention.

DETAILED DESCRIPTION OF THE INVENTION

[0051] In brief overview, embodiments of the present invention provide a database system that is suitable for on-line transaction processing (OLTP) and traditional on-line analytical processing (OLAP). Further embodiments can provide real-time OLAP for instantaneous decision support. The architecture of the database system enables updating of OLAP requests in real-time with data received from newly-processed transactions. This capability is realized, in part, through data vectorization and the efficient use of indexing to accelerate calculations and database operations.

[0052] FIG. 1 presents an embodiment of a database system in accord with the present invention. The database system includes a database module 100
in direct or indirect communication with a user interface module 104. The database module 100 provides both traditional database and data warehouse functionality. The database module 100 may also provide tools for viewing and editing metadata associated with the data contained in the database module 100 or elsewhere. The user interface module 104 serves as an access point for privileged and non-privileged users to interact with the database system. Embodiments of the user interface module 104 range from character-driven terminal systems to thin-client graphical user interface (GUI) systems, such as web browsers and other client software applications.

[0053] Depending on the functionality desired from the database system, the system may optionally include one or more of a data generator module 108, an OLAP services module 112, a transaction processor module 116, a batch processor module 120, and an external interface module 124. The data generator module 108 is useful for system diagnostics. For example, the data generator module 108 may generate large, complex data sets to test the scalability of a database system and provide heuristic feedback to the user enabling fine-grained tuning of the application. The OLAP services module 112, transaction processor module 116, and batch processor module 120 interact with the database module 100 to provide support for traditional OLAP, OLTP, and real-time OLAP.

[0054] The external interface module 124 defines a set of application programming interfaces (APIs) for third-parties to use when developing software that interacts with the database system. The database system may also communicate with one or more external data sources 128, including both queued and non-queued data sources. Exemplary queued data sources include those provided by Tibco, MQ Series, and Tandem. Exemplary nonqueued data sources include those provided by DB2, Oracle and SQL Server, and flat data files.

[0055] The components of the database system may be implemented as individual software processes in a single-processor or multi-processor computing environment. Particular components may be especially suited to implementation on specialized hardware, such as a network of workstations (NOW), as discussed in greater detail below. It is understood that these components, when present, may be directly or indirectly connected in various configurations in accord with the teachings of the present invention. Accordingly, the interconnections illustrated in FIG. 1 are exemplary and are not to be viewed as limiting the scope of the invention, as set forth in the claims.

[0056] FIG. 2 illustrates one embodiment of the database module 100. The database module includes a database 200 and a metadata repository 204 in communication with parser 208 and metadata utilities 212. The database 200 stores data, indices, logs, and static metadata for use by the other modules in the system. The metadata repository 204 contains metadata associated with the data in database 200. Typical metadata includes, for example, defined keys, defined rules, and the dependencies between various fields in a pending query (such as sources, targets and mappings). In one embodiment, the metadata repository 208 is stored in an SQL server or other relational database, and can be accessed using standard SQL queries. In other embodiments, the metadata repository 208
is stored in a vectorized database, such as that described below. Processes which provide system functionality can be controlled by variables stored in metadata repository 204 whose values are set by a user via a user interface.

[0057] Database 200 stores the data used by the system, as well as the indices and logs that are generated by the system for its internal use. Like a table in a traditional relational database, each table or "relation" stored in the database 200 is composed of columns and rows. Each row entry in the table contains related data, also called a record, and each column in the table specifies an attribute of the records. Table data is stored column-wise, that is, as vectors with data fields stored linearly as one large block of contiguous data (e.g., in a file). To facilitate vectorization, the length of the columns in the table is uniform, as is the data type of the entries in a particular column.

[0058] This column-wise storage of data has numerous advantages. First, since the data in a column is stored contiguously, it can be read into memory quickly, without seek time associated with reading data scattered across a disk. Second, since each data element in a column is of a fixed size, it is relatively simple, for example, to determine the location of the data associated with a particular record as an offset from the start of the file containing the column. Likewise, it is straightforward to read selected portions of a column from disk into memory because the structure of the data stored on the disk is the structure of the data as it is to be stored in memory. Thus, there is no need to allocate a buffer memory or perform any transformation between reading the data from the disk and storing the data in memory. The column-wise storage of data also facilitates random accesses into the data, since the data stored on disk can be traversed directly without any intermediary loading or transformation into a memory.

[0059] In one embodiment, the system maintains range indices for each key field column and index column stored in the database. For each fixed-size "chunk" of a column (or index), the range indices contain maximum and minimum values of the data in that chunk. This information can be used to increase the efficiency of certain operations, such as table joins, searches, or identifying minimum or maximum values, without requiring significant additional storage relative to the size of a column.

[0060] For example, suppose that a column had 700 million entries, and that the chunk size for the column is 10,000 entries. The column would have a range index with 70,000 entries (which is small compared to the 700 million entries of the column), with each entry storing the minimum and maximum values in that chunk of the column. When a search is performed against the table looking for a particular value in that column, the desired value can first be compared to the minimum and maximum values for each chunk, and chunks for which the desired value does not fall within the range defined by the minimum and maximum values need not be searched, read, or accessed in any way.

[0061] The range indices may be updated as data are inserted, updated, and deleted from the column. When the data are inserted, updated, or deleted during transaction processing, updates to range indices are stored as edits in logs for the range indices. The use of logs to support changes to data and indices is also discussed below.

[0062] To reduce the user effort required to implement vector operations, in one embodiment the database system includes support for a functional programming language. In this embodiment, the system is designed to accept database queries in the language and, using a system of tailored data structures and overloaded operators, translate the queries into efficient vector operations that are suitable for parallel processing. Other embodiments of the present invention may support traditional query languages such as SQL. In another embodiment, the entire system is written in the functional programming language, while in still other embodiments the database system is written in a conventional programming language such as C, C++, or C#.

[0063] Since the language is column oriented, the computations for each column can be specified and performed independently. In contrast, languages like SQL require that complex queries be created to deal with separate aggregations over many columns. For example, the following functional language specification:

1
Declare Maps (ConsolidatedPL.TickerSymbol, Positions.TickerSymbol; ConsolidatedPL. TraderID, Positions.TraderID) ConsolidatedPL.TickerSymbol = Positions.TickerSymbol ConsolidatedPL.TraderID = Positions.TraderID ConsolidatedPL.IndustryName = Securities.IndustryName ConsolidatedPL.MarketValue = Positions.Holding * Securities.MarketPrice ConsolidatedPL.TotalPLb- yTrader = FINDSUMVAL ((Positions.Holding * (Securities.MarketPrice - Positions.PurchasePrice) + Settled.RealizedPL), ConsolidatedPL.TraderID) ConsolidatedPL.TotalPLbyIndustry = FINDSUMVAL ((Positions.Holding * (Securities.MarketPrice - Positions.PurchasePrice) + Settled.RealizedPL), ConsolidatedPL.IndustryName) is equivalent to the following SQL code: WITH PL AS (SELECT P.TraderID AS Trader, PTickerSymbol AS TickerSymbol, S.IndustryName AS Industry, S.MarketPrice * P.Holding AS MarketValue, P.Holding * (S.MarketPrice - P.PurchasePrice) AS UnrealizedGain, T.RealizedGain AS RealizedGain) FROM (Positions.P FULL OUTER JOIN Settled.T ON P.TraderID = T.TraderID AND P.TickerSymbol = T.TickerSymbol) JOIN Securities.S ON P.TickerSymbol = S.TickerSymbol), ByTrader AS (SELECT Trader AS Trader, SUM(UnrealizedGain + RealizedGain) AS TotalPLbyTrader FROM PL GROUP BY Trader), ByIndustry AS (SELECT Industry AS Industry, SUM(UnrealizedGain + RealizedGain) AS TotalPLbyIndustry FROM PL GROUP BY Industry), SELECT PL.Trader AS Trader,ID PL.TickerSymbol AS TickerSymbol, PL.MarketValue AS MarketValue, ByTrader.TotalPLbyTrader AS TotalPLbyTrader, ByIndustry.TotalPLbyIndustry AS TotalPLbyIndustry FROM PL, ByTrader, ByIndustry WHERE PL.Trader = ByTrader.Trader PL.Industry = ByIndustry.Industry

[0064] In one embodiment, the parser 208 accepts queries and transforms the queries into directed acyclic graphs, where each node in the graph represents a data structure and each line in the graph represents an operation on its connected data structures. At compile time, the parser 208 can identify and eliminate circular dependencies to construct the graphs and improved JOIN strategies. In one embodiment, the parser 208 is itself written in the functional programming language.

[0065] The metadata utilities 212 provide a convenient point of access to the metadata contained in the metadata repository 204. The utilities, in various embodiments, provide different functionality. For example, one utility 212.sup.1 exports metadata to computer-aided software engineering (CASE) tools. Another utility 212.sup.2 draws graphs for a user using the metadata in the repository 204 to visually depict the metadata.

[0066] One embodiment of the user interface module 104 is presented in FIG. 3. In this embodiment, the user interface module 104 includes an application server 300 in communication with one or more executable programs, such as OLAP GUI 304, system GUI 308, or web server 312. FIG. 3A illustrates another embodiment of the user interface module 104 where the application server 300 is a gateway server and the gateway server, the OLAP GUI 304 and System GUI 308 are implemented in Java and packaged as Java archive (.JAR) files.

[0067] In operation, an end user interacts with the database system using a web browser (not shown). The browser connects with the web server 312, which provides the web browser with one or more files written in hypertext markup language (HTML). The files typically include a reference to a location storing a program providing the functionality of OLAP GUI module 308 or Modeller GUI 308. The user's computer downloads the program and executes it, permitting the user to manipulate the graphical elements of the program. Manipulating the graphical elements sends messages to the application server 300 which in turn passes them to the database system for processing. The application server 300 may also receive completed results from the database server and provide them to the GUI 304, 308. In one embodiment, the application server 300 also includes configuration data, such as the location of the metadata repository 204.

[0068] FIG. 4 illustrates one embodiment of the data generator 108
including a data generation server 400. In one embodiment, the data generation server 400 generates test data sets based on a database model described in the metadata repository 204. The data generation server 400
may accept inputs specifying a desired range of values and overlap and then generate a data set satisfying the specified values. With a test data set of sufficient size, it is possible to explore the scalability of the database model and perform quality assurance tests.

[0069] FIG. 5 illustrates one embodiment of the external interface module 124, which to facilitate interoperability with third-party software products. In one embodiment, the external interface module includes an open database connectivity (ODBC) services module 500 and/or a common-object model (COM) services module 504. The ODBC module 500
exposes distributed SQL and SQL ODBC interfaces to third-party products such as MICROSOFT EXCEL and MICROSOFT ACCESS. The COM services module 504
exposes a COM interface for use with third-party software products that support COM.

[0070] Distributed Database Queries

[0071] In another embodiment, the ODBC services module 500 communicates with a distributed database system (not shown) for the evaluation of SQL queries over a star schema. The distributed database system may be implemented on a massively-parallel computer or a network of workstations having separate processors and memory. A master process separates the tables in the database 200 into smaller data tables for distribution and storage on individual computer nodes. The master process also reduces database queries into subqueries on subsets of the tables and routes the appropriate subquery to the node having the relevant table subset. The node computes a partial result and returns it to the master process for aggregation and final assembly of the results, e.g., by processing the "group by," "having," and "order by" clauses of the SQL query.

[0072] In one embodiment, the database 200 is organized in a star schema with very large fact tables and relatively smaller lookup tables. The number of fact tables may be restricted to one fact table and, likewise, a restriction may be implemented prohibiting the joining of fact tables for the purpose of allowing specific optimizations. In this embodiment, a lookup table may, for example, be delegated to a computer node for processing.

[0073] The data in the tables in database 200 may be vertically partitioned, i.e., vectorized, facilitating the parallelization of database query processing as described above. For example, many queries result in the examination of all of the rows of a table and a subset of the table's columns when each column represents a field and each row represents an entry in the table. The data in the fact tables may also be horizontally partitioned into sizes that are appropriate to the processing power of the individual computer nodes. In one embodiment, a user may manually specify a minimum or maximum size for a partitioned table, or specify a load factor that expresses the computational capability, and therefore the appropriate chunk size, of the computing node relative to the other nodes in the network.

[0074] In one embodiment where the distributed database system is implemented on a network of workstations, the system includes three sets of functionality: a mechanism for controlling the processes executing on individual workstations in the network, a mechanism for distributing the data contained in fact tables to individual workstations, and a query engine.

[0075] The processes on the individual workstations are controlled through a universal interface. Each workstation in the network runs a process that can identify the available disk space or CPU time on the workstation, load new or replacement software, and start or stop individual software programs. In a further embodiment, the process is fault-tolerant such that in the event of a system failure or restart, the process itself automatically restarts and resumes the execution of the software that was active prior to the failure.

[0076] The distribution of data is controlled by one or more data distribution servers. The data distribution server listens for updates from a data source--e.g., a streaming source, packages the updates into a group, selects a target workstation to receive the update, publishes the update, waits for a response, and selects another workstation if the first one fails. Each workstation executes a data server that listens for updates published by the data distribution servers; the data servers make updates to the local, partial copies of the fact tables. In another embodiment, the data distribution server publishes updates to one or more backup locations and tracks where updates have been distributed.

[0077] Queries to the distributed database system are handled by a query master process that may be replicated on one or more workstations to achieve fault tolerance. The query master breaks a query into subqueries, identifies the appropriate target workstations to execute the subqueries, transmits the subqueries to the query server processes on these machines, waits for responses from the query servers, and assembles the partial results into full results. The query servers upon receiving the subqueries, pass the subqueries to query engine processes that perform the actual execution of the subqueries. In one embodiment, where each workstation is a multiprocessor computer, the query engine process is adapted to utilize one or more of the processors in the workstation. The query engines provide their results to the query server, which forwards the results to the query master. In some cases, the query master may assemble the results by a simple union of the partial results, while in other cases, the results may be assembled by combining the subqueries.

[0078] In another embodiment, the distributed database system also provides a interface to the user for controlling the distribution of software to various workstations in the network, and for starting and stopping the execution of software on those machines.

[0079] Referring to FIG. 6, in one embodiment the OLAP services module 112
includes an OLAP server 600 in communication with a data mining module 604 and a navigator module 608. The data mining module 604 supports complex queries intended to uncover information hidden in the data. The navigator module 608 permits a user to perform a "drill-down" inspection on the data contained in database 200.

[0080] Referring to FIG. 7, in one embodiment the batch processor module 120 includes a batch manager 700 in communication with a batch metadata extractor 704, a heterogeneous transaction manager propagator 708, one or more source manager(s), generically 712, and one or more target manager(s), generically 716.

[0081] The source managers 712 receive data from one or more sources and converts the data to an internal database format called a delta or d-table, also discussed below. In one embodiment, each source manager 712
is adapted to receive data from a particular data source.

[0082] The batch metadata extractor 704 retrieves computational rules from the metadata repository 204 for processing by the batch manager 704. The batch manager receives the rules and the delta tables and applies them to construct relations of derived values called gammas or g-tables in a heterogeneous, vectorized format.

[0083] Transaction Processing Subsystem

[0084] Referring to FIG. 8, the structure of an embodiment of a transaction processing subsystem in accordance with an embodiment of the present invention is described. Transaction processing subsystem 800
includes various processes including the transaction manager 802, which is generally the first of these processes to be started when transaction processing begins. Transaction manager 802 is responsible for starting all other required processes in transaction subsystem 800, for managing the state of transaction processing subsystem 800 (described below), and for assigning work to the individual processes of the transaction subsystem.

[0085] Adapter manager 806 receives information on transaction sources from transaction manager 802, and sets up an adapter listener 808 for each such source. When adapter listeners 808 are started, adapter manager 806 informs transaction manager 802 that the adapter listeners 808 are ready.

[0086] Each adapter listener 808 receives data from a queued transaction data source, such as Tibco, MQ Series, Tandem, Aleri IPC, or other data sources, databases, or systems that are capable of providing transaction information. Each adapter listener receives data from one or more of these sources, converts the data into the form that is used to represent transactions within transaction subsystem 800, forms blocks of the transaction data, and forwards the blocks of transaction data to a resource manager 804. In various embodiments, it is possible for a listener 808 to receive data from multiple sources, and for multiple listeners 808 to be use to receive data from multiple sources.

[0087] The size of the blocks of transaction data that are sent from an adapter listener 808 to a resource manager 804 can be an adjustable parameter that is "tuned" according to hardware and software configuration. An adapter listener 808 may collect transactions until a particular predetermined number of transactions are collected, and then send a block of transaction data to resource manager 804 and/or an adapter listener 808 may send all the transactions that are collected during a specified period of time. For example, an adapter listener could be set to send transactions when it has collected 200 transactions or when one second has elapsed since the last time that is sent a block of transactions, whichever occurs first.

[0088] Each resource manager 804 is responsible for collecting transactions and for logging collected transaction data for one source table of "transaction" type. Transaction manager 802 starts one resource manager 804 for each such transaction source table. Generally, each resource manager receives blocks of transactions from an adapter listener 808, synchronously stores the transactions to disk, and sends notification that the blocks have been stored to transaction manager 802, along with information on where the blocks of transactions can be retrieved for further processing. Once this information has been sent, resource manager 804 can continue its operation in a non-blocking manner.

[0089] Each minimal recalculation engine 810 is responsible for taking a set of transactions, typically in the form of edits to tables, and "running" at least a portion of a model on the edits. The transaction manager 802 may start one or more minimal recalculation engines 810 to run simultaneously. The number of minimal recalculation engines 810 that may be started at once typically is a configurable parameter of transaction manager 802. Transaction manager 802 manages these minimal recalculation engines 810 so that calculation of a model is parallelized at the table level. Each minimal recalculation engine 810 operates on indices and logs to recalculate portions of a model that are affected by the transactions, rather than recalculating the entire model. Each minimal recalculation engine 810 notifies the transaction manager 802
when it is finished with intermediate results, so that the transaction manager 802 may determine which portions of the model can be computed given the available sources, intermediate results, and system resources, and send those portions to one or more minimal recalculation engines 810.

[0090] Internal checkpoint manager 812 is responsible for the non-blocking two-phase auto-commit type check point functionality required by minimal recalculation engines 810. Typically, an internal checkpoint manager 812
and one or more slave processes are started by transaction manager 802
when transaction subsystem 800 is started.

[0091] External checkpoint manager 814 is responsible for publishing changes to selected targets, and to other applications and processes. Typically, one external checkpoint manager 814 is started by transaction manager 802 when transaction subsystem 800 is started.

[0092] System Operation

[0093] In operation, database transactions may result in the vertical partitioning of individual columns from the database 200 into vector data structures that are themselves stored as individual files in the database 200. These vectors may be further subdivided and farmed out to individual processors or workstations for processing. Vectorization facilitates parallel processing because typically the same operation will be performed on every data entry in the vector. Moreover, a vectorized or subdivided data set is more likely to fit in the high-speed cache memory of a processor, reducing the need for external memory accesses, and more likely to fit in the main memory of a computer system, reducing the need for disk accesses that can slow down processing.

[0094] Batch Processor Operation

[0095] Referring to FIG. 9, using the System GUI, a traditional text-based interface, or other equivalent interfaces, a user may specify a data model and a set of computational rules accompanying the model (Step 900). The model sets out such elements as source tables, target tables, table keys, mappings between tables, and rules. The system uses this specification to construct a series of operations, e.g., joins and aggregations, that can be used to transform the input tables into the target model (Step 904). Having established the set of operations, the system executes them and provides the results to the user. In certain embodiments, the results are updated in real-time with the results of newly-processed transactions (Step 908).

[0096] The system associates every table with a unique table identifier, i.e., an index such as a numerical value, by automatically constructing an analyzing a graph structure. Similarly, each field in each table is associated with a unique field identifier, i.e., a second index.

[0097] The relations that serve as inputs are referred to as d-tables or deltas. A d-table typically includes one or more fields derived from an external source, such as a Sybase table, a text file, or a vectorized table native to the system. In some embodiments, one or more of the fields in the d-table may be computed fields.

[0098] Intermediary, derived tables are referred to as g-tables or gammas. The fields in a g-table are derived from the rules and one or more tables that the user has previously specified. The input tables I that form a particular g-table G, i.e., the set of relations I(G), are the tables whose fields are referenced in the computational rules which define G. For example, if G has one field which is computed from fields contained in tables E and F, then it is true that I(G)={E, F}. The combination of d-tables, g-tables, and tables derived from g-tables automatically define an internal graph structure.

[0099] Every table T has one or more key fields K(T) that uniquely identify the records in T. For example, assume a table of employees T1:

2
SSN Name Department Salary 100-01-0000 Bob Sales 100
020-02-9999 Bob Development 200
007-87-1523 Ken Marketing 150
976-81-1829 Joan Sales 300
172-67-9163 Brenda Development 80
658-17-8743 Dennis Development 300

[0100] In this example, one possible K(T1) is the employee's social security number. Similarly, assuming that no person works for more than one department, a second possible K(T1) is the pair of columns (Name, Department). The database system maintains a key table in database 200
containing the keys for all of the tables used in the graph.

[0101] The input tables I(G) for a particular gamma are related by a mappings between the individual input tables. Referring to the employee table T1, assume a table T2 identifying the managers of the departments specified in T1:

3
Department Name Sales Joan Development Dennis

[0102] and a table identifying the locations of each department:

4
Department City Sales Paris Development Santa Fe Research Sausalito

[0103] The user may define a mapping: Map(Employee.Department, Manager.Department). This indicates that the entries in these particular columns are drawn from the same set of data. The Map operator is reflexive, i.e., Map(Manager.Department, Employee.Department) is equivalent to Map(Employee.Department, Manager.Department). The Map operator is also transitive, in that defining another Map(Employee.Department, Location.Department) permits the system to infer that Map(Manager.Department, Location.Department). Formally speaking, a mapping M exists between two fields F and G if there is a path H[1], . . . , H[n] with F=H[1], H[n]=G, and either n=1 and F=G, or n>1 and for each i between 1 and (n-1), either Map(H[i], H[i+1]) or Map(H[i+1], H[i]) is a mapping declared by the user.

[0104] As mentioned above, the user may also specify computational rules to, e.g., create column entries in a gamma. For example, to derive a g-table T3 containing each employees' manager, a user may define Map(EmployeeManager.SSN, Employee.SSN), indicating that the entries in the SSN column in the EmployeeManager relation must be drawn from the entries in the SSN column of the Employee relation. Then, the user could specify computational rules, such as:

[0105] EmployeeManager.SSN :=Employee.SSN

[0106] EmployeeManager.Name :=Employee.Name

[0107] EmployeeManager.ManagerName :=Manager.Name

[0108] When these mappings and rules are compiled, the system attempts to use various types of table joins (e.g. Fuse, Link, and Loop joins, described below) to combine tables as necessary to enable the calculation of the specified rules. For example, the result of the example rules specified above would be:

5
SSN Name ManagerName 100-01-0000 Bob Joan 020-02-9999 Bob Dennis 007-87-1523 Ken <null> 976-81-1829 Joan Joan 172-67-9163 Brenda Dennis 658-17-8743
Dennis Dennis

[0109] The system stores a table map in database 200 that contains all of the mappings between all of the tables in the graph and has the structure:

6
t u M(t,u) M(u,t) - - ------- ------- . . . . . . . . . . . . T U f, . . ., g h, . . . , i . . . . . . . . . . . .

[0110] Referring to FIG. 10, the processing of each g-table G occurs in six steps. First, all of the input tables I(G) are combined into a single table S(G) (Step 1000). Next, S(G) is exploded into a single table ES(G) (Step 1004). Duplicate fields in ES(G) are discarded to form table UES(G) and a synonym table Z(G) is constructed from ES(G) (Step 1008). UES(G) is partitioned into PUES(G) (Step 1012), and G is constructed by applying the computational rules to PUES(G) and Z(G) (Step 1016). Lastly, the selection proposition is applied to G (Step 1020).

[0111] Given the input tables I(G), the step of combining I(G) into S(G) (Step 1000) can be achieved by first applying fuse, link, and loop operations according to the algorithm described below.

[0112] Having completed the combination step (Step 1000), the resulting table S(G) contains all of the fields in the input tables I(G), which may result include one or more duplicate fields. Duplicate fields will be duplicated when the S(G) is Exploded (Step 1004), resulting in an exploded table ES(G) containing fields with identical values.

[0113] The Explode operator replicates the rows of a table to accommodate the rearrangement of the rows of a subset of the table fields into a single column. Typically, the Explode operator includes the following steps:

[0114] a. The columns on which explode is done get eliminated.

[0115] b. The rows get replicated as many times as the number of columns on which the explosion is done.

[0116] c. Two new Columns get added to the result: "Field Name", "Field Value". (The user may choose different column names).

[0117] d. The "Field Name" columns become the additional field of the key.

[0118] All the values in the columns on which the Explode is done are being put into the "Field Value" column so that the values in the "Field Value" columns correspond the respective column names in the "Field Name" column, based on which columns the values are taken from. (During this operation all the values, which go into the "Field Value" column get converted to character, if they are of different data type, otherwise the datatype is preserved). The following example illustrates the Explode operator (key fields are marked by "*"):

7
K1* K2* F1 F2 F3 F4
A 1 New York 10
10.sup.th Avenue 555-55-55
A 2 Chicago 20 Shore Drive 222-56-78
A 3 New York 30 5.sup.th Avenue 555-12-12
B 1 Paris 10 Place De La Concorde 123-45-67
B 2 London 11 Trafalgar Square 999-99-99
B 3 Moscow 12 Red Square 444-44-44
C 0 St. Petersburg 13
Palace Square 111-11-11

[0119] Assume that the table is exploded on the fields F2, F3, and F4, producing the following result:

8
K1* K1* Field Name* Field Vlaue F1
A 1 F2
10 New York A 1 F3 10.sup.th Avenue New York A 1 F4
555-55-55 New York A 2 F2 20 Chicago A 2 F3 Shore Drive Chicago A 2 F4 222-56-78 Chicago A 3 F2 30 New York A 3 F3 5.sup.th Avenue New York A 3 F4 555-12-12 New York B 1 F2 10 Paris B 1 F3 Place De La Concorde Paris B 1 F4
123-45-67 Paris B 2 F2 11 London B 2 F3 Trafalgar Square London B 2 F4 999-99-99 London B 3 F2 12 Moscow B 3
F3 Red Square Moscow B 3 F4 444-44-44 Moscow C 0 F2 13 St. Petersburg C 0 F3 Palace Square St. Petersburg C 0 F4
111-11-11 St. Petersburg

[0120] The Implode operator is the opposite of the Explode operator. The same example above could be used to illustrate an Implode, except the source of Implode would be the second table. The Implode operator includes the following steps:

[0121] a. One key column is chosen as the fields names source.

[0122] b. One column is chosen as a fields values source.

[0123] c. As a result of elimination of the two columns, the remaining table should have rows duplicated, as many times as the number of unique values in the first column, otherwise implosion is impossible. Producing a set of distinct rows eliminates the duplication.

[0124] d. The new key is the previous key without column in (1).

[0125] e. As many columns as the number of distinct values in the column in (1) gets added to the result of (3), the names of the columns being the values of the column in (1).

[0126] f. The values of the columns are populated from the column in (2), so that the correspondence between the columns names and the values is the same as the correspondence between the values in the columns (1) and (2), row-wise, and the correspondence between the remaining key and the values in the column (2) is preserved.

[0127] The next step in the processing of the g-table is to construct a dictionary Z(G) and a reduced table UES(G) (Step 1008). Equivalence classes of fields are formed, for example, (f, . . . ,g), (h, . . . ,i), etc. The first field in the group--here f or h--is selected to represent all of the fields in the group. The remaining fields in each group are deleted from ES(G), leaving only the representative fields in the result table UES(G).

[0128] The dictionary Z(G) is constructed with variables whose names match the original fields of ES(G): Z.f:'f, . . . , Z.g:'f, Z.h:'h, . . . , Z.i:'h, etc. The value of Z.x is the name of the representative field of the group to which X belongs. The dictionary Z(G) improves performance by reducing the partitioning of multiple copies of the same data and is used for logical reasons to ensure that the name of every field in every input table I(G) is recognized as it appears in the rules for G.

[0129] After construction of the dictionary, the reduced table is partitioned (Step 1012) into relation PUES(G). Using the table map for I(G) and G, one can determine whether K(G) is also a key for UES(G). If K(G) is not a key for UES(G), we can select the partitioning to force K(G) to be a key for UES(G) by partitioning UES(G) on K(G), converting the remaining fields of UES(G) into lists of vectors.

[0130] The next step in table processing defines each field in G by evaluating the rule defining that field over PUES(G) (Step 1016). Since PUES(G) is a partitioned table, the execution of the rules results in aggregated fields, i.e., vector fields, and not fields which are lists of vectors. In one embodiment, if the execution of a rule results in an atomic value, then that value is reshaped to a vector with a length equal to the cardinality of G whose entries are the atomic value, repeated. In another embodiment, if the execution of a rule results in a value which is a list of vectors, then a run-time "depth" error is signalled. In yet another embodiment, this latter error is detected at compile time by a semantic analyzer contained in system GUI 308.

[0131] First, the rules for K(G) (the keys of G) are evaluated. If the resulting table contains empty records, then these empty records are removed from both G and PUES(G). Then the rules for the remaining tables are ordered based on dependency relations among the fields they define. For example, if a value G.gross is defined to be G.price*G.quantity, then G.quantity and G.price are defined before G.gross is computed. Dependency cycles in the rules are not allowed. Lastly, the selection proposition Y is applied to G (Step 1020), keeping only those records in G which satisfy Y. The result is saved in the database 200, with each file corresponding to an individual vector, and can be used as input to additional g-tables.

[0132] FIG. 11 is a flowchart illustrating the algorithm for computing the underlying expressions representing the combined relation specified by the user. First, the source relations are determined (Step 1100). Next, a series of steps is determined for computing a relationship between tables (Step 1104). Lastly, an algorithm is iteratively applied to the source relations to build an expression using the fuse, link, and loop operators (Step 1108).

[0133] Given a table computed by rules, the source relations can be determined (Step 1100) by examining the rules and identifying all of columns used by the rules. For example, in a rule using the dotted identifier Employee.SSN the column is "SSN" and "Employee" is the relation. More precisely, suppose the rules for a particular derived relation T have the form:

[0134] T.f1 :=<expression 1>

[0135] T.fn :=<expression n>

[0136] and the fields occurring in <expression 1>, . . . , <expression n> are S1.g1, . . . , Sm.gm. Then, the collection of source relations is the set of distinct identifiers in the collection S1, . . . , Sm.

[0137] Having identified the source relations, it is now possible to deduce, for any pair of relations, one of four possible relationships: "1to1", "1toM", "Mto1", and "MtoM" denoting, respectively, one-to-one, one-to-many, many-to-one, and many-to-many relationships (Step 1104). First a collection of sets is initialized. Given the complete set of mappings between all relations (not only source relations):

[0138] M={Map (s1.f1, t1.g1), . . . , Map(sn.fn, tn.gn)}

[0139] It is possible to identify the set of sets of column E using the following algorithm. First, initialize E to be the set {{s.f} .vertline.s.f is a column in one of the tables}. Thus, E is a set of singleton sets, each having one field. Also, set N to M.

[0140] Next, assume that the first element in N is Map(e1, e2), where e1
and e2 are columns. Then, let K1 be the element of E such that e1 is an element of K1; and K2 be the element of E such that e2 is an element of K2. Set E=E-{K1, K2} union {K1 union K2} and N=N-{Map(e1,e2)}. Afterwards, if N is empty, then stop the process. Otherwise, process the next element in N.

[0141] As an example, consider the previously-defined Employee, Manager, and Location relations with the mappings:

[0142] Map (Employee. Department, Manager. Department)

[0143] Map (Manager. Department, Location. Department)

[0144] To calculate the collection E of sets, E is initialized to:

[0145] E={{Employee.SSN}, {Employee.Name}, {Employee.Department}, {Employee.Salary}, {Manager.Department}, {Manager.Name}, {Location.Department}, {Location.City}}

[0146] In light of the first Map, E becomes:

[0147] E={{Employee.SSN}, {Employee.Name}, {Employee.Department, Manager.Department}, {Employee.Salary}, {Manager.Name}, {Location.Department}, {Location.City}}

[0148] and after the second Map, E becomes

[0149] E={{Employee.SSN}, {Employee.Name}, {Employee.Department, Manager.Department, Location.Department}, {Employee.Salary}, {Manager.Name}, {Location.City}}

[0150] Thus, the algorithm forces Employee.Department and Location.Department to be related because they end up in the same set or equivalence class.

[0151] A fuse is a one-to-one join, also called a "full outer join" in relational databases. The operation in expressions is "Fuse". For instance, the expression:

[0152] Fuse (Manager, Location)

[0153] represents the full outer join of the Manager and Location relations, where the join is based on the key columns Manager.Department and Location.Department. Given the data above, this relation is:

9
Department Name City Sales Joan Paris Development Dennis Santa Fe Research <null> Sausalito

[0154] There may be more than two relations in a Fuse. The key field of a Fuse is the key field of any of the relations; by convention, we set it to be the key field of the last relation in the Fuse expression. Thus, the key field for the Fuse above is the Location.Department column.

[0155] A link is a one-to-many join. The operation in expressions is "Link". There may be more than one relation that is linked "one" to the "many" relation. The "many" relation is called the driver relation, and appears last in the Link expression. For instance, in the expression:

[0156] Link (Manager, Employee)

[0157] the driver relation is Employee, which is in a many-to-one relation with Manager (each row in Manager matches possibly many rows in Employee, using the key columns). The relation computed by this link is, given the above data:

10
SSN Name Salary Manager.Name 100-01-0000 Bob 100 Joan 020-02-9999 Bob 200 Dennis 007-87-1523 Ken 150 <null> 976-81-1829 Joan 300 Joan 172-67-9163 Brenda 80 Dennis 658-17-8743 Dennis 300 Dennis

[0158] Similarly, the expression:

[0159] Link (Manager, Location, Employee)

[0160] specifies Employee as the driver relation, and returns a relation with one more column (the City column). Another similar expression is:

[0161] Link (Fuse (Manager, Location), Employee)

[0162] where again Employee is the driver relation. The key field of a Link is the key field of the driver relation.

[0163] Finally, a loop is a many-to-many join. The operation in expressions is "Loop". Loop expressions involve only two relations, e.g., Loop (Relation1, Relation2). The key field of a Loop is the union of the key fields of the two constituent relations, with duplicate mapped keys removed from this set.

[0164] The subroutine for deducing the relationship between S and R, for any relations S and R, uses this set E={E[1], . . . , E[m]} using the following steps. First, the key fields of S and R are identified. For example, assume that they are either declared or computed to be S.a1, . . . ,S.am and R.b1, . . . ,R.bp. Using these key fields, determine the sets K and L such that:

[0165] K={u.vertline.if S.Bv is in E[u] for some v and u, then some column of R is also in E[u]}

[0166] L={u.vertline.if R.bv is in E[u] for some v and u, then some column of S is also in E[u]}

[0167] After this computation, if K=L, then the relation between the source relations is "1to1," i.e., a one-to-one relationship. If L is a proper subset of K, then the relation is a "1toM", i.e., a one-to-many, relationship. If K is a proper subset of L, then the relation is a "Mto1", i.e., a many-to-one relationship. If K and L are incomparable as sets, then the relation is a "MtoM", i.e., a many-to-many relationship.

[0168] Again, for example, consider the set E constructed above and assume that it is desirable to deduce the relationship between the relations Employee and Manager. The key field of Employee is Employee.SSN and the key field of Manager is Manager.Department. Following the process outlined above, Employee.SSN is in the first set of E and Manager.Department is in the third set of E. Thus K={ } and L={3} and K is a proper subset of L, so there is a "Mto1" relationship between Employee and Manager.

[0169] For another example, assume that it is desired to deduce the relationship between Manager and Location. The key fields are Manager.Department and Location.Department. Thus, K={3} and L={3}, so there is a "1to1" relationship between Manager and Location.

[0170] Once the relations are deduced (Step 1104), an iterative algorithm is applied to the source relations to build an expression using the fuse, link, and loop operators (Step 1108). The algorithm is illustrated in FIG. 12 and begins with a set C={S1, . . . , Sm} of distinct identifiers (Step 1200). Without loss of generality, the identifiers can be assumed to be distinct since they are contained in a set.

[0171] First, assuming that C={T1, . . . , Tk}, then set D=C and E to be the empty set. Defining the first element in D to be V, then solve for all relations in D.times.{V} that are "1to1" related to V, using the algorithm in Step 1104 (Step 1204). If there are no such relations, set E=E union {V} and D=D.times.{V}. Assuming that there are "1to1" relations U1, . . . , Ur, then set E=E union {Fuse(U1, . . . , Ur, V)} and D=D-{U1, . . . , Ur, V}. If D is not the empty set, solve for the "1to1" relations for the next entry in D, otherwise set C to E, and proceed to the next step.

[0172] Again assuming that C={T1, . . . , Tk}, then set D=C and E to the empty set. Define the first element in D to be V. Now, find all relations in D-{V} that are "1toM" related to V, using the algorithm of Step 1104
(Step 1208). If there are no such relations, set E=E union {V} and D=D-{V}. For each "1toM" relationship U1, . . . , Ur, set D=D-{U1, . . . , Ur, V} union {Link(U1, . . . , Ur, V)} union E and E to the empty set. If D is not the empty set, then solve for the "1 toM" relations for the next element in V, otherwise set C to E and proceed to the next step.

[0173] Again assuming that C={T1, . . . , Tk}, then set D=C and E to the empty set. Define the first element in D to be V. Now, find all relations in D-{V} that are "MtoM" related to V, using the algorithm of Step 1104
(Step 1212). If there are no such relations, set E=E union {V} and D=D-{V}. For each "MtoM" relationship U1, . . . , Ur, set D=D-{U, V} union {Loop(U, V)} union E and E to the empty set. If D is not the empty set, then solve for the "MtoM" relations for the next element in V, otherwise set C to E and proceed to the next step.

[0174] If, after solving for the "1to1", "1toM", and "MtoM" relations, the set C has not changed or has only one entry, then stop, otherwise the algorithm repeats (Step 1216). If at the end C has more than one element, an error of "not enough mappings" is returned as the result; otherwise, the one element of C is returned.

[0175] Other embodiments of this algorithm detect anomalous or erroneous situations. For example, in one embodiment if the user defines a Map between a non-key in one relation and a non-key in another relation, an "exaggerated mapping error" is reported. In another embodiment, if the relationship between the combined relation and the target relation is "MtoM" or "1toM", as computed using the algorithm in Step 1104, then an error is reported. In still another embodiment, if the relationship between the combined relation and the target relation is "Mto1", then the Modeler sets the aggregation flag for the target relation. Also, if any of the rules for the key columns in the target relation are not simple field expressions-that is, the rules do not have the form

[0176] t.f :=s.g

[0177] where t is the target relation, f is the column in the target relation, s is the source relation, and g is the column in the source relation, the Modeler sets the aggregation flag for the target relation. This kind of rule is called a "key transformation."

[0178] The aggregation flag implies further checking. If the aggregation flag is set, the computational rules for the target relation are checked. These rules must have enough "aggregation" on non-key fields in order to ensure that the target relation can be constructed.

[0179] Operation of Transaction Processing

[0180] The rules just described can be used throughout the system, and are particularly useful in enabling transaction processing. By predetermining the relationships between the tables desired for use in queries, for example, the transaction processing subsystem can calculate the table changes required by transactions as transactions come into the system, even if the changes are to be reflected in the arbitrarily complex table relationships. This is demonstrated by the operation of transaction processing.

[0181] During transaction processing, a series of transactions affect the source tables (also called d-tables or delta tables), making changes to the data in those tables. When propagated through the model, these changes to the source tables may cause changes in the target tables. The database system of the present invention is capable of quickly handling most transactions, so that analysis based on a combination of historical data combined with incoming transactions is possible.

[0182] As discussed hereinabove, in operation, one or more adapter listeners 808 receive incoming transactions from a variety of sources. These transactions are collected into batches, and sent on to a resource manager 804, which stores the batches of transactions, and sends the batches of transactions on to transaction manager 802.

[0183] Transaction manager 802 implements a state machine that keeps track of the batches of transactions that are received, groups the batches of transactions for processing, and sends the batches of transactions to one or more minimal recalculation engines 810 for processing. The status of the processing of batches of transactions is tracked, so transaction manager 802 can determine when to send further batches of transactions to minimal recalculation engine(s) 810, and so that transaction processing can be resumed if an interruption occurs.

[0184] Referring now to FIGS. 13A-13B, an example of the state information kept by transaction manager 802 is described. Each row in state table 1302 includes a batch ID field 1304, that contains a unique batch ID of each batch of transactions to be processed, a NumRecs field 1306, that contains the number of transaction records contained in a particular batch of transactions, a state field 1308, that keeps track of the state of each batch of transactions, and a group ID field 1310, that keeps track of which group of batches of transactions are to be processed at once. In addition to these fields one embodiment of the present invention includes timing fields (not shown) with the state information. The timing fields may include fields for each group of batches of transactions that track the amount elapsed time since the group was formed, the amount of time spent calculating, and the amount of time spent checkpointing. This timing information may optionally be used for fine-grain tuning of the system, and for optimization heuristics on directed graphs.

[0185] An entry in state field 1308 may contain the value "done", which indicates that the batch of transactions has been processed, "proc", which indicates that the batch of transactions is being processed, "wait", which indicates that the batch of transactions is waiting to be processed, "checkpoint1", which indicates that the batch of transactions is in a phase of checkpointing in which a rename operation is occurring, or "checkpoint2", which indicates that the batch of transactions is in a non-blocking phase of checkpointing, in which logs are applied to a backup or "shadow" copy of the tables and indices.

[0186] The checkpointing will be described in detail hereinbelow, with reference to FIG. 25. During the phase of checkpointing during which the rename operation is being performed ("checkpoint"), the minimal recalculation engine(s) 810 are unable to start processing a new group of batches of transactions. When a group of batches of transactions reaches the "checkpoint2" state, the minimal recalculation engine(s) 810 can start processing a new group of batches of transactions.

[0187] In the example state table shown in FIG. 13A, the transaction batches with IDs 1 and 2 are "done", the transactions batches with IDs 3, 4 and 5 are in the "checkpoint1" state, and the transaction batches with IDs 6, 7 and 8 are waiting to be processed. This state information is used by transaction manager 802 during its operation, and may assist in recovering the state of transaction manager 802 if transaction processing is interrupted.

[0188] The entries in group ID field 1310 are used by transaction processor 802 to identify groups of batches to be processed at once. In this example, because the group ID is 2 for the batches of transactions with batch IDs 3, 4, and 5, these batches of transactions were sent to a minimal recalculation engine 810 as one large group of transactions.

[0189] In FIG. 13B, state table 1302 is shown a short time later. Now, the transactions batches with IDs 3, 4 and 5 are in the "checkpoint2" state, which is non-blocking. This permits transaction manager 802 to assign minimal recalculation engine(s) 810 to process the next group of batches of transactions. Accordingly, the batches with batch IDs 6, 7 and 8, all of which are processed as a group (with group ID 3), are now being processed, and have value of "proc" in state field 1308.

[0190] Referring to FIG. 14, an example dependency graph of tables is shown. It will be understood that this example graph is an illustrative example, and that the actual dependency graphs handled by the system generally are more complex.

[0191] In the example, table T3 1402 is dependent on table T1 1404. Table T17 1406 is dependent on table T2 1408. Table T22 1410 is dependent on table T3 1402, and T19 1412 is dependent on tables T3 1402 and T17 1406. Due to these dependencies, transaction manager 802 can assign minimal recalculation engines to work on transactions affecting tables T1 1404
and T2 1408 in parallel, since neither depends on the other. No other tables may be computed until at least one of these tables has finished, even if more than two minimal recalculation engines 810 are available.

[0192] Once T1 1404 is finished, T3 1402 can be started in parallel with other minimal recalculation tasks that are being handled. T19 1412 cannot be started until both T3 1402 and T17 1406 have completed. If there are more tables ready to be processed than there are minimal recalculation engines 810 available to process them, the tables that are ready may be assigned arbitrarily to minimal recalculation engines 810. Alternatively, techniques such as use of weighted graphs can be used to place an order on the assignment of tables that are ready to minimal recalculation engines 810, to provide greater throughput.

[0193] The dependencies which are used to order and parallelize computation are available in the metadata, and are based on the mappings that were submitted by the user. Such parallelization, based on the availability of needed data (i.e. data flow) is also possible at the field level.

[0194] When the changes to a target table have been determined my minimal recalculation engine(s) 810, those changes may be applied in a non-blocking manner by internal checkpoint manager 812, and published to applications that "subscribe" to the target tables by external checkpoint manager 814.

[0195] Operation of the Minimal Recalculation Engine

[0196] Minimal recalculation engine 810 handles recalculation of target tables that are specified in the model when transactions change the source tables on which the target tables depend. The minimal recalculation engine make use of the table rules that determine the manner in which source tables are joined to produce target table, and the application of field rules that compute the values of fields based on the values of other fields.

[0197] As described above, relations between tables can be one-to-one, one-to-many, or many-to-many. Each of these relations may be represented by creating a combined table from tables in which the related fields are located. The operation used in a table rule to combine tables depends on the nature of the relation. For example, for a one-to-one relation, a fuse operation (also called a full outer join) is used to combine tables. For a one-to-many relation, a link operation (also called a left or right outer join) is used, while for a many-to-many relation a loop operation is used.

[0198] The combined table discussed above could be virtual (i.e., represented through sets of indices showing which row in each table involved in a table rule contributes to which row of the combined table). There is one index per table, the length of which is the number of rows in the combined table. Each entry in the index corresponds to a row of the combined table, and contains the number of the row of the table it represents that contributed to that row of the combined table. Alternatively, if the memory model permits, the combined table may be physically produced.

[0199] As transactions enter the system, the transactions will affect the values in tables, which in turn, will affect the values in, and possibly the dimensions of, other tables that are defined in the model to depend from the modified tables, for example as the result of fuse, link, and loop operations or other functions. The tables that are subject to these operations may be very large. In addition, models of even moderate complexity may involve application of table rules that call for numerous fuse, link, and loop operations to be applied to join the various source tables into a target table. If each transaction that alters a source table resulted in complete recalculation of all tables that are defined in the model to depend on tables modified by the transaction, it would not be possible to handle transactions very quickly. To provide the rapid updates that are required for decision support based on incoming transactions, it may be necessary, for example, to handle hundreds or thousands of transactions per second.

[0200] Minimal recalculation engine 810, performs only a relatively small number of calculations for each received transaction. Minimal recalculation engine 810 uses indices, and logs of changes to the indices to propagate transaction changes made to a "source" table using a complex table rule in a model without performing a complete recalculation. For each transaction that changes a source table, a relatively small number of changes may be needed to indices that represent combined tables resulting from fuse, link, and loop operations called for in the model. Minimal recalculation engine 810 computes the changes to the indices and to the combined table (rather than recomputing the entire tables), and creates logs of these changes, which may be applied (asynchronously) by the checkpoint managers 812 of transaction processing subsystem 800.

[0201] As is shown in the following examples, the fuse, link, and loop operations of minimal recalculation engine 810 create logs of changes to indices that represent the combined tables, rather than recreating the actual combined tables. This reduces the required computation, since it is more efficient to add an entry to a log of an index than to build an entire table, particularly when the tables contain a large number (e.g. millions) of entries. In addition, since application of a rule in a model may not require the entire combined table in order to generate the changes to its target that result from a transaction changing a source table, the minimal recalculation engine can use only the necessary portions of a source or combined table that are needed to compute changes to target tables.

[0202] In the following examples, transactions will generally be referred to as "edits" to a source table. The edits are found in a log for the source table, and can indicate that inserts, updates, or deletes should be applied to the source table. As described above, in one embodiment, the edits are reviewed by a listener and provided to the minimal recalculation engine 810 by the resource manager 804. In general, an edit comprises an operation (e.g. insert, update, or delete), an index, which indicates the position (counting from 0) in the table or index at which the edit is applied, and zero or more data fields, indicating the data that is to be inserted or updated at the position indicated by the index. An insert edit (denoted with an "i" in logs shown in examples) indicates that an item should be inserted into a table at a position specified by the index. An update edit (denoted with a "u" in the logs shown in the examples) indicates that one or more values of the data fields of a row indicated by the index are to be changed. A delete edit (denoted with a "d" in the logs shown in the examples) indicates that the row specified in the index is to be deleted. In response to these edits to a source table, minimal recalculation engine 810 generates edits to the indices associated with the source tables, and the combined tables.

[0203] In one embodiment, delete edits are transformed into update edits. A "physical" delete, that actually removes the row from the table, is transformed into two updates. The first update moves the data fields of the last row of the table to the position of the row that is being deleted. The second edit changes the data fields of the last row of the table to "null" values. Such null rows can easily be truncated from the end of the table. This approach to deleting rows ensures that null rows from deletes accumulate at the bottoms of all the tables and indices, permitting a very inexpensive truncation operation to be used to remove the null entries at the ends of the tables. Moving the last row of the table into the position of the deleted entry prevents the system from having to perform the potentially expensive operation of moving all rows below the deleted entry up by one row.

[0204] A second type of delete, referred to as a "logical" delete may also be used. A "logical" delete is transformed into a single update, that changes the values all of the non-key fields of the deleted row to "null". Thus, using a "logical" delete, the deleted row remains in the table, but its non-key data are removed. Alternatively, a "logical" delete may be handled by the system as if it were an update that changes the values of all non-key fields in the row to "null", without actually transforming the operation into an update.

[0205] In one embodiment of the database system, the type of delete that is used is an adjustable parameter, permitting users to specify whether deletes are to be handled as "physical" deletes or as "logical" deletes.

[0206] The order in which edits are processed may also vary. In one embodiment of the system, edits have different priorities. For example, deletes could be given the highest priority, so all deletes are handled first. Updates have the second highest priority, so all updates are handled next. Inserts have the lowest priority, so they are handled last. These priorities may vary, and may be specified in the metadata. If the metadata does not specify priorities, then all operations may be given equal priority, so that edits will be handled in the order that they are received.

[0207] In an embodiment that uses such priorities, edits can be handled differently, or even ignored based on their priority, and on the other edits that are in a log. The following procedure may optionally be applied to handle prioritized edits. Note that this procedure assumes that delete edits are handled as "logical" deletes, and are directly processed by the system, rather than first being transformed. First, a search is performed for each new edit, based on its key. If the edit is not found in the table, the edit is transformed into an insert. Note that this generally only affects updates for which a key is not found. If the key is found, the priority for the edit is compared to the priority of the match that was found (assuming that the match was found in the logs). If the priority is greater than or equal to the match, then the edit is applied without a change (though a "duplicate key" warning may be issued in the case of insert and delete edits). If the priority of the edit is less than the priority of the matched element, then the edit is ignored.

[0208] This optional procedure may be useful in cases where the order of edits received by the system is not the same as the order in which the transactions occurred. By using priorities in conjunction with the procedure provided above, a reasonable order may be given to the edits, and edits that will have no effect (e.g., because the row they modify has been deleted) may be ignored. This priority-based handling of edits may optionally be performed early in the process of handling transactions, such as in an adapter listener 808, resource manager 804, or in other processes in transaction handling subsystem 100.

[0209] The following examples demonstrate how edits coming into the system as a result of transactions are propagated through the operations that form table rules (i.e. the fuse, link, and loop operations). In accordance with the present invention, applying an edit to a source table that is transformed by a join operation (i.e. a fuse, link, or loop) causes a set of zero or more edits to be added to logs associated with the indices that represent a combined table that results from such a join operation, and to a log associated with the combined table.

[0210] Referring to FIGS. 15A-15C, an example of a fuse operation with updates due to transactions in accordance with the present invention is described. FIG. 15A shows the starting state of tables T1 (1500) and T2
(1502), and of combined table TC (1504), which is the result of applying a fuse operation to T1 (1500) and T2 (1502). It should be noted that table TC (1504) is shown here for illustrative purposes only, and need not actually be constructed, due to the use of T1 INDEX (1506) and T2
INDEX (1508), which represent the combined table TC (1504). TC (1504) is a "virtual" table, in the sense that it does not actually exist in memory.

[0211] All of table T1 (1500), table T2 (1502), index T1 INDEX (1506), and index T2 INDEX (1508) as shown in FIG. 15A typically would have been generated during batch processing or other initialization stages, and would already exist at the time that transaction processing begins. Use of minimal recalculation engine 810 permits application of relationships specified in a model to perform analysis of historical data, such as is shown in FIG. 15A, in combination with incoming transaction data, as will be described below.

[0212] As can be seen in FIG. 15A, the fuse operation takes two tables (in this case, T1 (1500) and T2 (1502), which are mapped one-to-one on a key field in each table (in this case, key field K1 (1510) in T1 (1500) and key field K2 (1512) in T2 (1502)), and creates a single table whose fields are the union of the fields of the two tables (in this case, combined table TC (1504)). As mentioned, the combined table TC can be virtual, and exist only in the form of the indices T1 INDEX 1506 and T2
INDEX 1508. These indices that represent the combined tables indicate which rows from the source tables map to the rows of the combined table. For example, T1 INDEX (1506) contains {0 1 2 - - }. This indicates that row 0 (counting from 0) of the combined table TC (1504) contains values from the fields of row 0 of T1 (1500), row 1 of the combined table contains values from the fields of row 1 of T1, row 2 of the combined table contains values from the fields of row 2 of the combined table, row 3 of the combined table does not correspond to a row in T1, and row 4 of the combined table does not correspond to a row in the combined table. Similarly, T2 INDEX (1508), which contains {0 - - 2 1 3}, indicates that row 0 of the combined table TC (1504) contains values from the fields of row 0 of T2, row 1 of the combined table does not correspond to a row of T2 (since T2 does not contain the key "B"), row 2 of the combined table contains values from the fields of row 2 of T2, row 3 of the combined table contains values from the fields of row 1 of T2, and row 4 of the combined table contains values from the fields of row 3 of T2.

[0213] Referring now to FIG. 15B, the results of performing a transaction that affects one of the source tables of the fuse operation are described. In FIG. 15B, an example transaction 1520 calls for an insert, which is a new row to be inserted into table T1 (1500). This insert is represented by an entry 1520 in T1-LOG 1522. Entry 1520 is an edit having an operation-type "i", for insert, index 3, indicating that the item is to be inserted as row 3 of table T1 (i.e. after the current last row, counting from 0), a key value K1 of "D", and other values associated with table T1, as listed in entry 1520.

[0214] Insertion of a new record (or row) into T1 will affect the combined table formed by the fused combination of T1 and T2. Therefore, to process the edit in T1-LOG 1522, minimal recalculation engine 810 will generate T1 INDEX-LOG 1524, to contain edits to T1 INDEX 1506, T2 INDEX-LOG 1526,-to contain edits to T2 INDEX 1508, and COMBINED LOG 1528, to contain edits to the combined table TC 1504.

[0215] Since the key K1 in entry 1520 is "D", which is not already present in table T2 1502, insertion of entry 1520 into T1 1500 will cause the addition of a new row in the combined table, which means that insertion of entry 1520 requires insertion of a new row into both T1 INDEX 1506 and T2 INDEX 1508. Minimal recalculation engine 810 determines the required inserts to the indices, and adds them to T1 INDEX-LOG 1524, and T2 INDEX LOG 1526, respectively. As can be seen, in this example, an insert edit at row 5 with a value 3 is added to T1 INDEX-LOG 1524, indicating that T1
INDEX 1506 should have a 3 inserted in (new) row 5 (i.e. row 5 of the combined table takes values from row 3 of T1). An insert edit at row 5 of null ("-") is added to T2 INDEX-LOG 1526 (i.e. row 5 of the combined table does not correspond to a row of T2, since T2 does not contain the key "D"). Minimal recalculation engine 810 also generates an entry in COMBINED LOG 1528, indicating that a new row 5 should be inserted, with values as shown.

[0216] Although combined table TC 1504 does not actually exist in memory, being represented by T1 INDEX 1506 and T2 INDEX 1508, since relatively few operations are required to generate the entry in COMBINED LOG 1528, as an optimization, COMBINED LOG 1528 is actually generated, even though it could be generated from T1 LOG 1522, T1 INDEX-LOG 1524, and T2-INDEX LOG 1526. Additionally, it should be noted that in actual operation, the edits in the combined table log will typically contain only the operation and index, until an entire table rule (which may include numerous fuse, link, and loop operations) is applied, after which the data fields will be added to the final combined log. The data fields are shown here for illustrative purposes.

[0217] In general, an insert of a row that has a completely new key (i.e. a key that is new across all the tables that are being fused), such as is shown in the example of FIG. 15B, causes inserts to be added to each of the table index logs, and an insert to be added to the combined table log. An insert of a row having a key that is present in one of the other tables that is being fused causes an update to be added to the index log of the table being edited, and an update to be added to the combined table log.

[0218] Referring now to FIG. 15C, a different transaction that causes changes to the original data shown in FIG. 15A is described. In FIG. 15C, entry 1540 of T1-LOG 1542 indicates that row 1 (counting from 0) of T1
1500 should be updated to change the value of F2. This edit will not require any change to T1 INDEX 1506 or T2 INDEX 1508, since the keys are not altered. The only change that must propagate through the fuse operation is an update to the data in the combined table to reflect the change in the data in table T1 1500. Therefore, minimal recalculation engine 810 need only add an entry to COMBINED LOG 1544 updating the values in the appropriate row of the combined table.

[0219] In general, an update of a non-key in any table that is being fused will cause an update to be added to the combined table log. Updates to the keys of tables that are being fused are not allowed, except when they are caused by a delete, as described below.

[0220] In FIG. 15D, the results of a transaction that causes the deletion of a row from one of the source tables of a fuse operation is shown. Entry 1560 in T1-LOG 1562 indicates that row 1 (counting from 0) of table T1 1500 should be deleted. In this example, a "physical" delete is being used, so the delete is handled by transforming the delete edit into two update edits. The first update moves the data from the last row of the table in which a delete is occurring to the row that is being deleted. The second update changes the last row of the table to contain all "null" values. Such null entries of the table may be truncated from the end of the table. Accordingly, as shown in FIG. 15D, entry 1560 is transformed into entries 1564 and 1566 in updated T1-LOG 1568. These two update edits are then carried out, adding edits to T1 INDEX 1570, T2 INDEX 1572, and COMBINED LOG 1574, as shown.

[0221] Note that an update to a key occurs as a result of transforming the delete into two edits. As described above, such an update would not normally be permitted. An exception is made in the case of an update resulting from a delete edit. Generally, an update to a key in a fuse operation resulting from a delete will require updates to the indices of the tables being fused, and to the combined log.

[0222] FIGS. 16A-16B show a link operation, which is used to join tables when a one-to-many relation is present. In FIG. 16A, the starting state of tables T1 (1600) and T2 (1602), and of combined table TC (1604), which is the result of applying a link operation to T1 (1600) and T2 (1602) is shown. As before, table TC (1604) is shown for illustrative purposes only, and need not actually be constructed in full, due to the use of T1
INDEX (1606) and T2 INDEX (1608), which represent the combined table TC (1604). Also as before, all of table T1 (1600), table T2 (1602), index T1
INDEX (1606), and index T2 INDEX (1608) as shown in FIG. 16A would have been generated during batch processing, and would already exist at the time that transaction processing begins.

[0223] As can be seen in FIG. 16A, the a link operation takes two tables (in this case, T1 (1600) and T2 (1602), where a key (field K2 1610 of table T2 1602 in this example) of one of the tables is mapped one-to-many to a link field (field F1 1612 in this example) of the other table. The result is a table in which the records of the "one" table (T2 1602 in this example) have been replicated to match those of the "many" table (T1
1600 in this example). Thus, in the combined table TC 1604, each row in which F1 field 1612 matches K2 field 1610, the values corresponding to the row containing that value of K2 are replicated into combined table TC 1604. As in the previous example, the indices T1 INDEX 1606 and T2 INDEX 1608, that represent the combined table indicate which rows from the source tables map to the rows of the combined table.

[0224] Referring now to FIG. 16B, the results of performing an update edit and an insert edit that affect one of the tables being joined in the link operation are described. Entry 1620 in T1-LOG 1622 is an edit that updates row 1 (counting from 0) of table T1 1600. Because the value of the link field F1 1612 would be changed by this operation, the values in T2 INDEX 1608 may need to be updated. The needed update to T2 INDEX 1608
as a result of this update edit are added as entry 1624 in T2 INDEX-LOG 1628, indicating that row 1 of T2 INDEX 1608 should be updated to the value 1 (i.e. row 1 of the combined table will contain values from row 1
of T2). It is also necessary to add entry 1630 to COMBINED LOG 1634, indicating the changes to row 1 of combined table TC 1604 as a result of the update.

[0225] In general, an update of a link field in the "many" table causes updates to be added to the index logs of all the "one" tables to contain the appropriate matching row values. Additionally, it causes an update to be added to the combined table log. An update to a non-key field of the "one" table causes updates to be added to the combined table log for each position in the combined table that corresponds to a position in the index of the one table that contains the row number of the row that was updated. Updates to key fields of the "one" table are typically not permitted, unless they occur due to a delete operation. Such updates are a special case, in which minimal recalculation may not be the fastest approach.

[0226] Entry 1621 of T1-LOG 1622 is an insert edit that would add a new row to table T1 1600. This insert causes addition of an insert edit as entry 1636 in T1 INDEX-LOG 1638, and an insert edit as entry 1626 of T2
INDEX-LOG 1628. Additionally the insert transaction causes entry 1632, an insert edit, to be added to COMBINED LOG 1634.

[0227] In general, an insert of a row containing a link field value into the "many" table causes inserts to be added to the index logs of all of the "one" tables, an insert to be added to the index log of the "many" table, and an insert to be added to the combined log. An insert of a new row into a "one" table may cause numerous updates to be added to the index log of that table, and the same number of updates to be added to the combined log, or nothing. This is because the new key fields may match the link fields which did not have any matching key fields before, and thus had nulls in the index vector of the "one" table wherever that link field appeared in the "many" table. On the other hand, the new key fields may not have any matches in the "many" table, in which case nothing needs to be done.

[0228] FIGS. 17A-17C illustrate a loop operation, which is used to combine tables when a many-to-many relation is present. In FIG. 17A, the starting state of tables T1 (1700) and T2 (1702), and of combined table TC (1704), which is the result of applying a loop operation to T1 (1700) and T2
(1702) is shown. As before, table TC (1704) is shown for illustrative purposes only, and need not actually be constructed in full, due to the use of T1 INDEX (1706) and T2 INDEX (1708), which represent the combined table TC (1704). Also as before, all of table T1 (1700), table T2 (1702), index T1 INDEX (1706), and index T2 INDEX (1708) as shown in FIG. 16A would have been generated during batch processing, and would already exist at the time that transaction processing begins.

[0229] As can be seen in FIG. 17A, the loop operation takes two tables (in this case, T1 (1700) and T2 (1702), where a first link field (field F1
1710 of table T1 1700 in this example) of one of the tables is mapped many-to-many to a second link field (field F4 1712 in this example) of the other table. The result is a table having a row for each combination of matching values of the first and second link fields in the two tables. Thus, in the combined table TC 1704, for each row in which a value of F1
field 1710 matches at least one value of F4 field 1712, the combined table TC 1704 will have a row combining the values of rows from T1 1700
and T2 1702 for each element in F4 field 1712 that matches the value of F1 field 1710. As in the previous examples, the indices T1 INDEX 1706 and T2 INDEX 1708, that represent the combined table indicate which rows from the source tables map to the rows of the combined table.

[0230] Referring now to FIG. 17B, the result of performing an insert edit that affects one of the tables being joined in the loop operation is described. Entry 1720 in T2-LOG 1722 is an edit that inserts a new row at the end of table T2 1702. Because this adds an additional value to field F4 1712 in T2 1702 that corresponds to two values in field F1 1710 in T1
1700, two new rows will be added to the combined table. This is reflected in the addition of two insert edits in T1 INDEX-LOG 1724, two insert edits in T2 INDEX-LOG 1726, and two insert edits in COMBINED LOG 1728.

[0231] For edits on many-to-many table operations, an insert of a row containing new or existing link field values into one of the tables, such that the new values are not present in at least one of the other tables involved in the many-to-many relation causes no effect. An insert of a row containing a new or existing link field value into one of the tables, such that the link field value is present in other tables involved in the many-to-many relation causes a number of inserts to be added to the index logs of each of the tables, and to the combined table log, equal to the product of the number of occurrences of the values in the other tables.

[0232] In FIG. 17C, the result of performing an update edit that affects one of the source tables of the loop operation is described. Entry 1730
in T1-LOG 1732 is an edit that updates a non-link field of table T1 1700. Because this only changes the values of fields in the combined table, but not the positions in table T1 1700 from which those values are taken, the update has no effect on T1 INDEX 1706 or T2 INDEX 1708. Accordingly, there are no additions to T1 INDEX-LOG 1734 or T2 INDEX-LOG 1736. The update will require an update to the combined table everyplace where the entry that is being updated occurs in the combined table. Since the value being updated occurs in combined table TC 1704 at indices 0 and 2, updates are added to the combined table log with indices 0 and 2, with the updated value.

[0233] In general, an update value to a non-link field in a loop operation causes as many updates to be added to the combined table log as there are occurrences of the row that was updated in the combined table. If the updated row is not present in the combined table, no updates will be added to the combined table log. Note that an update to a link field in a loop operation is a special case that could lead to many edits being added to the logs of the table indices and the combined table. In general, such updates should only be permitted if they result from a delete edit.

[0234] Referring now to FIGS. 18A-18B, an example of an aggregation operation is shown. Such aggregation operations are used to represent many-to-one relations, and occur only after the table rule has been applied, to convert a combined table (which results from application of various fuse, link, and loop operations) into the target table. Unlike the fuse, link, and loop operations which were described above, aggregation operations are used to compute values for fields in a target table. When transactions affect the table on which the aggregation operation occurs, these changes must be propagated through the aggregation operation, to affect the values of the target fields.

[0235] In FIG. 18A, the starting state of tables T1 1800 and the target table TARGET 1802 are shown. It should be noted that in this example, T1
is a combined table that may have resulted from operation of the join rules discussed above, such as fuse, link, and loop. In this example, the field F1 1806 in table T1 1800 is mapped many-to-one to the field K 1808
in the table TARGET 1802. To perform this mapping, all of the values in field K1 1804 that have the same values in field F1 1806 must be combined to provide a value for field F 1810 in table TARGET 1802. In this example, a SUM operation is used to combine the values from field K1
1804. Thus, in the row of table TARGET 1802 having a value in field K 1808 of "A", the value of field F 1810 will be the sum of all of the elements in field K1 180