United States Patent Application20050022115
Kind CodeA1
Baumgartner, Roberts ; et al.January 27, 2005

Visual and interactive wrapper generation, automated information extraction from web pages, and translation into xml
Abstract
A method and a system for information extraction from Web pages formatted with markup languages such as HTML [8]. A method and system for interactively and visually describing information patterns of interest based on visualized sample Web pages [5,6,16-29]. A method and data structure for representing and storing these patterns [1]. A method and system for extracting information corresponding to a set of previously defined patterns from Web pages [2], and a method for transforming the extracted data into XML is described. Each pattern is defined via the (interactive) specification of one or more filters. Two or more filters for the same pattern contribute disjunctively to the pattern definition [3], that is, an actual pattern describes the set of all targets specified by any of its filters. A method and for extracting relevant elements from Web pages by interpreting and executing a previously defined wrapper program of the above form on an input Web page [9-14] and producing as output the extracted elements represented in a suitable data structure. A method and system for automatically translating said output into XML format by exploiting the hierarchical structure of the patterns and by using pattern names as XML tags is described.

Inventors:Baumgartner; Roberts (Vienna, IT), I'Lesca; Sergio  (Ronde, IT), Gottlob; Georg  (Vienna, IT), Herzoo; Marcus  (Vienna, AT)
Correspondence Name and Address:2100 PENNSYLVANIA AVENUE, N.W. SUITE 800
SUGHRUE MION, PLLC
WASHINGTON
DC
20037
US
Series Code:479039

Claims


What is claimed is:
1. A wrapper generation system comprising: a network including at least one example document and at least one production document; and a visual builder that is adapted to interactively generate a wrapper program by letting a user visually and interactively declare at least one desired property of example-elements to be extracted from the example document thereby creating user declarations; a program evaluator adapted to execute a wrapper program over the production document and to extract desired production elements from the production document and to translate the production elements into XML yielding an XML companion of the production document.

2. The wrapper generation system of claim 1, wherein the visual builder further includes: an extraction pattern builder adapted to provide a visual interface for a user to specify at least one desired pattern whose instances are to be extracted from the document.

3. The wrapper generation system of claim 2, wherein the visual builder further includes: an XML translation builder adapted to interactively generate an XML translation scheme based on user specifications that specifies how to translate at least one pattern into XML using XML translation rules.

4. The wrapper generation system of claim 1, wherein the program evaluator further includes: an extractor that extracts data from the production document and provides an XML document.

5. The wrapper generation system of claim 2, further including: an XML translator that performs actual mapping from a set of pattern instances to an XML document.

6. The wrapper generation system of claim 1, wherein the example document is received in a browser window.

7. The wrapper generation system of claim 1, wherein a user controls wrapper generation, selects examples from the example document and adds further user specifications.

8. The wrapper generation system of claim 7, wherein the pattern is generated from conditions that are obtained from the examples and the user specification.

9. The wrapper generation system of claim 2 wherein the specification of a pattern is organized according to a logical combination of conditions.

10. The wrapper generation system of claim 9, wherein on applying the wrapper to the production document a data structure representing all elements that match the pattern is produced.

11. The wrapper generation system of claim 10 wherein the wrapper is applied to a plurality of production documents.

12. The wrapper generation system of claim 10, wherein the data structure is further translated to at least one output data structure.

13. The wrapper generation system of claim 9 wherein at least one of said conditions is a location condition that is obtained by generalizing a location descriptor of a selected example-element.

14. The wrapper generation system of claim 9 wherein at least one of said conditions is a refinement condition that refines a set of instances of the pattern based on explicit user commands.

15. The wrapper generation system of claim 14 wherein the refinement condition is an internal condition expressing properties which the instances of the pattern must fulfill regardless of their context.

16. The wrapper generation system of claim 14 wherein the refinement condition is a contextual condition that expresses restrictions on the context in which instances of the pattern are allowed to appear.

17. The wrapper generation system of claim 14 wherein the refinement conditions is a range condition that expresses that among the pattern instances which satisfy all other imposed conditions of a given filter, only those of each context are to be considered valid, which occur in that context within specified position ranges with respect to document order, and

18. The wrapper generation system of claim 14 wherein the refinement condition is an auxiliary condition which further refines the set of instances.

19. The wrapper generation system of claim 1 wherein the example document is a Web page.

20. The wrapper generation system of claim 1 wherein at least one production document is a Web page.

21. The wrapper generation system of claim 1 wherein the example document is formatted in HTML and where at least one of said example-elements corresponds to a node or to a sequence of nodes of an HTML parsing-tree of the example-document in which said example-element occurs.

22. The wrapper generation system of claim 1 wherein the production documents are formatted in HTML and where at least one of said production-elements corresponds to a node or to a sequence of nodes of an HTML parsing-tree of the production document in which said example-element occurs.

23. The wrapper generation system of claim 1 wherein the example-documents is formatted in HTML and where at least one of said example-elements corresponds to an HTML element whose type is selected from a group consisting of: html, table, tr, th, tfoot, thead, col, colgroup, caption, td, p, br, div, blockquote, body, head, h1, h2, h3, h4, h5, h6, dl, dd, dt, ol, ul, li, dir, menu, form, input, select, option, address, center, pre, xmp, nobr, wbr, hr, img, b, i, font-size, font-color, underline, blink, a (anchors), href, tt, big, sup, sub, cite, code, strong, em, samp, area, map, script, and CSS styles applied to the content and structure.

24. The wrapper generation system of claim 1 wherein the production document is formatted in HTML and where at least one of said production-elements corresponds to an HTML element whose type is selected from a group consisting of: html, table, tr, th, tfoot, thead, col, colgroup, caption, td, p, br, div, blockquote, body, head, h1, h2, h3, h4, h5, h6, dl, dd, dt, ol, ul, li, dir, menu, form, input, select, option, address, center, pre, xmp, nobr, wbr, hr, img, b, i, font-size, font-color, underline, blink, a (anchors), href, tt, big, sup, sub, cite, code, strong, em, samp, area, map, script, and CSS styles applied to the content and structure.

25. The wrapper generation system of claim 13, wherein said generalized location descriptor is obtained from a corresponding location descriptor by syntactic generalization operations.

26. The wrapper generation system of claim 25, wherein said generalized location descriptor is obtained by inserting zero or more wildcards, by substituting wildcards for zero or more elements and by eliminating zero or more elements from said location descriptor.

27. The wrapper generation system of claim 13, wherein the example-document is tree-structured and wherein each location descriptor, called plain tree path, created for an example-element occurring in an example-document corresponds to the sequence of element-types on the path in the parsing tree of said example-document from the root to said example-element, and where the corresponding generalized location descriptor, called incompletely specified tree path, is obtained from said location descriptor by inserting zero or more wildcards, by substituting wildcards for zero or more elements and by eliminating zero or more elements from said location descriptor.

28. The wrapper generation system of claim 13, wherein at least one production-document is tree-structured and wherein each location descriptor, called plain tree path, created for an example-element occurring in an example-document corresponds to the sequence of element-types on the path in the parsing tree of said example-document from the root to said example-element, and where the corresponding generalized location descriptor, called incompletely specified tree path, is obtained from said location descriptor by inserting zero or more wildcards, by substituting wildcards for zero or more elements and by eliminating zero or more elements from said location descriptor.

29. The wrapper generation system of claim 13, wherein at least one production-document is in HTML format and wherein each location descriptor, called plain tree path, created for an example-element occurring in an example-document corresponds to the sequence of HTML tags on the path in the parsing tree of said example-document from the root to said example-element, and where the corresponding generalized location descriptor, called incompletely specified tree path, is obtained from said location descriptor by inserting zero or more wildcards, by substituting wildcards for zero or more HTML tags and by eliminating zero or more elements from said location descriptor.

30. The wrapper generation system of claim 13, wherein said generalized location descriptor is set to a default value that is obtained from said location descriptor by prefixing each element in the location descriptor with a wildcard.

31. The wrapper generation system of claim 29 wherein a wildcard prefixing an element occurring within the generalized location descriptor is interpreted as an arbitrary chain of elements whose type differs from a type corresponding to said element.

32. The wrapper generation system of claim 29 wherein a user is given a possibility to change and overwrite said default value.

33. The wrapper generation system of claim 2, wherein each of the pattern is selected from a set consisting of tree patterns, string patterns and document patterns, wherein instances of said tree patterns are subtrees of document parse-trees, instances of said string patterns are substrings of text strings occurring on some documents and instances of said document patterns are documents.

34. The wrapper generation system of claim 2, wherein the pattern description generated for the pattern contains a Boolean combination of all conditions that contribute to the definition of said pattern.

35. The wrapper generation system of claim 2, wherein the pattern description generated for the pattern consists of a set of rules formulated in a logic programming language.

36. The wrapper generation system of claim 2, wherein the pattern description generated for the pattern consists of a set of rules formulated in a logic programming language and where said wrapper consists of a logic program containing said rules.

37. The wrapper generation system of claim 36 wherein the variables and terms of said logic program range over elements occurring in the documents to which said wrapper is applied.

38. The wrapper generation system of claim 36 wherein the variables and terms of said logic program range over elements and strings occurring in the documents to which said wrapper is applied.

39. The wrapper generation system of claim 36 wherein the variables and terms of said logic program range over elements, element-lists, and strings occurring in the documents to which said wrapper is applied.

40. The wrapper generation system of claim 36 wherein said example document and production document are tree-structured and where the variables of said logic program range over nodes of the parsing trees of documents to which said wrapper is applied.

41. The wrapper generation system of claim 36 wherein said example document and production document are tree-structured and where the variables of said logic program range over nodes of the parsing trees and over strings occurring in the documents to which said wrapper is applied.

42. The wrapper generation system of claim 36 wherein said production document and example document are tree-structured and where the variables of said logic program range over nodes and sequences of nodes of the parsing trees of documents to which said wrapper is applied.

43. The wrapper generation system of claim 36 wherein said production document and example document are tree-structured and where the variables of said logic program range over nodes of the parsing tree of documents to which said wrapper is applied, over sequences of such nodes, and over strings occurring in documents to which said wrapper is applied.

44. The wrapper generation system of claim 36 wherein each condition that contributes to the definition of a pattern is represented by at least one atom in the body of a rule of said logic program which contributes to the definition of said pattern.

45. The wrapper generation system of claim 36 wherein each of said logic programming rules represents a filter consisting of a rule body that represents all conditions contributing to said filter in form of a conjunction of atoms and of a rule head whose predicate corresponds to the pattern said filter is associated with.

46. The wrapper generation system of claim 36 wherein said logic program is a datalog program making use of special predefined predicates for expressing conditions.

47. The wrapper generation system of claim 15 wherein the internal condition belongs to a set consisting of contains conditions and noncontains conditions wherein the contains conditions impose one or more restrictions on some subelement of the pattern to be defined and the notcontains conditions require that instances of the pattern to be defined do not contain any subelement that satisfies specified restrictions.

48. The wrapper generation system of claim 15 wherein an element of a production document or example document is either an ordinary element or a sequence of contiguous elements, and wherein each of said internal conditions belongs to a set consisting contains conditions, notcontains conditions, firstsubtree conditions and lastsubtree conditions; wherein contains conditions impose one or more restrictions on some subelement of the pattern to be defined, notcontains conditions require that instances of the pattern to be defined do not contain subelements satisfying specified restrictions, firstsubtree conditions require that some element with specified properties be the first element of a sequence of elements to be defined; and lastsubtree conditions requires that some element with specified properties be the last element of a sequence of elements to be defined.

49. The wrapper generation system of claim 48 where said restrictions are selected from a set consisting of restrictions on font type, restrictions on font size, restrictions on font color, restrictions on the text contained in said subelement, restrictions on hyperlinks and other anchors, restrictions on explicit positional parameters, restrictions on element types, and restrictions on the value of a hidden attribute.

50. The wrapper generation system of claim 16 where said contextual conditions are selected from a group consisting of before conditions, after conditions, notbefore conditions, notafter conditions and vertical positional conditions, wherein before conditions require that some element with specified properties must occur before a pattern instance, after conditions require that some element with specified properties must occur after a pattern instance, notbefore conditions require that some element with specified properties must not occur before a pattern instance, notafter conditions require that some element with specified properties must not occur after a pattern instance; and vertical positional conditions require that a pattern instance must occur in some vertical position relative to a specified element, where said vertical position is in a set comprising: below, above, not below, and not above.

51. The wrapper generation system of claim 1 wherein said example documents and said production documents are XML documents.

52. The wrapper generation system of claim 1 wherein said example documents and said production documents consist of unformatted text.

53. The wrapper generation system of claim 1 wherein said example documents and said production documents consist of text structured by structuring features from a set containing headlines, and titles, and spaces, and paragraphs, and indentations, and lists.

54. The wrapper generation system of claim 1 wherein said example documents and said production documents are formatted according to some document processing standard.

55. The wrapper generation system of claim 18 wherein said auxiliary conditions are selected from a group consisting of concept conditions, comparison conditions, pattern reference conditions, and conditions on documents, where concept conditions require that some part of a pattern to be defined matches an ontological concept definition, comparison conditions impose restrictions on the values of concept items, pattern reference conditions require that a pattern to be defined, or some element used within a pattern definition, be an instance of a pattern defined within the same wrapper; and conditions on document filters express restrictions on further documents to be loaded.

56. A method for visual and interactive generation of a wrapper for documents and for automated information extraction comprising: letting a user visually and interactively declare at least one desired property of elements to be extracted from the document thereby creating user declarations; translating the user declarations into a wrapper; executing a wrapper over the document; and extracting elements from said documents that match said user declarations.

57. A method for visual and interactive generation of wrappers for documents, and for automated information extraction comprising: defining extraction patterns on at least one example page, by visually and interactively selecting example-elements occurring on the example-page; visually and interactively declaring properties of the extraction patterns; generating a wrapper; applying the wrapper to at least one production document; and automatically extracting matching instances of the extraction patterns from the production documents.

58. The method of claim 57, wherein the example document is received in a browser window.

59. The method of claim 57, wherein a user controls wrapper generation, selects examples from the example document and adds further user specification.

60. The method of claim 59, wherein at least one pattern is generated from conditions that are obtained from the examples and the user specification.

61. The method of claim 57 wherein at least one extraction pattern is organized according to a logical combination of conditions.

62. The method of claim 57, wherein on applying to the document the wrapper produces a data structure representing all elements that match said pattern descriptions.

63. The method of claim 62 wherein the wrapper is applied to a plurality of production documents.

64. The method of claim 62, wherein the data structure produced by the application of the wrapper to the document is further translated to at least one output data structure.

65. The method of claim 61 wherein at least one of said conditions is a location condition that is obtained by generalizing a location descriptor of a selected example-element, and

66. The method of claim 61 wherein at least one of said conditions is a refinement condition that refines the set of instances of a pattern based on explicit user commands.

67. The method of claim 57 in wherein the processes of generation of a pattern further comprises: a) receiving from a user a pattern name and storing said name; b) creating and storing a filter for the pattern; c) visualizing the set of instances of the filter on at least one example document by evaluating the filter over the document and visualizing all data elements of the document that are matching instances, whereby a user can test the filter; d) modifying a previously created filter by adding to it refinement conditions that the instances of the filter must fulfill, where the refinement conditions are obtained from a user by receiving interactive commands from the user and where the refinement conditions are combined with those conditions for the filter that were added earlier; e) visualizing simultaneously all instances of all filters of the given pattern on at least one document by evaluating its corresponding pattern description against the document, whereby a user can test the pattern description constructed so far.

68. The method of claim 67, wherein the creation and storing of a filter comprises: b.1: selecting an example-element occurring in at least one example document by receiving selection commands from a user acting on the example document while the example document is displayed in a browser, and creating a location descriptor for said example-element with respect to said example-document; b.2: generalizing said location descriptor by replacing it with a generalized location descriptor that matches at least said example-element, and storing said generalized location descriptor in form of a location condition that can be evaluated over data elements; b.3: representing and storing the association between said pattern name and said generalized location descriptor, which together constitute a filter for said pattern and where data elements matching said generalized location descriptor are called instances of said filter;

69. The method of claim 67, wherein the steps are executed one or more times and the order of execution and number of repetitions is determined by interactive user actions.

70. The method of claim 67 wherein one of said refinement conditions is an internal condition expressing properties which the instances of a pattern must fulfill regardless of their context.

71. The method of claim 67 wherein one of said refinement conditions is a contextual condition that expresses restrictions on the context in which instances of a pattern are allowed to appear, and

72. The method of claim 67 wherein one of said refinement conditions is a range condition that expresses that among the pattern instances which satisfy all other imposed conditions of a given filter, only those of each context are to be considered valid, which occur in that context within specified position ranges with respect to document order, and

73. The method of claim 67 wherein one of said refinement conditions is an auxiliary condition which further refine the set of instances.

74. The method of claim 57 wherein at least one of said example-documents is a Web page.

75. The method of claim 57 wherein at least one of said production documents is a Web page.

76. The method of claim 57 wherein at least on of said example documents is formatted in HTML and where each of said example-elements corresponds to a node or to a sequence of nodes of an HTML parsing-tree of the example-document in which said example-element occurs.

77. The method of claim 57 wherein at least on of said production documents is formatted in HTML and where at least one of said example-elements corresponds to a node or to a sequence of nodes of an HTML parsing-tree of the production document in which said example-element occurs.

78. The method of claim 57 wherein at least one of said example-documents is formatted in HTML and where at least one of said example-elements corresponds to an HTML element whose type is selected from a group consisting of: html, table, tr, th, tfoot, thead, col, colgroup, caption, td, p, br, div, blockquote, body, head, h1, h2, h3, h4, h5, h6, dl, dd, dt, ol, ul, li, dir, menu, form, input, select, option, address, center, pre, xmp, nobr, wbr, hr, img, b, i, font-size, font-color, underline, blink, a (anchors), href, tt, big, sup, sub, cite, code, strong, em, samp, area, map, script, and CSS styles applied to the content and structure.

79. The method of claim 57 wherein at least one of said production documents are formatted in HTML and where at least one of said example-elements corresponds to an HTML element whose type is selected from a group consisting of: html, table, tr, th, tfoot, thead, col, colgroup, caption, td, p, br, div, blockquote, body, head, h1, h2, h3, h4, h5, h6, dl, dd, dt, ol, ul, li, dir, menu, form, input, select, option, address, center, pre, xmp, nobr, wbr, hr, img b, i, font-size, font-color, underline, blink, a (anchors), href, tt, big, sup, sub, cite, code, strong, em, samp, area, map, script, and CSS styles applied to the content and structure.

80. The method of claim 69, wherein said generalized location descriptor is obtained from a corresponding location descriptor by syntactic generalization operations.

81. The method of claim 80, wherein said generalized location descriptor is obtained by inserting zero or more wildcards, by substituting wildcards for zero or more elements and by eliminating zero or more elements from said location descriptor.

82. The method of claim 69, wherein at least one example-document is tree-structured and wherein each location descriptor, called plain tree path, created for an example-element occurring in an example-document corresponds to the sequence of element-types on the path in the parsing tree of said example-document from the root to said example-element, and where the corresponding generalized location descriptor, called incompletely specified tree path, is obtained from said location descriptor by inserting zero or more wildcards, by substituting wildcards for zero or more elements and by eliminating zero or more elements from said location descriptor.

83. The method of claim 69, wherein at least one production-document is tree-structured and wherein each location descriptor, called plain tree path, created for an example-element occurring in an example-document corresponds to the sequence of element-types on the path in the parsing tree of said example-document from the root to said example-element, and where the corresponding generalized location descriptor, called incompletely specified tree path, is obtained from said location descriptor by inserting zero or more wildcards, by substituting wildcards for zero or more elements and by eliminating zero or more elements from said location descriptor.

84. The method of claim 69, wherein at least one production-document is in HTML format and wherein each location descriptor, called plain tree path, created for an example-element occurring in an example-document corresponds to the sequence of HTML tags on the path in the parsing tree of said example-document from the root to said example-element, and where the corresponding generalized location descriptor, called incompletely specified tree path, is obtained from said location descriptor by inserting zero or more wildcards, by substituting wildcards for zero or more HTML tags and by eliminating zero or more elements from said location descriptor.

85. The method of claim 82, wherein said generalized location descriptor is set to a default value that is obtained from said location descriptor by prefixing each element in the location descriptor with a wildcard.

86. The method of claim 82 wherein a wildcard prefixing an element occurring within a generalized location descriptor is interpreted as an arbitrary chain of elements whose type differs from a type corresponding to said element.

87. The method of claim 85 wherein a user is given a possibility to change and overwrite said default value.

88. The method of claim 57, wherein commands related to said visual and interactively selecting said example-elements are given by a user acting directly on said example-documents, said acting is performed by means of one or more of a set of graphical user interface actions, said set including mouse-clicking and region marking.

89. The method of claim 57, wherein commands related to said visual and interactive declarations are given by a user acting by means of one or more of a set of user interface actions, said set including mouse-clicking, region marking on said example-documents, button selection, and insertion of data into dialog boxes.

90. The method of claim 57, wherein each of said patterns is selected from a set consisting of tree patterns, string patterns and document patterns, wherein instances of said tree patterns are subtrees of document parse-trees, instances of said string patterns are substrings of text strings occurring on some documents and instances of said document patterns are documents.

91. The method of claim 67, wherein said visualization of said instances of a filter and said simultaneous visualization of all instances of a pattern on selected documents is done by means at least one visual highlighting method, said visualizing method selected from a group consisting font color change, background color change, font type change, font size change, blinking, and underlining.

92. The method of claim 57, wherein the pattern description generated for each of said patterns contains a Boolean combination of all conditions that contribute to the definition of said pattern.

93. The method of claim 57, wherein the pattern description generated for each of said patterns consists of a set of rules formulated in a logic programming language.

94. The method of claim 57, wherein the pattern description generated for each of said patterns consists of a set of rules formulated in a logic programming language and where said wrapper consists of a logic program containing said rules.

95. The method of claim 94 wherein the variables and terms of said logic program range over elements occurring in the documents to which said wrapper is applied.

96. The method of claim 94 wherein the variables and terms of said logic program range over elements and strings occurring in the documents to which said wrapper is applied.

97. The method of claim 94 wherein the variables and terms of said logic program range over elements, element-lists, and strings occurring in the documents to which said wrapper is applied.

98. The method of claim 94 wherein the variables and terms of said logic program range over elements, element-lists, strings and attribute values occurring in the documents to which said wrapper is applied.

99. The method of claim 94 wherein said documents are tree-structured and where the variables of said logic program range over nodes of the parsing trees of documents to which said wrapper is applied.
100. The method of claim 94 wherein said documents are tree-structured and where the variables of said logic program range over nodes of the parsing trees and over strings occurring in the documents to which said wrapper is applied.
101. The method of claim 94 wherein said documents are tree-structured and where the variables of said logic program range over nodes and sequences of nodes of the parsing trees of documents to which said wrapper is applied.
102. The method of claim 94 wherein said documents are tree-structured and where the variables of said logic program range over nodes of the parsing tree of documents to which said wrapper is applied, over sequences of such nodes, and over strings occurring in documents to which said wrapper is applied.
103. The method of claim 94 wherein each condition that contributes to the definition of a pattern is represented by at least one atom in the body of a rule of said logic program which contributes to the definition of said pattern.
104. The method of claim 67 wherein the representation of each of said filters is organized according to a logical conjunction of all conditions contributing to said filter, and where the representation of each of said patterns is organized according to the logical disjunction of all filters contributing to said pattern, whereby a pattern can be defined via successive narrowing and broadening steps corresponding to the addition of a new condition to a filter of said pattern, and to the addition of a new filter for said pattern, respectively.
105. The method of claim 104 wherein the representation of each of said filters is organized as an explicit Boolean conjunction of all conditions contributing to said filter, and where the representation of each of said patterns is organized as an explicit logical disjunction of all filters contributing to said pattern, whereby a pattern can be defined via successive narrowing and broadening steps corresponding to the addition of a new condition to a filter of said pattern, and to the addition of a new filter for said pattern, respectively.
106. The method of claim 67 wherein each of said filters is represented as a logic programming rule and wherein each pattern is represented by a set of rules representing filters defining said pattern, and wherein said wrapper is a logic program consisting of all rules representing said filters.
107. The method of claim 104 wherein each of said logic programming rules representing a filter consists of a rule body that represents all conditions contributing to said filter in form of a conjunction of atoms and of a rule head whose predicate corresponds to the pattern said filter is associated with.
108. The method of claim 107 wherein said logic program is a datalog program making use of special predefined predicates for expressing conditions.
109. The method of claim 70 wherein each of said internal conditions belongs to a set consisting of contains conditions and noncontains conditions wherein the contains conditions impose one or more restrictions on some subelement of the pattern to be defined and the notcontains conditions require that instances of the pattern to be defined do not contain any subelement that satisfies specified restrictions.
110. The method of claim 70 wherein an element of a document is either an ordinary element or a sequence of contiguous elements, and wherein each of said internal conditions belongs to a set consisting contains conditions, notcontains conditions, firstsubtree conditions and lastsubtree conditions; wherein contains conditions impose one or more restrictions on some subelement of the pattern to be defined, notcontains conditions require that instances of the pattern to be defined do not contain subelements satisfying specified restrictions, firstsubtree conditions require that some element with specified properties be the first element of a sequence of elements to be defined; and lastsubtree conditions requires that some element with specified properties be the last element of a sequence of elements to be defined.
111. The method of claim 110 where said restrictions are selected from a set consisting of restrictions on font type, restrictions on font size, restrictions on font color, restrictions on the text contained in said subelement, restrictions on hyperlinks and other anchors, restriction on explicit positional parameters, restriction on element types, and restrictions on the value of a hidden attribute.
112. The method of claim 111 where at least one of said restrictions specifies that the target object must contain some given substring.
113. The method of claim 111 where at least one of said restrictions specifies that the target object must coincide with some given value.
114. The method of claim 111 where at least one of said restrictions is specified in form of a regular expression.
115. The method of claim 111 where at least one of said restrictions is expressed in terms of a predefined ontological concept.
116. The method of claim 70 where said contextual conditions are selected from a group consisting of before conditions, after conditions, notbefore conditions, notafter conditions and vertical positional conditions, wherein before conditions require that some element with specified properties must occur before a pattern instance, after conditions require that some element with specified properties must occur after a pattern instance, notbefore conditions require that some element with specified properties should not occur before a pattern instance, notafter conditions require that some element with specified properties should not occur after a pattern instance; and vertical positional conditions require that a pattern instance must occur in some vertical position relative to a specified element, where said vertical position is in a set comprising: below, above, not below, and not above.
117. The method of claim 57 wherein said example documents and said production documents are XML documents.
118. The method of claim 57 wherein said example documents and said production documents consist of unformatted text.
119. The method of claim 57 wherein said example documents and said production documents consist of text structured by structuring features from a set containing headlines, and titles, and spaces, and paragraphs, and indentations, and lists.
120. The method of claim 57 wherein said example documents and said production documents are formatted according to some document processing standard.
121. The method of claim 70 wherein said auxiliary conditions are selected from a group consisting of concept conditions, comparison conditions, pattern reference conditions, and conditions on documents, where concept conditions require that some part of a pattern to be defined matches an ontological concept definition, comparison conditions impose restrictions on the values of concept items, pattern reference conditions require that a pattern to be defined, or some element used within a pattern definition, be an instance of a pattern defined within the same wrapper; and conditions on document filters express restrictions on further documents to be loaded.
122. A method for interpreting an extended logic program over Web pages, comprising: setting variables and terms of said logic program so that they form a range over nodes parsing trees of said Web pages, setting an identification of each Web page to an extensional database whose data elements are nodes of the parsing tree of said Web page; and establishing parent-child relationship between the nodes of said parsing tree to be a binary relation, and whose data elements are ordered according to document order.
123. The method of claim 122, wherein said extended logic program contains special atoms corresponding to conditions from a plurality of conditions selected from a group consisting location conditions, internal conditions and contextual conditions, wherein location conditions impose restrictions on the location of nodes of said parsing-tree internal conditions impose restrictions on nodes of said parsing trees, relying on properties of the subtrees rooted in said nodes, and contextual conditions, imposing restrictions on the context in which nodes may appear within said parsing tree; wherein, in addition to classical inference procedures of logic programming, specific procedures are provided for evaluating said special atoms.
124. The method of claim 123, wherein, said extended logic program consists of a datalog program in which by said special atoms are treated as atoms of built-in predicates.
125. The method of claim 123, wherein range intervals can be specified with each rule of said extended logic program, and wherein only those facts are computed by a rule that match the ranges of said intervals in the ordering induced by document order.

Description



I. INTRODUCTION

A. RELATED APPLICATIONS

[0001] The present application claims priority from the copending U.S. Provisional Application Ser. No. 60/294,213, having the same title filed May 31, 2001

B. BACKGROUND

[0002] 1. Field of Invention

[0003] This disclosure teaches techniques related in general to the field of information processing. More particularly, the teachings relate to methods, systems and computer-program products for information extraction from Web pages, to the construction of wrappers (i.e. extraction programs), based on example Web pages, and to the transformation of "relevant" parts of HTML documents into XML.

[0004] 2. Basic Concepts, Terminology, and Introduction

[0005] The World Wide Web (abbreviated as Web) is the world's largest data repository and information retrieval system. In this environment, client machines effect transactions to Web servers using the Hypertext Transfer Protocol (HTTP), which is an application protocol usually providing user access to files formatted in a standard page description language known as Hypertext Markup Language (HTML). HTML provides basic document formatting (and some logical markup) and allows the developer to specify "links" to other servers and documents. In the Internet paradigm, a network location reference to a server or to a specific Web resource at a server (for example a Web page) is identified by a so-called Uniform Resource Locator (URL) having a well-defined syntax for describing such a network location. The use of an (HTML-compatible) browser (e.g. Netscape Navigator, Microsoft Internet Explorer, Amaya or Opera) at a client machine involves the specification of a link by the means of an URL. The client then makes a request to the server (also referred to as "Web site") identified by the link and receives in return an HTML document or some other object of a known file type. Simple and less sophisticated browsers can easily be written in a short time and with little effort in (object-oriented) programming languages such as Java, where powerful program libraries are available that already contain modules or classes providing the main functionalities of browsers (for example, the JEditorPane class of the javax.swing package).

[0006] Browsers or other applications working with HTML documents internally represent an HTML document in the form of a tree data structure that basically corresponds to a parse tree of the document. A model for representing and manipulating documents in form of trees is referred to as Document Object Model (DOM). Several DOMs for HTML documents have been defined and are used by different programming environments and applications, but the differences among these DOMs are rather inessential. An example for such a DOM is the so called "Swing DOM", which is part of the Javax Swing Package, a programming package containing useful libraries of Java classes for manipulating HTML documents. A DOM tree of an HTML document represents its hierarchical structure. In particular, the root of a DOM tree of an HTML document represents the entire document, while intermediate nodes of the tree represent intermediate elements such as tables, table rows, and so on. The leaves usually represent terminal (i.e., structurally indecomposable) data such as atomic text items or images. Each node of an HTML DOM tree can be associated with certain attributes that describe further features of the represented element (such as style, font size, color, indentation, and so on).

[0007] One important disadvantage of HTML is its main orientation as formatting and layout language, but not as data description language. In fact, the nodes of an HTML DOM tree are predefined elements that basically correspond to HTML formatting tags. Therefore it is difficult and very cumbersome (if at all possible) to query an HTML document using query languages in order to automatically extract useful and hierarchically structured information. Given that HTML provides no data description nor any tagging or labeling of data except for formatting purposes, it is often difficult and sometimes impossible to formulate a query that allows a system to distinguish, say, a first name from a family name or from an address appearing in the same HTML document. For this reason, web documents which are intended to be queried or processed by software applications are hierarchically organized using display-independent markup. Such, so-called semistructured, documents are often more suitably formatted in markup languages such as XML (eXtensible Markup Language).

[0008] XML is a standard for data exchange adopted by the World Wide Web Consortium (W3C) in 1999. The main advantage of XML is that it allows a designer of a document to label data elements using freely definable tags. The data elements can be organized in a hierarchy with arbitrarily deep nesting. Optionally, an XML document can contain a description of its grammar, the so-called Document Type Definition (DTD). An XML document or a set of such documents can be regarded as a database and can be directly processed by a database application or queried via one of the new XML query languages such as XSL, XSLT, XPath, XPointer, XQL, XML-QL, XML-GL, and XQuery. Moreover, powerful languages such as XSLT do not just serve for defining queries but can transform their output into an appropriate format suitable for further processing, e.g. into an email message or a piece of plain text to be sent to a cellular phone.

[0009] Note that most Web pages are still formatted in HTML. This is not expected to change soon, even though XML has been attracting a lot of attention. One reason for this may be that, due to the limited syntax of HTML, this language is somewhat easier to learn and to use than XML. Moreover, HTML documents are very often designed by laypersons, i.e., non-programmers, who are not suitably trained in the logical skills to systematically define data structures as required by XML and who therefore feel more comfortable using widely available editors and tools such as Dreamweaver, Frontpage or HotMetal in order to create HTML Web pages in a "what you see is what you get" manner. Furthermore, document designers often do not anticipate the need of others to process their documents automatically but mainly have a human beholder of their Web pages in mind. Finally, many companies deliberately refrain from offering data in XML format in order to obstruct automated processing of the published data by others.

[0010] On the other hand, there is a tremendous need for automating Web data processing and monitoring tasks. In the Business to Business (B2B) context it is often of crucial importance to a company to be immediately informed about price changes on the Web site of a competitor, about new public offerings or tenders popping up on a Web site of some corporate or government institution, or about changes in exchange rates, share quotas, and so on. Similarly, individuals can heavily profit from automated web monitoring. For example, imagine one would like to monitor interesting notebook offers at electronic auctions such as eBay (http://www.ebay.com). A notebook offer is considered interesting if, say, its price is below GBP 3000 (Great Britain Pounds), and if it has already received at least two offers by others. The eBay site allows one to make a keyword search for "notebook" and to specify a price range in USD (US Dollars) only. More complex queries such as the desired one cannot be formulated. Similar sites do not even give restricted query possibilities and leave you with a large number of result records organized in a huge table split over dozens of Web pages. One has to wade through all these records manually, because of no possibility to further restrict the result.

[0011] All these problems could be solved efficiently if the relevant parts of the respective source data were made available in XML format.

[0012] Thus, there is a significant need for methods and systems that are able to perform some or all of the following four tasks:

[0013] 1. Identify and isolate relevant parts or elements of (possibly remote) Web pages.

[0014] 2. Automatically extract the relevant parts of Web documents even though the respective documents may continually change contents and even (to a certain extent) structure.

[0015] 3. Suitably transform the extracted parts into XML to make them available for querying and further processing.

[0016] 4. Assist a developer or application programmer in creating and using programs or systems able to perform tasks (1), (2), and (3). A subtask of central importance is supporting the developer in the definition of relevant extraction patterns. Extraction patterns serve to identify information of one particular kind.

[0017] Tasks (1) and (2) together are often referred to as "Web information extraction" or also as "data extraction from the Web". Task (3) is referred to as "translation into XML". Note that a useful and meaningful translation into XML does not merely consist of reformatting an HTML document, according to the XML standard, but also in enriching the document with structural information and data description tags. The translated document will thus contain some structural and descriptive information that is not present in the original document.

[0018] A program specifying how the above tasks (1), (2), and (3) are to be performed is referred to as "wrapper" or "wrapper program". Wrappers may be written in a publicly available multi-purpose (procedural) programming language with primitives able to manipulate web resources (such as Java, C++, or Perl) in which case they can be compiled (or interpreted) and executed in a regular fashion using standard software resources (just as other programs in that language). Alternatively, wrappers can be formulated in some dedicated or proprietary high-level declarative language that needs a specially constructed interpreter or compiler.

[0019] A program or system that automatically or semi-automatically generates wrappers is referred to as "wrapper generator". A software tool that merely assists a human in manually programming and testing a wrapper, is referred to as "wrapper programming assistant". Task (4) can be solved by means of a wrapper generator, by means of a wrapper programming assistant, or by some hybrid tool.

[0020] 3. Desirable Properties of Methods and Systems for Wrapper Generation and Web Information Extraction

[0021] It is desirable to enable a very large number of computer users, including laypersons having no programming skills or expertise on HTML or similar formats, to create robust wrappers using a small number of sample pages, such that these wrappers are then able to automatically extract relevant and complex parts of Web pages and to translate the extracted information automatically into XML. With respect to this goal, a method or system for wrapper generation, Web data extraction, and translation into XML should fulfill at least some of the following properties:

[0022] High expressive power. The system should enable the definition of complex, structurally organized patterns from Web pages and translate the corresponding data (the so-called pattern instances) into a corresponding hierarchically structured XML document.

[0023] User friendliness. It should allow a human wrapper designer to design, program, or specify wrappers in a very short time. The user interaction should be efficient and suitable for constructing wrappers and specifying the XML translation.

[0024] Good learnability. The learning effort for being able to understand the method or use the system should be as small as possible. The method or system should be accessible to, and usable by, a layperson who is not a programmer or a computer scientist and has no programming experience. In the best case, it should not even require knowledge of HTML or XML, which means that a designer is never directly confronted with HTML or XML code (even the XML output can be displayed using nested tables).

[0025] Good visual support. It should offer the wrapper designer a GUI (graphical user interface) for specifying wrappers or XML translations. Ideally, the visual user interface allows a wrapper designer to work directly on displayed sample source documents (e.g. on HTML Web pages) and supports a purely visual way of defining extraction patterns.

[0026] Ease of accessibility and installation. The system should be widely accessible and should not require particular installation efforts. Ideally, the system provides an interface so that it can be used through a standard Web browser such as Netscape or Internet Explorer.

[0027] Parsimony of samples. In case the method or system uses sample pages as a basis for constructing wrappers, it should require only very few of these (a single one at best) for most applications. The reason is that, in many cases, a wrapper designer has only one or very few sample pages at hand. For example, if we decide to construct a wrapper to translate the homepage of the United States Patent and Trademark Office (USPTO) available at http://patents.uspto.gov/ into XML (e.g. in order to monitor upcoming new information and press releases and new federal register notes), then, at the time of wrapper construction, one instance of this page will be available at hand, namely, the current page. It should be possible to construct a wrapper based on this single instance which works well for future versions of this page.

[0028] Robustness. Wrappers are generally aimed at extracting information from similarly structured Web pages of changing content. It is obvious that wrappers risk failing to deliver a correct result if the structure of the source documents changes. However, a good wrapper is expected to have a certain degree of robustness, i.e., insensibility to minor structural changes. The method or system should allow the generation of fairly robust wrappers.

[0029] Runtime Efficiency. The method should provide efficient algorithms and the system should implement these algorithms efficiently such that the system becomes usable in practice and is highly scalable. (This is, of course, a general requirement to be fulfilled by almost all software methods and systems).

[0030] Smooth XML Interface. The method or system should provide a smooth and user-friendly way of translating the extracted data into XML in order to make it accessible to further processing, e.g. via XML query engines or well-known transformation languages such as XSLT. Ideally, the translation to XML is done automatically on the basis of the information gathered from the designer during the process of defining extraction patterns.

[0031] Clearly, a method and system fulfilling all these requirements is highly desirable and useful. In the paper "Content Integration for E-Business" (M. Stonebraker and J. M. Hellerstein "Content Integration for E-Business", Proceedings of SIGMOD 2001) some of the challenges needed for content integration are presented:". In short, a powerful, easy-to-use tools, is needed to address the broad challenges of cleaning, transforming, combining and editing content. These tools must be targeted at typical, non-technical content managers. In order to be useable the tools must be graphical and interactive, so that content managers can see the data as it is mapped. Any automated techniques must be made clearly visible, so that domain experts can edit and adjust the results. The development of semi-automatic content mapping and integration tools represents a new class of systems challenges, at the nexus of query processing, statistical and logical mapping techniques, and data visualization". The disclosed teachings are aimed at realizing some of the advantages and overcoming some of the disadvantages noted herein.

[0032] 4. References

[0033] The following documents provide background information helpful in understanding this disclosure, and to that extent, they are incorporated herein by reference. They are referred to, using the abbreviated notations shown below, in subsequent discussions to indicate specific relevance wherever necessary.

1
(1) U.S. Patent Documents [U1] U.S. Pat. No. 5,826,258 Gupta et al. 1998
[U2] U.S. Pat. No. 5,841,895 Huffmann 1998
[U3] U.S. Pat. No. 5,860,071 Ball et al. 1999
[U4] U.S. Pat. No. 5,898,836 Freivald et al. 1999
[U5] U.S. Pat. No. 5,913,214 Madnick et al. 1999
[U6] U.S. Pat. No. 5,983,268 Freivald et al. 1999
[U7] U.S. Pat. No. 6,102,969
Christianson et al. 2000
[U8] U.S. Pat. No. 6,128,655 Fields et al. 2000

(2) Other Publications

[0034] [S1] M. Stonebraker and J. M. Hellerstein "Content Integration for E-Business", Proceedings of SIGMOD 2001

[0035] [S2] G. Huck, P. Fankhauser, K. Aberer, E. Neuhold "Jedi: Extracting and Synthesizing Information from the Web", Proceedings of the 3rd IFICS International Conference on Cooperative Information Systems, CoopIfs'98, IEEE Computer Science Press, ISBN 0-8186-8380-5, pp. 32-43, 1998

[0036] [S3] W. May and G. Lausen, "Information Extraction from the Web", Technical Report No. 136, Institut fuer Informatik, Albert-Ludwigs-Universitaet, 79110 Freiburg, Germany

[0037] [S4] G. Mecca, P. Atzeni, "Cut and Paste", Journal of Computer and System Sciences, Vol. 58, No. 3, pp. 453-482, 1999

[0038] [S5] D. Konopnicky and O. Shmueli "Information Gathering in the WWW: The W3QS System", ACM Transactions on Database Systems, Vol.23 No.4, 1998

[0039] [S6] S.-J. Lim, and Y.-K. Ng "WebView: A Tool for Retrieving Internal Structures and Extracting Information from HTML Documents" Proceedings of the Sixth International Conference on Database Systems for Advanced Applications (DASFAA), Apr. 19-21 1999, Hsinchu, Taiwan, IEEE Computer Society Press, ISBN 0-7695-0084-6, pp. 71-80

[0040] [S7] F. Douglis, T. Ball, Y.-F. Chen, E. Koutsofio, "The AT&T Internet Difference Engine: Tracking and Viewing Changes on the Web", World Wide Web Vol.1 No.1, pp. 27-44, 1998

[0041] [S8] L. Liu, C. Pu, and Wei Tang, "Continual Queries for Internet Scale Event-Driven Information Delivery", IEEE Transactions on Knowledge and Data Engineering, Vol. 11, No. 4, July/August 1999

[0042] [S9] Nicholas Kushmerick, Daniel S. Weld, Robert B. Doorenbos, "Wrapper Induction for Information Extraction", Proceedings of the Fifteenth International Joint Conference on Artificial Intelligence, IJCAI 97, Nagoya, Japan, Aug. 23-29, 1997. Morgan Kaufmann, 1997, Vol.1, pp. 729-737, 1997

[0043] [S10] C.-N. Hsu, M. Tsung Dung "Generating Finite-State Transducers for Semistructured Data Extraction from the Web", Information Systems, Vol.23, No. 8, pp.521-538, 1998

[0044] [S11] I. Muslea, S. Minton, and C. A. Knoblock "Hierarchical Wrapper Induction for Semistructured Information Sources", Journal of Autonomous Agents and Multi-Agent Systems, Vol. 4, No. 1/2, March 2001, pp. 93-114, 2001

[0045] [S12] B. Adelberg, "NoDoSe--A Tool for Semi-Automatically Extracting Structured and Semistructured Data from Text Documents", Proceedings of the ACM SIGMOD International Conference on Management of Data, Jun. 2-4, 1998, Seattle, Wash., USA. ACM Press, ISBN 0-89791-995-5, pp. 283-294, 1998

[0046] [S13] Stephen W. Liddle, Douglas M. Campbell, Chad Crawford "Automatically Extracting Structure and Data from Business Reports", Proceedings of the 1999 ACM CIKM International Conference on Information and Knowledge Management, Kansas City, Mo., USA, Nov. 2-6, 1999, ACM Press 1999, ISBN 1-58113-146-1, pp. 86-93, 1999

[0047] [S14] S. B. Huffman, "Learning information extraction patterns from examples" in S. Wermter, E. Riloff, and G. Sheler, eds., "Connectionist, Statistical, and Symbolic Approaches to Learning for Natural Language Processing", pp. 246-60. Springer-Verlag, 1996

[0048] [S15] David W. Embley, L. Xu, "Locating and Reconfiguring Records in Unstructured Multiple-Record Web Documents" Informal Proceedings of the WebDB 2000 International Workshop on the Web and Databases, Texas, USA, May 18-19, 2000, in conjunction with ACM PODS/SIGMOD 2000,pp. 123-128, 2000),

[0049] [S16] H. Davulcu, G. Yang, M. Kifer, and I. V. Ramakrishnan "Computational Aspects of Resilient Data Extraction from Semistructured Sources" Proceedings of the Nineteenth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, May 15-17, 2000, Dallas, Tex., USA. ACM 2000, ISBN 1-58113-214-X, pp. 136-144, 2000),

[0050] [S17] H. Davulcu, G. Yang, M. Kifer, and I. V. Ramakrishnan "Design and Implementation of the Physical Layer in WebBases: The XRover Experience", Proceedings of Computational Logic--CL 2000, First International Conference, London, UK, 24-28 Jul., 2000, Lecture Notes in Computer Science 1861 Springer 2000, ISBN 3-540-67797-6 pp. 1094-1105
2000

[0051] [S18] Naveen Ashish, Craig A. Knoblock, "Semi-Automatic Wrapper Generation for Internet Information Sources" Proceedings of the Second IFCIS International Conference on Cooperative Information Systems, CoopIS'97, Kiawah Island, South Carolina, USA, Jun. 24-27, 1997, IEEE Computer Science Press 1997, ISBN 0-8186-7946-8, 1997

[0052] [S19] L. Liu, C. Pu, and W. Han "XWRAP: An XML-enabled Wrapper Construction System for Web Information Sources", Proceedings of the 16th International Conference on Data Engineering San Diego Calif., Feb. 28-Mar. 3, 2000, IEEE Computer Society Press, pp. 611-621, 2000

[0053] [S20] A. Sahuguet, F. Azavant, "Building Intelligent Web Applications Using Lightweight Wrappers", Data and Knowledge Engineering, Vol.36, pp. 283-316, 2000

[0054] [S21] B. Ribeiro-Neto and A. H. F. Laender and A. S. da Silva, "Extracting Semi-Structured Data Through Examples", Proc. of CIKM, 1999

[0055] [S22] J. R. Gruser, L. Raschid, M. E. Vidal, and L. Bright "Wrapper Generation for Web Accessible Data Sources", Proceedings of CoopIS 1998

[0056] [S23] R. Baumgartner, S. Flesca, G. Gottlob "Visual Web Information Extraction with Lixto", Proc. VLDB 2001

[0057] [S24] R. Baumgartner, S. Flesca, G. Gottlob "The Elog Web Extraction Language", Proc. LPAR 2001

[0058] [S25] J. Hammer, H. Garcia-Molina, J. Cho, R. Aranha, and A. Crespo "Extracting Semistructured Information from the Web", in Proceedings of the Workshop on Management of Semistructured Data. Tucson, Arizona, May 1997

[0059] [S26 ] A. O. Mendelzon, G. A. Mihaila, and T. Milo "Querying the World Wide Web", Journal of Digital Libraries, Vol. 1 No.1, pp.54-67, 1997

[0060] [W1] Orsus Solutions, "iGlue Wireless for Integrating Web to Wireless Business Processes", white paper, Orsus Solutions, Ltd., 1250
Oakmead Parkway #236, Sunnyvale, Calif. 94088, Catalog Number WPIGS 2.0/00, 2000

[0061] [W2] Orsus Solutions "Enabling e-Business with Business Process Integration" Orsus Solutions, Ltd., 1250 Oakmead Parkway #236, Sunnyvale, Calif. 94088, white paper, Catalog Number WPIGW 1.6/00, 2000

[0062] [B1] S. Ceri, G. Gottlob and L. Tanca, "Logic Programming and Databases", Surveys in Computer Science, Springer Verlag, 1990, ISBN 3-540-51728-6

[0063] [B2] Bernhard Thalheim, "Entity Relationship Modeling: Foundations of Database Technology", Springer, ISBN 3540654704

[0064] [B3] J. E. Hopcroft and J. Ullman: Introduction to Automata Theory, Languages, and Computation, Addison-Wesley, ISBN: 0201441241

[0065] [B4] Martin Fowler, "UML Distilled: A Brief Guide to the Standard Object Modeling Language", Addison-Wesley, ISBN 020165783X

2
[H1] http://www.peacefire.org/tracerlock/ Tracerlock Tool [H2] http://www.netmind.com Mind-It [H3] http://www.cc.gatech.edu/projects/disl/XWRAP/xwrap.html X-Wrap Slides [H4] http://www.savarese.org/oro/software/OROMatcher1.1.html OroMatcher [H5] http://caesius.com Caesius WebQL [H6] http://www.x-fetch.com Republica X-Fetch [H7] http://www.wisosoft.com Wisosoft InfoScanner [H8] http://www.kapowtech.com Kapowtech RoboSuite

5. Description of Related Art

[0066] Prior Methods and systems related to those claimed here can be divided into four categories:

[0067] 1. Wrapper programming languages and environments,

[0068] 2. Web change monitoring and notification tools,

[0069] 3. Machine learning approaches, and,

[0070] 4. Supervised interactive wrapper generation.

[0071] These approaches are discussed in detail.

a) Wrapper Programming Languages and Environments

[0072] Wrapper programming languages are languages that allow a programmer to write a wrapper program. In the widest sense of this definition, every general-purpose programming language with means for accessing and manipulating internet resources such as Web pages is also a wrapper programming language. In this sense, programming languages such as Java, Perl, Python, TCL, etc. can be considered wrapper programming languages. Similarly, programming environments facilitating program generation in these languages and testing can be considered wrapper programming environments. There are, however, several languages and environments that are more specifically suited for wrapper programming because they include various special primitives and functions tailored for peculiar issues and needs of wrapper programming. An example is Jedi (Java based Extraction and Dissemination of Information) by Huck et al. (G. Huck, P. Fankhauser, K. Aberer, E. Neuhold "Jedi: Extracting and Synthesizing Information from the Web", Proceedings of the 3rd IFICS International Conference on Cooperative Information Systems, CoopIfs'98, IEEE Computer Science Press, ISBN 0-8186-8380-5, pp. 32-43, 1998). Jedi is a wrapper programming language that uses attributed grammars in order to deal with ambiguities in source documents. Another example is the Florid language (W. May and G. Lausen, "Information Extraction from the Web", Technical Report No. 136, Institut fuer Informatik, Albert-Ludwigs-Universitaet, 79110
Freiburg, Germany) which consists of a complex logic programming language enriched by several features facilitating wrapper programming. Another well-known approach is TSIMMIS (J. Hammer, H. Garcia-Molina, J. Cho, R. Aranha, and A. Crespo "Extracting Semistructured Information from the Web", in Proceedings of the Workshop on Management of Semistructured Data. Tucson, Arizona, May 1997), where the extraction process is based on a procedural program augmented by the possibility of identifying parts of a tree structured document via path templates containing wildcards. A very simple and expressively limited method of wrapper programming based on regular grammars can be found in U.S. Pat. No. 5,826,258.

[0073] An advanced and rather expressive procedural wrapper programming language is EDITOR (G. Mecca, P. Atzeni, "Cut and Paste", Journal of Computer and System Sciences, Vol. 58, No. 3, pp. 453-482, 1999), which uses primitives such as "copy", "search", and replacement, borrowed from text editors, as well as special pattern matching algorithms. U.S. Pat. No. 6,102,969 describes a wrapper programming language called WDL (Wrapper Description Language) that facilitates the semantic description of queries, forms, and pages by using a declarative description format that combines features from grammars and regular expressions.

[0074] Finally, there are SQL-like query languages for the Web such as W3QL (D. Konopnicky and O. Shmueli "Information Gathering in the WWW: The W3QS System", ACM Transactions on Database Systems, Vol.23 No.4, 1998), and WebSQL (A. O. Mendelzon, G. A. Mihaila, and T. Milo "Querying the World Wide Web", Journal of Digital Libraries, Vol. 1 No.1, pp.54-67, 1997), and the SQL extension described in U.S. Pat. No. 5,913,214.

[0075] The system WebQL (http://caesius.com) is similar in syntax to SQL. It allows for crawling links and filling out forms, but is restricted to HTML output.

[0076] The system WebView (S.-J. Lim, and Y.-K. Ng "WebView: A Tool for Retrieving Internal Structures and Extracting Information from HTML Documents" Proceedings of the Sixth International Conference on Database Systems for Advanced Applications (DASFAA), Apr. 19-21 1999, Hsinchu, Taiwan, IEEE Computer Society Press, ISBN 0-7695-0084-6, pp. 71-80) is merely a parser for HTML documents (or hierarchies of URL-linked HTML documents) transforming them into a semi-structured data graph (SDG) which bears some similarity to a DOM tree and which can be subsequently queried. This system has almost no proper pattern location and extraction capabilities. The burden of data extraction is shifted to the user who queries the SDG via a complex SQL-like query language.

[0077] Generally (with a few exceptions), wrapper programming languages have a high expressive power. In particular, many of these languages are Turing complete in the sense that they can express all recursive functions over Web pages. (In addition, they have facilities for navigating the Web and for performing various operations on the Internet that are outside the scope of our analysis). They allow a programmer to implement rather robust wrappers, are fairly or even well accessible, often well supported by programming environments and often easy to install. However, there are serious drawbacks of wrapper programming languages. In particular, using wrapper programming languages is very hard for a non-expert, it is tedious, and requires, in addition to programming skills and expertise in the particular language, the knowledge of markup languages such as HTML. Testing and debugging hand written wrapper programs is difficult and time consuming. Wrapper programming languages are hard to learn for people not skilled in programming and offer no (or at most only very limited) visual support. Rather than working visually on a sample page, a programmer has to consider the HTML code of that pages while writing the program. This is often much more intricate than what one sees on the screen. Finally, wrapper programming languages offer no (or no efficient) support for translation into XML. The translation must be explicitly programmed or specified using programming language features such as grammars, which requires additional skills that non computer-scientists usually do not have. In summary, wrapper programming languages are not a suitable approach towards our goal of enabling a large number of non-experts to create wrappers for information extraction and translation into XML.

b) Web Change Monitoring and Notification Tools

[0078] There are tools and techniques for automatically monitoring Web pages for updates and notifying a user in case. Most of these tools just check whether the Web page has changed at all or search whether some user specified keyword happens to appear in an updated version of the page. Some of these tools apply such simple checks to multiple pages obtained through a Web search engine or to postings appearing in newsgroups (for example the TracerLock tool accessible via http://www.peacefire.org/trace- rlock/). Such tools have no substantial data extraction capabilities and can be disregarded as relevant prior art in the context of the disclosed invention.

[0079] Some systems such as the AT&T Internet Difference Engine AIDE (F. Douglis, T. Ball, Y.-F. Chen, E. Koutsofio, "The AT&T Internet Difference Engine: Tracking and Viewing Changes on the Web", World Wide Web Vol.1
No.1, pp. 27-44, 1998) related to U.S. Pat. No. 5,860,071 use a differencing algorithm in order to inform a user about those portions of text of a Web page that are new with respect to a previous version and those portions of text that have disappeared. However, no user-specified information extraction is possible.

[0080] Only very few Web change monitoring tools offer additional, somewhat more significant information extraction functions. The probably most relevant example is Mind-it (see http://www.netmind.com) in relationship with U.S. Pat. No. 5,898,836. In this approach, HTML documents are divided into sections delimited by HTML tags. On a visualized Web page a user can indicate those particular sections that should be monitored for changes and those which should be excluded from the monitoring process. The user can interactively select the relevant portions of a source document by dragging a highlight with a mouse over the text to be selected. Alternately, the user can select whole paragraphs by triple-clicking anywhere inside these sections, or double-clicking on a single word or by other similar actions. The system is able to adjust a region highlighted by a user to the smallest section containing it. The user has also the possibility to declare that a particular piece of text should be monitored by putting this piece of text into a special box via drag and drop actions. Change notifications are sent to the user only with respect to the selected sections or portions of text. Alternatively, the user may exclude some sections, and only the remaining sections will be monitored for change.

[0081] In U.S. Pat. No. 5,983,268, ways to identify the location of selected numeric values in a source document (e.g. a Web page formatted in HTML) are described. The preferred way to identify and remember the location of the selected numeric value is to store several characters immediately preceding and/or following the selection. These characters are known as "markers".

[0082] U.S. Pat. No. 6,128,655 contains (among other issues) the description of a simple form-based standard filtering mechanism for HTML Web pages. This mechanism allows the selection of certain sections (called "components") of a Web page by specifying sufficiently long start and end segments of the HTML code of these sections. The method is actually not used in the context of monitoring tools, but its function of identifying sections of Web pages is similar to features of such tools.

[0083] In any case, the pattern definition and extraction capabilities of (even advanced) Web change monitoring tools are limited to very simple "flat" patterns, such as paragraphs, portions of text, sections, and numeric values. No complex hierarchical data structures can be identified or extracted. The structural information contained in the parse tree of the document is not sufficiently exploited. Moreover, selected items, sections, or portions of text are merely identified via their textual surroundings (markers), their proper textual content, their surrounding tags, or by keyword-related issues. This identification is often not sufficiently robust. For example, in order to identify a numeric value location, a system-generated marker might contain some unsuitable unrelated element belonging to a different tree path. This element itself may be subject to frequent changes. If it changes, the information about the numeric value location is likely to be lost. Often one would like to monitor (or extract) many similar items of a specific type, for example all records of an eBay auction page containing a certain keyword (including future records of the same type). Even such simple requests cannot be expressed by current Web change monitoring tools. Finally (and not surprisingly), Web change monitoring tools do not have facilities for translating structured content into XML. In summary, these tools are far from fulfilling our above described goals and requirements. On the positive side, Web change monitoring and notification tools such as Mind-it are very easy to use, require almost no learning effort, have an excellent visual support, are easily accessible over the Internet, and are algorithmically rather efficient.

[0084] Note that there are Web change monitoring and notification tools that have much better information extraction capabilities because they rely on a separate and independent supervised wrapper generator. An example is the continual query system OpenCQ (L. Liu, C. Pu, and Wei Tang, "Continual Queries for Internet Scale Event-Driven Information Delivery", IEEE Transactions on Knowledge and Data Engineering, Vol. 11, No. 4, July/August 1999) whose capabilities for wrapping portions of HTML pages are mainly due to the supervised wrapper generator XWRAP which constitutes a separate piece of software. We excluded such Web change monitoring tools from our discussion and analysis here because the class of corresponding wrapper generators (including XWRAP) will be discussed in the section on supervised interactive wrapper generation below.

c) Machine Learning Approaches

[0085] Inductive machine learning, i.e., learning from examples and counterexamples (i.e. positive and negative examples), is a sub-discipline of Computer Science which has recently found many applications in the area of information extraction and wrapper generation. The aim is to induce document structure information and extraction patterns and automatically generate extraction rules and procedures based on a set of similarly structured sample Web documents or on a set of similar items occurring in a Web page. There is a variety of inductive wrapper generators both for arbitrary text documents (including HTML documents as a special case) and for HTML documents. These approaches mainly differ in the applied machine learning technique for inducing grammars or automata able to recognize structural properties and relevant data items in web documents. Many publications refer to machine-learning related theoretic problems; only a few describe practically working systems with an interactive environment for visually selecting examples and counterexamples of pattern instances, and for highlighting pattern instances induced by the system. An example for such a system is the Wien wrapper induction environment (Nicholas Kushmerick, Daniel S. Weld, Robert B. Doorenbos, "Wrapper Induction for Information Extraction", Proceedings of the Fifteenth International Joint Conference on Artificial Intelligence, IJCAI 97, Nagoya, Japan, Aug. 23-29, 1997. Morgan Kaufmann, 1997, Vol.1, pp. 729-737, 1997). Other well-known examples include SoftMealy (C.-N. Hsu, M. Tsung Dung "Generating Finite-State Transducers for Semistructured Data Extraction from the Web", Information Systems, Vol.23, No. 8, pp.521-538, 1998) and Stalker (I. Muslea, S. Minton, and C. A. Knoblock "Hierarchical Wrapper Induction for Semistructured Information Sources", Journal of Autonomous Agents and Multi-Agent Systems, Vol. 4, No. 1/2, March 2001, pp. 93-114, 2001). Systems particularly well suited for inductive learning from plain text documents (but also able to be applied to HTML documents) are:

[0086] NoDoSe (B. Adelberg, "NoDoSe--A Tool for Semi-Automatically Extracting Structured and Semistructured Data from Text Documents", Proceedings of the ACM SIGMOD International Conference on Management of Data, Jun. 2-4, 1998, Seattle, Wash., USA. ACM Press, ISBN 0-89791-995-5, pp. 283-294, 1998).

[0087] The system by Liddle et al. (Stephen W. Liddle, Douglas M. Campbell, Chad Crawford "Automatically Extracting Structure and Data from Business Reports", Proceedings of the 1999 ACM CIKM International Conference on Information and Knowledge Management, Kansas City, Mo., USA, Nov. 2-6, 1999, ACM Press 1999, ISBN 1-58113-146-1, pp. 86-93, 1999).

[0088] The LIEP system (S. B. Huffman, "Learning information extraction patterns from examples" in S. Wermter, E. Riloff, and G. Sheler, eds., "Connectionist, Statistical, and Symbolic Approaches to Learning for Natural Language Processing", pp. 246-60. Springer-Verlag, 1996.) See also U.S. Pat. No. 5,841,895.

[0089] The following papers describe approaches using example-based generalization techniques related to learning.

[0090] Embley and Xu 2000 (David W. Embley, L. Xu, "Record Location and Reconfiguration in Unstructured Multiple-Record Web Documents" Informal Proceedings of the WebDB 2000 International Workshop on the Web and Databases, Texas, USA, May 18-19, 2000, in conjunction with ACM PODS/SIGMOD 2000,pp. 123-128, 2000), where a model for locating records via heuristic algorithms based on a domain-specific ontology is presented.

[0091] Davulcu et al. 2000a (H. Davulcu, G. Yang, M. Kifer, and I. V. Ramakrishnan "Computational Aspects of Resilient Data Extraction from Semistructured Sources" Proceedings of the Nineteenth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, May 15-17, 2000, Dallas, Tex., USA. ACM 2000, ISBN 1-58113-214-X, pp. 136-144, 2000), where methods for generalizing extraction expressions are presented.

[0092] Davulcu et al. 2000b (H. Davulcu, G. Yang, M. Kifer, and I. V. Ramakrishnan "Design and Implementation of the Physical Layer in WebBases: The XRover Experience", Proceedings of Computational Logic--CL 2000, First International Conference, London, UK, 24-28 Jul., 2000, Lecture Notes in Computer Science 1861 Springer 2000, ISBN 3-540-67797-6
pp. 1094-1105 2000), where the system XRover is presented which (among other features) can extract certain flat patterns from Web Documents based on generalized extraction expressions as described in the above-cited reference.

[0093] While wrapper generation by machine learning (or related techniques of automatic generalization) is an appealing topic for theoretical and experimental research, it is rarely used in practice. Inductive learning has several severe drawbacks in the context of wrapper generation. In particular, in order to induce a working wrapper, one often needs a large number of sample Web pages or a large number of sample target items on a Web page. Obtaining such samples is in many cases impossible, e.g., because there is only one "current" page available whose content changes slowly. Producing fake samples via self-made perturbations of existing data material (as suggested. in Davulcu et al. 2000a) is tedious and, moreover, can lead to unrealistic samples in case this task is automated. Our criterion of sample parsimony is thus violated.

[0094] Note also that information extraction approaches based on machine learning have a rather limited expressive power. In fact, efficient learning algorithms are best suited for flat repetitive patterns and cannot be easily applied to induce complex hierarchical data structures. Moreover, only those features present in the samples can be learned. For similar reasons, wrappers generated by machine learning are often not quite robust. For example, if a wrapper generator is supposed to identify a certain pattern by the property that this pattern occurs immediately behind a date item, and if there are hundred sample pages that unluckily have date items of the same year, e.g. 2001, then the system risks to erroneously induce that the desired pattern must occur after the number 2001 and will not work on pages with dates of the following year. Redressing such situations requires substantial human interaction and supervision, or the application of ontological reasoning tasks that work in few cases only.

[0095] The user-friendliness of machine learning tools can be rated medium at best. In fact, the selection of samples and the hand-labeling of a sufficient number of examples and counterexamples in Web pages as well as the correction of certain errors can be rather tedious. Learning to use such systems is generally not too difficult, but requires the acquisition of some basic skills of selecting samples and of estimating their quality. Current systems with visual support are not directly usable through Web browsers but require the installation of specific software. Due to the complexity of the underlying algorithms, machine learning tools are not quite runtime efficient in case of slightly more complex patterns.

[0096] Several wrapper learning tools offer visual support, mainly limited to visual facilities for selecting pattern instances, labeling data, and for displaying induced results. The most advanced visual support is offered by NoDoSe, where visual and interactive means for learning structured data and translation into XML are given. This works very nicely on plain text documents. Unfortunately, NoDoSe does not work that well on HTML files. In fact, processing HTML files with NoDoSe requires a wrapper designer to perform a pre-processing step that transforms the HTML file into a table. Then the wrapper developer works visually on that table rather than on the original document. Unfortunately, in many cases, the table can become extremely large and confusing. For example, if NoDoSe is fed with samples consisting of similarly structured eBay pages containing lists of auction items, then the pre-processor transforms these samples into a table with hundreds of columns which is confusing and difficult use. Most wrapper learning tools apart from NoDoSe do not support translation into XML.

[0097] In summary, wrapper generating tools based on machine learning techniques are not suitable tools in the light of the above-stated requirements.

d) Supervised Interactive Wrapper Generation

[0098] This class comprises methods and tools for semi-automatically constructing wrappers of advanced expressive power via an interactive process that is closely supervised by a human. These approaches are based on the assumption that a human who intends to design a wrapper knows best what kind of patterns are to be extracted from Web documents and how such patterns should be characterized so to make the characterization robust w.r.t. future changes of the document structure and content. Thus, supervised wrapper generators are much more based on the "teaching" paradigm rather than on self-learning aspects.

[0099] An early and simple system for semi-automatically generating wrappers for structured internet sources was designed by Ashish and Knoblock (Naveen Ashish, Craig A. Knoblock, "Semi-Automatic Wrapper Generation for Internet Information Sources" Proceedings of the Second IFCIS International Conference on Cooperative Information Systems, CoopIS'97, Kiawah Island, S.C., USA, Jun. 24-27, 1997, IEEE Computer Science Press 1997, ISBN 0-8186-7946-8, 1997). This system guesses a page structure and constructs corresponding labeled patterns by automatically decomposing a page according to the explicit nesting hierarchy of the document determined by section headings and formatting information provided by HTML (characterized by font size, etc.) or indentation of text items. The user interaction during this structuring phase is limited to correcting certain erroneously recognized tokens that are highlighted on sample documents. The system outputs a grammar describing the nesting hierarchy of sections. This grammar then allows the system to parse similar documents, associating titles to sections and enabling thus a user to formulate simple queries. The expressive power of this approach is extremely limited. The approach assumes that some explicit description (or tagging) of the relevant data is already contained in the sample document (in form of names of section headers) or at least unambiguously determined by simple attributes of headers (such as their font size), which is very often not the case. There is no possibility of locating desired extraction items by more complex contextual conditions. Moreover, no user-defined complex patterns can be created that are not already determined by basic formatting issues and paragraph headings. Accordingly, the constructed wrappers are not robust, except for sets of input documents having exactly the same formatting structure. No translation into XML is offered.

[0100] Another early approach of supervised wrapper generation was developed in "Wrapper Generation for Web Accessible Data Sources" by J. R. Gruser, L. Raschid, M. E. Vidal, and L. Bright at the Proceedings of CoopIS 1998. A translation to a semistructured data format can be generated.

[0101] The following systems are more advanced and can be considered the prior art most related to our own invention.

[0102] The XWRAP system by Ling Liu, Carlton Pu and Wei Han (L. Liu, C. Pu, and W. Han "XWRAP: An XML-enabled Wrapper Construction System for Web Information Sources", Proceedings of the 16th International Conference on Data Engineering San Diego Calif., February 28-March 3,2000, IEEE Computer Society Press, pp. 611-621, 2000). See also the demonstration slides (XWrap demonstration slides, available at http://www.cc.gatech.edu- /projects/disl/XWRAP/xwrap.html). The system consists of a toolkit for interactively developing wrappers and for translation into XML. According to the description in the latter reference, "XWrap normalizes the Web page into a tree structure using a source-specific parser. The region extraction wizard allows the developer to specify rules to identify important regions. Region properties, such as region range, are based on tree-path regular expressions of selected nodes in the tree. In order to reach a particular node in the tree, the developer can simply highlight strings on the HTML page, and XWrap can associate the string to its corresponding tree node. XWrap creates region extraction rules automatically after the developer specifies region properties through the GUI interface. XWrap analyzes three types of regions, table, list, and paragraph. [. . . ] The semantic token extraction allows the developer to specify rules to extract tokens of interest in each region. XWrap derives the hierarchy structure of information of interest from the semantic token extraction rules and region extraction rules. Hierarchy structure extraction rules are composed in XML template format, which directs the template engine to some special placeholders, where data fields should be inserted into the templates. If we want to produce XML data as wrapper results, it will facilitate the wrapper code generation. After we get all the extraction rules, XWrap can compile these rules into a runable program."

[0103] The main drawbacks of this system are:

[0104] The limited expressive power of its pattern definition mechanism. First, the system lacks sufficiently powerful semi-automatic generalization mechanisms that allow a user to specify several similar patterns at once while marking only a single pattern of the desired type. Secondly, the system lacks sufficient visual facilities for imposing inherent (internal) or contextual (external) conditions to an extraction pattern, e.g. "extraction pattern X should appear after a recognized instance of the extraction pattern Y but not before an instance of pattern Z', and so on. Thirdly, the division into the two levels of description "region" and "token" and the automatic hierarchical structure extractor severely limit the ways to define extraction patterns. For example, it is impossible to create disjunctive pattern definitions where a developer specifies a hierarchical pattern in terms of several alternative descriptions.

[0105] A limited visual user interface. The user does not directly select all regions and tags on the browser-displayed Web page but is forced to additionally use other windows (or frames) which display the parse tree of the document. This is rather tedious in case of complex documents and, in addition, implies that the user must be able to understand HTML code.

[0106] The World Wide Web Wrapper Factory toolkit W4F (A. Sahuguet, F. Azavant, "Building Intelligent Web Applications Using Lightweight Wrappers", Data and Knowledge Engineering, Vol.36, pp. 283-316, 2000). This system offers an advanced programming environment for an SQL-like query language called HEL (HTML Extraction Language) for semi-structured documents. Parts of the SQL-like query can be generated using a specialized visual extraction wizard which is limited to returning the full DOM tree path of an element which the user wants to extract and interactively selects by directly pointing to it on the browser-displayed Web page. Note that this is just a support for constructing HEL queries. Except for certain trivial extraction tasks, the wizard is not able to generate a full query. Instead, the full query must be programmed by the user by hand-editing and generalizing tree-paths generated by the wizard and adding further HEL language constructs. The expressive power of the set of queries that can be visually generated (without hand coding) is extremely limited. This implies that a user of the W4F system is required to have both expertise of the HEL language and expertise of HTML. This, in turn, means that, notwithstanding the visual support by the wizard, the W4F system shares the main aforementioned disadvantages of wrapper programming languages. At the cost of necessary programming efforts, HEL is clearly more expressive than the visual pattern definition method of XWRAP. Note, however, that HEL is a rather complex language requiring a tricky use of index variables and fork constructs in order to correctly describe hierarchically ordered and/or compound extraction patterns. The language is hard to learn and hard to use for a person not trained in computer science. W4F also contains a so called "mapping wizard" that helps the user to define a translation of the query output into XML, and a visual interface for testing and refining the wrapper interactively before deployment. However, the testing wizard does not display extracted instances of patterns directly on the browser-displayed input document. Instead, it shows the user the XML output data structure.

[0107] DEByE (B. Ribeiro-Neto and A. H. F. Laender and A. S. da Silva, "Extracting Semi-Structured Data Through Examples", Proc. of CIKM, 1999) is an example-based system relying on bottom-up extraction. Although this kind of extraction has some advantages, programs are very difficult to update. The system is embedded into an information processing environment called WEByE. DEByE relies on "symmetric passages" conditions, which are not that expressive.

[0108] Republica's X-Fetch Wrapper (http://www.x-fetch.com) uses its own script language DEL for generation of extraction rules. One of the major disadvantages of X-Fetch Wrapper is that the tree structure of H TML is completely neglected. Extraction is solely based on regular expressions embedded into this extraction language. DEL programs can either be written with the DEL Editor, which assists the user to write such script files, or alternatively in the Rule Generator. The Rule Generator assists the wrapper developer on very regular pages containing a list of records. On such pages, the designer can label data areas of his choice; however, very often she has to re-edit the program with the DEL Editor. DEL programs can be written for other source formats such as Word documents. The implementation of X-Fetch wrapper is not platform-independent and relies on a Windows architecture.

[0109] Another tool, the RoboSuite of Kapow Technologies (http://www.kapowtech.com) offers fully visual support for wrapper generation. The designer can specify various kinds of options. Extraction rules navigate both the document tree and regular expressions. One of the major drawbacks of the system is that no nested output XML can be generated, but just a flat relational scheme. Additionally, no flexible contextual conditions are available.

[0110] The wisosoft InfoScanner (http://www.wisosoft.com) suffers a similar disadvantage as RoboSuite. Output patterns just form a flat relational scheme, and are not connected to each other in a general tree-structure. Hence, both approaches do not use the full power of XML. In InfoScanner, additionally, the developer is forced to interact with the script program. However, visual support is offered to create such scripts.

[0111] The iGlue software by Orsus Solutions Ltd. (Orsus Solutions, "iGlue Wireless for Integrating Web to Wireless Business Processes", white paper, Orsus Solutions, Ltd., 1250 Oakmead Parkway #236, Sunnyvale, Calif. 94088, Catalog Number WPIGS 2.0/00, 2000) and (Orsus Solutions "Enabling e-Business with Business Process Integration" Orsus Solutions, Ltd., 1250 Oakmead Parkway #236, Sunnyvale, Calif. 94088, white paper, Catalog Number WPIGW 1.6/00, 2000). This software (consisting of the two packages iGlue/Wireless and iGlue/Web) is used for integrating business processes over the Internet and contains, among other features, a graphical tool for extracting content from HTML or XML formatted Web pages. As stated in the first cited white paper, "a developer can point to a Web page and highlight the data to be extracted. The data extraction editor converts the page into XML displaying a hierarchical tree of the XML page, with the selected data highlighted. Using drag and drop, as well as built-in filters, an XQL or XPath query can be built, and the extracted data is stored in a table." The mentioned translation into XML at the beginning of the process is not an extraction task but merely a re-formatting of the HTML source into an equivalent well-formed XML document (in particular, containing all HTML tags of the original document). Therefore the developer needs to understand HTML. For defining the extraction patterns, the developer mainly uses a window (or frame) containing a visualization of the XML tree and another window where the XQL or XPath query can be edited. That is, the extraction pattern is not directly defined by acting on (and interacting with) a browser-displayed image of the source document. As for the W4F system, the visual interactive part of the extraction-pattern definition process is mainly limited to returning a full tree path which must be manually post-processed. Again, the expressive power of those XQL or XPath queries that can be generated by fully interactive actions without manual editing is extremely limited. For most applications, the developer will need to edit an XPath expression or an XQL query. This means that the user is supposed to have familiarity with at least one of these languages. Complex hierarchical patterns can not be defined visually. Contextual conditions on extraction patterns cannot be easily expressed (if at all).

[0112] In summary, existing supervised wrapper generation tools suffer from serious deficiencies. First, their extraction capability is of limited expressive power (at least, when no editing or programming is involved). Second, the visual interfaces of these tools require the developer to work on an explicitly represented DOM tree rather than just on the browser-displayed source document. It follows that these tools require expertise not commonly owned by laypersons.

[0113] The analysis of the various conventional approaches to wrapper generation for semistructured Web documents is roughly summarized in Table 1 (the actual evaluation may pointwise differ for a few single tools, according to the discussion above). This analysis clearly shows that the existing approaches have various drawbacks and do not meet the aforementioned desirable goals and criteria. It should also be clear that a method and system satisfying all these criteria cannot be obtained by merely combining features of the existing methods and systems. For example, the visual interface of Mind-It is good, but this is to be seen in close relationship with the limited expressive power of this tool; the visual pattern definition mechanisms of Mind-it cannot be carried over to a tool that allows a developer to specify more complex extraction patterns. To overcome the problems of prior approaches, substantially new ideas and techniques are necessary.

3TABLE 1
Summary of evaluation of previous approaches. Wrapper Change programming monitoring Supervised Languages and and notification Machine wrapper environments tools learning generation Expressive power good very bad bad bad/medium User friendliness bad good medium medium Ease of learning bad good fair bad/medium Visual support bad good fair medium/fair Access & fair/good good medium medium/fair installation ease Sample parsimony good good very bad good Robustness good medium medium medium/fair Runtime efficiency good good bad/medium good XML translation fair very bad medium fair/good

II. SUMMARY

[0114] The disclosed teachings provide methods, systems and computer-program products for the visual and interactive generation of wrappers for Web pages under the supervision of a human developer, for automatically extracting information from Web pages using such wrappers, and for translating the extracted content into XML. The features included are:

[0115] a method and system for interactively and visually defining information extraction patterns under the supervision of a human developer on the base of visualized sample Web pages;

[0116] a method and system for successively and hierarchically collecting information extraction patterns in order to generate a wrapper;

[0117] a method for logically representing the knowledge about sets of desired extraction patterns which jointly constitute a wrapper specification, as well as an abstract data structure that refines the logical definition and renders it more precise;

[0118] a declarative logic programming language, called Elog, for effectively encoding pattern descriptions (and thus wrappers) in a format that meets said data structure;

[0119] a method and system for executing wrappers (corresponding to a set of previously defined extraction patterns) on local or remote Web pages and thus for automatically extracting relevant information from said Web pages;

[0120] a data representation method for data extracted from Web pages, i.e., for pattern instances;

[0121] a method for defining XML translation rules that specify how extracted content should be translated into XML and for constructing an XML Document Type Definition (DTD) for that output;

[0122] a method and system for effectively translating extracted data into XML format.

[0123] The totality of the disclosed teachings constitutes an integrated and compound technique for specifying extraction patterns, for extracting information corresponding to such patterns, specifying XML translation rules, and translating extracted information into XML, as well as for organizing the overall process. As part of this integrated technique is a corresponding system. The global method and the global system are both referred to as "Lixto" (more specifically, as the Lixto method and the Lixto System). Note that Lixto is not confined to a single embodiment but can be embodied in various ways. It should be clear that while Lixto is discussed in detail, it is only an example implementation and should not be construed to restrict the scope of the claims in any way.

[0124] Further, computer-program products including computer-readable media with instructions that implement the systems and methods disclosed, completely or partially, are also contemplated as being within the overall scope of the disclosed teachings. It should be noted that the media could be anything including but not limited to RAMs, ROMs, hard disks, CDs, tapes, floppy disks, Internet downloads, etc. In short, any medium that can fix all or a subset of the instructions, even for a transient period of time, is considered as a computer-readable media for the purposes of this disclosed teaching. Further the type of computer that can implement is also not restricted to any particular type, but includes personal computers, workstations, mainframes and the like. Also, it can be implemented on a stand-alone computer or in a distributed fashion across a network, including but not limited to, the Internet.

III. BRIEF DESCRIPTION OF THE DRAWINGS

[0125] 1. Architecture Overview: A diagram depicting the overview of the Lixto architecture.

[0126] 2. Architecture Overview: Using the extractor as stand-alone program.

[0127] 3. Pattern-Filter-Diagram: Logical structure of a sample pattern illustrating the general pattern structure.

[0128] 4. Package Architecture of preferred embodiment: Package structure of actual implementation.

[0129] 5. Lixto Screenshots: Vectorized screenshots showing Lixto at work.

[0130] 6. Lixto Screenshots: Vectorized screenshots showing Lixto at work (part 2).

[0131] 7. Table of empirical results: Evaluation of Lixto w.r.t. different sample web sites.

[0132] 8. Example of an HTML tree: Illustration of a possible way to parse an HTML document.

[0133] 9. Pattern EER: The entity relationship diagram of pattern and filters.

[0134] 10. Filter EER: The entity relationship diagram of filters and their constituents.

[0135] 11. Rule Evaluation: Algorithm of evaluating an Elog rule.

[0136] 12. subsequence evaluation: Algorithm of evaluating the "subsequence" predicate.

[0137] 13. before evaluation: Algorithm of evaluating the "before" predicate for tree filters.

[0138] 14. before string evaluation: Algorithm for evaluating the "before" predicate for string filters.

[0139] 15. Example pattern structure: Example wrapper program structures.

[0140] 16. Pattern Generation: Pattern Generation Algorithm.

[0141] 17. Tree Filter Generation: Algorithm for creation of tree filters.

[0142] 18. String Filter Generation (including concept/comparison): Algorithm for creation of text filters.

[0143] 19. Attribute Filter Generation: Algorithm for creation of attribute filters.

[0144] 20. Tree Condition Generation External (before,after,notbefore,nota- fter): Algorithm for creating external tree conditions.

[0145] 21. Tree Condition Generation Internal (contains, firstson[startwith], lastson[endwith]): Algorithm for creating internal tree conditions.

[0146] 22. Attribute Selection (including to add concepts and comparison predicates): Algorithm for specifying attribute conditions that refer to constant values or to concepts.

[0147] 23. String Condition Generation External: Algorithm for creating external string conditions.

[0148] 24. String Condition Generation Internal: Algorithm for creating internal string conditions.

[0149] 25. Comparison Selection: Algorithm for adding comparison conditions.

[0150] 26. Range Condition Generation: Algorithm for imposing range intervals to a filter.

[0151] 27. Computing a tree region and basic extraction definition atom based on offsets: Algorithm for computing a tree region and the main filter atom.

[0152] 28. Generating a Document Filter: Algorithm for creating a document filter.

[0153] 29. Pattern Generation with Recursive Wrapping: Pattern Generation Algorithm in case of recursive wrapping.

[0154] 30. Distance Tolerance Selection for External Conditions: How distance tolerance values are computed.

[0155] 31. Evaluation of an Example Elog Program: Illustrating an example evaluation for a recursive wrapper program.

[0156] 32. Extraction Job Server: This extraction job server contains several jobs, each creating an XML companion.

IV. DETAILED DESCRIPTION

[0157] The following description of a preferred embodiment of our invention and of some variations and ramifications thereof is made for illustrating the general principles of the invention and is not to be taken in any limiting sense.

[0158] A. Synopsis of an Implementation

[0159] In a preferred embodiment, the Lixto architecture (FIG. 1) consists of two main building blocks: The Visual Builder and the Program Evaluator. The visual builder allows a wrapper designer to create and to store a wrapper in form of an extraction program (in our preferred embodiment, a program in the language Elog). Moreover, the visual builder allows a designer to specify how extracted data should be translated into XML format and to store such a specification in form of an XML translation scheme consisting of a list of XML translation rules. The program evaluator automatically executes an extraction program and a corresponding XML translation scheme over Web pages by extracting data from them and translating the extracted data into XML format. The program evaluator can act as a stand-alone program (FIG. 2) performing data extraction from Web pages based on previously constructed extraction programs and XML translation scheme. However, the program evaluator is also used during the wrapper designing phase in order to test partial or full wrapper programs.

[0160] The Lixto system can be installed and run as an application on a computer of a designer or end-user, or in a server mode, where designers and/or end users access a running Lixto implementation on some server via a standard browser through the Internet. In the latter case, no installation by a designer or user is necessary.

[0161] Lixto allows a non-professional user to build wrappers directly on the base of one or more browser-displayed sample Web pages (FIG. 5). A wrapper is constructed by formalizing, collecting, and storing the knowledge about desired extraction patterns. Extraction patterns (FIG. 9) describe single data items or chunks of coherent data to be extracted from Web pages by their locations and by their characteristic internal or contextual properties. Extr