United States Patent Application 20020069218
Kind Code A1
Sull, Sanghoon; et al. June 6, 2002

System and method for indexing, searching, identifying, and editing portions of electronic multimedia files
Abstract
A method and system are provided for tagging, indexing, searching, retrieving, manipulating, and editing video images on a wide area network such as the Internet. A first set of methods is provided for enabling users to add bookmarks to multimedia files, such as movies, and audio files, such as music. The multimedia bookmark facilitates the searching of portions or segments of multimedia files, particularly when used in conjunction with a search engine. Additional methods are provided that reformat a video image for use on a variety of devices that have a wide range of resolutions by selecting some material (in the case of smaller resolutions) or more material (in the case of larger resolutions) from the same multimedia file. Still more methods are provided for interrogating images that contain textual information (in graphical form) so that the text may be copied to a tag or bookmark that can itself be indexed and searched to facilitate later retrieval via a search engine.

Inventors: Sull; Sanghoon (Seoul, KR), Kim; Hyeokman (Seoul, KR), Choi; Hyungseok (Seoul, KR), Chung; Min Gyo (Sungnam City, KR), Yoon; Ja-Cheon (Seoul, KR), Oh; Jeongtaek (Seoul, KR), Lee; Sangwook (Seoul, KR), Song; S. Moon-Ho (Seoul, KR), Kim; Jung Rim (Seoul, KR), Lee; Keansub (Suwon City, KR), Chun; Seong Soo (Songnam City, KR), Oh; Sangwook (Cheju City, KR), Kim; Yunam (Cheju City, KR)
Correspondence Name and Address: Ronald L. Chichester, Baker Botts L.L.P., One Shell Plaza, 910 Louisiana Street, Houston, TX 77002-4995, US
Series Code: 911293
Filed: July 23, 2001
U.S. Current Class: 707/501.1
U.S. Class at Publication: 707/501.1
Intern'l Class: G06F 015/00

Claims


What is claimed is:
1. A system for accessing multimedia content stored in a multimedia file having a beginning and an intermediate point, the content having at least one segment at the intermediate point, the system comprising: a multimedia bookmark, the multimedia bookmark having content information about the segment at the intermediate point; wherein a user can utilize the multimedia bookmark to access the segment without accessing the beginning of the multimedia file.

2. The system of claim 1 further comprising a search mechanism that locates the segment in the multimedia file.

3. The system of claim 2 further comprising an access mechanism that reads the multimedia content at the segment designated by the multimedia bookmark.

4. The system of claim 1, wherein the multimedia content is partial data related to a particular at least one segment.

5. The system of claim 1, wherein the multimedia content is visual data comprising one or more frames of video.

6. The system of claim 1, wherein the multimedia content is audio data.

7. The system of claim 1, wherein the multimedia content is a string of characters.

8. The system of claim 1, wherein the multimedia bookmark further comprises positional information about the segment.

9. The system of claim 8, wherein the positional information is a URI.

10. The system of claim 8, wherein the positional information includes an elapsed time.

11. The system of claim 8, wherein the positional information includes a time code.

12. The system of claim 1, wherein the multimedia file is contained on local storage.

13. The system of claim 12, wherein the local storage includes a database.

14. The system of claim 1, wherein the multimedia file is stored on a device accessible via a network.

15. The system of claim 14, wherein the network is the Internet.

16. A system for accessing multimedia content encoded in a master file having a beginning point and an end point and at least one variation file derived from the master file, the system comprising: a segment of the file having a beginning point after the beginning point of the master file and an end point before the end point of the master file that are designated by a user; a multimedia bookmark, the multimedia bookmark having content information about the segment; wherein the user can access the same segment on the master file and the variation file via the multimedia bookmark.

17. The system of claim 16 further comprising a search mechanism that locates the segment in the multimedia file.

18. The system of claim 17 further comprising an access mechanism that reads the multimedia content at the segment designated by the multimedia bookmark.

19. The system of claim 16, wherein the at least two variations are accessible from a network.

20. The system of claim 19, wherein the network is the Internet.

21. The system of claim 16, wherein the multimedia bookmark is accessible from a network.

22. The system of claim 21, wherein the multimedia bookmark is stored in a database.

23. The system of claim 21, wherein the multimedia bookmark is indexed in a search engine.

24. The system of claim 16 further comprising metadata constructed and arranged to store a media profile for each variation file, the media profile containing offset information representing a start time and an end time of the segment that is correlated with the master file.

25. The system of claim 24, wherein the offset information of a variation file is calculated by aligning a referential segment between two different time points from the master file and the variation file.

26. The system of claim 25, wherein the master file is a video.

27. The system of claim 26, wherein the referential segment is between two successive shot boundaries.

28. The system of claim 16 wherein the multimedia bookmark can be copied.

29. The system of claim 28, wherein the multimedia bookmark can be e-mailed.

30. A method of enabling access to multimedia content having a beginning point and an intermediate point, the intermediate point starting a segment of the multimedia content that is designated by a user, the method comprising: saving content information describing the segment in a multimedia bookmark.

31. The method of claim 30 further comprising: searching for a segment that matches content information criteria.

32. The method of claim 30 further comprising: accessing the segment multimedia content matching the content information criteria.

33. A method of enabling access to multimedia content having a beginning point and an intermediate point, the intermediate point starting a segment of the multimedia content that is designated by a user, the method comprising: selecting a multimedia content from a server; playing the multimedia content downloaded from the server by a user; receiving at the server an add-bookmark command from the user; saving content information pertaining to a segment of the multimedia content designated by the user; displaying a bookmarked position of the multimedia content; searching for a multimedia file satisfying search criteria of content information; accessing multimedia content starting from the segment having content information matching the search criteria.

34. The system of claim 33, wherein the multimedia content is partial data related to a particular at least one segment.

35. The system of claim 33, wherein the multimedia content is visual data comprising one or more frames of video.

36. The system of claim 33, wherein the multimedia content is audio data.

37. The system of claim 33, wherein the multimedia content is a string of characters.

38. The system of claim 33, wherein the multimedia bookmark further comprises positional information about the segment.

39. The system of claim 38, wherein the positional information is a URI.

40. The system of claim 38, wherein the positional information includes an elapsed time.

41. The system of claim 38, wherein the positional information includes a time code.

42. The system of claim 41, wherein the multimedia file is contained on local storage.

43. The system of claim 42, wherein the local storage includes a database.

44. The system of claim 33, wherein the multimedia file is stored on a device accessible via a network.

45. The system of claim 44, wherein the network is the Internet.

46. A method for virtual editing multimedia files, the method comprising: providing one or more video files; creating a metadata file for each of the video files, each of the metadata files having at least one segment to be edited; and creating a single edited metafile containing the segments to be edited from each of the metadata files; wherein when the edited metadata file is accessed, the user is able to play the segments to be edited in the edited order.

47. A method for virtual editing multimedia files, the method comprising: providing one or more video files; creating a metadata file for each of the video files, each of the metadata files having at least one segment to be edited; and creating a single edited metafile containing links to the segments to be edited from each of the metadata files in an edited order; wherein when the edited metadata file is accessed, the user is able to play the segments to be edited in the edited order.

48. A method for editing a multimedia file comprising: providing a metafile, the metafile having at least one segment that is selectable; selecting a segment in the metafile; determining if a composing segment should be created, and if the composing segment should be created, then creating a composing segment in a hierarchical structure; specifying the composing segment as a child of a parent composing segment; determining if metadata is to be copied or if a URI is to be used; if the metadata is to be copied, then copying metadata of the selected segment to the component segment; if the URI is to be used, then writing a URI of the selected segment to the component segment; writing a URL of an input video file to the component segment; determining if all URLs of any sibling files are the same; and if the URL is the same as any of the sibling's URLs, then writing the URL to the parent composing segment and deleting the URLs of all sibling segments.

49. The method of claim 48, the method further comprising: determining if another segment is to be selected; and if another segment is to be selected, then performing the step of selecting a segment in a metafile.

50. The method of claim 49, the method further comprising: determining if another metafile is to be browsed; and if another metafile is to be browsed, then performing the step of providing a metafile.

51. The method of claim 46 wherein the metafile is an XML file.

52. The method of claim 47 wherein the metafile is an XML file.

53. The method of claim 48 wherein the metafile is an XML file.

54. A virtual video editor comprising: a network controller, the network controller constructed and arranged to access remote metafiles and remote video files; a file controller, the file controller in operative connection to the network controller, the file controller constructed and arranged to access local metafiles and local video files, and to access the remote metafiles and the remote video files via the network controller; a parser, the parser constructed and arranged to receive information about the files from the file controller; an input buffer, the input buffer constructed and arranged to receive parser information from the parser; a structure manager, the structure manager constructed and arranged to provide structure data to the input buffer; a composing buffer, the composing buffer constructed and arranged to receive input information from the input buffer and structure information from the structure manager to generate composing information; and a generator, the generator constructed and arranged to receive the composing information from the composing buffer; wherein the generator generates output information in a pre-selected format.

55. The virtual video editor of claim 54, the editor further comprising: a playlist generator, the playlist generator constructed and arranged to receive structure information from the structure manager in order to generate playlist information; and a video player, the video player constructed and arranged to receive the playlist information from the playlist generator and file information from the file controller in order to generate display information.

56. The virtual video editor of claim 55, the editor further having a display device constructed and arranged to receive the display information from the video player and to display the display information to a user.

57. A method for transcoding an image for display at multiple resolutions, the method comprising: providing a multimedia file; designating one or more regions of the multimedia file as focus zones; providing a vector to each of the focus zones; reading the multimedia file with a client device, the client device having a maximum display resolution; determining if the resolution of the multimedia file exceeds the maximum display resolution of the client device; if the multimedia file resolution exceeds the maximum display resolution of the display device, then determining the maximum number of focus zones that can be displayed on the client device; and displaying the maximum number of focus zones on the client device.

58. A method for searching for relevant multimedia content based on at least one feature saved in a multimedia bookmark, the method comprising: transmitting at least one feature saved in a multimedia bookmark from a client system to a server system in response to user selection of the multimedia bookmark; generating a query for each feature saved in the multimedia bookmark and received by the server system; searching one or more storage devices using each query generated; and presenting, to the user, search results produced from at least one storage device search.

59. The method of claim 58 further comprising: transmitting image data saved in the multimedia bookmark; and using the image data as a query frame for a frame based search of the storage devices.

60. The method of claim 58 further comprising: transmitting positional information saved in the multimedia bookmark; and using the positional information as a query frame for a frame based search of the storage devices.

61. The method of claim 58 further comprising: transmitting annotated text saved in the multimedia bookmark; and using the annotated text as keywords for a text based search of the storage devices.

62. The method of claim 58 further comprising: determining whether one or more of the search results contains annotated text; using the annotated text as keywords in a text based search for relevant multimedia content; and presenting, to the user, search results from the text based search.

63. The method of claim 58 further comprising the at least one feature saved in a multimedia bookmark including image data and annotated text.

64. The method of claim 58 further comprising the at least one feature saved in a multimedia bookmark including image data and positional information.

65. The method of claim 58 further comprising the at least one feature saved in a multimedia bookmark including positional information and annotated text.

66. The method of claim 58 further comprising the at least one feature saved in a multimedia bookmark including image data, annotated text and positional data.

67. A method for sending a multimedia bookmark between devices over a wireless network, the method comprising: submitting a multimedia bookmark to a video bookmark message service center by a sending device; acknowledging receipt of the multimedia bookmark by the video bookmark message service center to the sending device; requesting routing information for a recipient device from a home location register by the video bookmark message service center; receiving the routing information from the home location register by the video bookmark message service center; invoking a send multimedia bookmark at a mobile switching center; sending the multimedia bookmark to a recipient device by the mobile switching center; acknowledging receipt of the multimedia bookmark by the recipient device; and notifying the video bookmark message service center when the multimedia bookmark has been received by the recipient device.

68. The method of claim 67 further comprising the sending and recipient devices including wireless devices.

69. A method for sending multimedia content to a mobile device for playback over a wireless network, the method comprising: submitting a multimedia bookmark and a request for multimedia content playback from the mobile device to a mobile switching center; sending the multimedia bookmark and the request for playback to a video bookmark message service center by the mobile switching center; determining a bit rate suitable for transmission of the multimedia content to the mobile device by the video bookmark message service center; calculating a new multimedia bookmark based on the transmission bit rate and characteristics of the mobile device; sending the new multimedia bookmark to a multimedia server; and streaming the multimedia content from the multimedia server to the video bookmark message service center before delivering the multimedia content to the mobile device via the mobile switching center.

70. The method of claim 69 further comprising streaming video content to a personal digital assistant.

71. A method for verifying inclusion of attachments to electronic mail messages, the method comprising: scanning the electronic mail message for at least one indicator of an attachment to be included; determining whether at least one attachment to the electronic mail message is present upon detection of the at least one indicator of an attachment to be included; and displaying a reminder to a user in the event at least one indicator of an attachment to be included is found but no attachment is determined to be present.

72. The method of claim 71 further comprising comparing contents of the electronic mail message with language settings designated by the user to determine whether at least one indicator of an attachment to be included is present.

73. A content transcoder for modifying and forwarding multimedia content maintained in one or more multimedia content databases to a wide area network for display on a requesting client device, the content transcoder comprising: a policy engine operably coupled to the multimedia content database; a content analyzer operably coupled to the policy engine and the multimedia content database; a content selection module operably coupled to the policy engine and the content analyzer; a content manipulation module operably coupled to the content selection module; a content analysis and manipulation library operably coupled to the content analyzer, the content selection module and the content manipulation module; and wherein the policy engine is operable to receive a request for multimedia content from the requesting client device via the wide area network and to receive policy information from the multimedia content database; the content analyzer is operable to retrieve multimedia content from the multimedia content database and to forward the multimedia content to the content selection module; the content selection module is operable to select portions of the multimedia content based on the policy information and information from the content analysis and manipulation library and to forward the selected portions of multimedia content to the content manipulation module; the content manipulation module is operable to modify the multimedia content for display on a requesting client device prior to transmitting the modified multimedia content over the wide area network to the requesting client device.

74. The content transcoder of claim 73 further comprising the requesting client device including a personal digital assistant (PDA).

75. The content transcoder of claim 73 further comprising the requesting client device including a laptop computer.

76. The content transcoder of claim 73 further comprising the requesting client device including a television.

77. The content transcoder of claim 73 further comprising the requesting client device including a personal computer.

78. The content transcoder of claim 73 further comprising the requesting client device including a personal data appliance.

79. The content transcoder of claim 73 further comprising the requesting client device including a mobile telephone.

80. A method of searching for multimedia content in a peer to peer environment, the method comprising: broadcasting a message from a user system to announce entrance to the peer to peer environment; acknowledging receipt of the broadcast message by one or more active nodes in the peer to peer environment; tracking the active nodes by the user system; broadcasting a query message including multimedia features to the peer to peer environment upon initiation of a search request by the user system; executing a multimedia search engine on a multimedia database included in a storage device on one or more active nodes upon receipt of the query message; and responding to the query message with a search results message including a listing of found filenames and network address locations.

Description



BACKGROUND OF THE INVENTION

[0001] 1. Field of the Invention

[0002] The present invention relates generally to marking multimedia files. More specifically, the present invention relates to applying or inserting tags into multimedia files for indexing and searching, as well as for editing portions of multimedia files, all to facilitate the storing, searching, and retrieving of the multimedia information.

[0003] 2. Background of the Related Art

[0004] 1. Multimedia Bookmarks

[0005] With the phenomenal growth of the Internet, the amount of multimedia content that can be accessed by the public has virtually exploded. There are occasions where a user who once accessed particular multimedia content needs or desires to access the content again at a later time, possibly at or from a different place. For example, in the case of data interruption due to a poor network condition, the user may be required to access the content again. In another case, a user who once viewed multimedia content at work may want to continue to view the content at home. Most users would want to restart accessing the content from the point where they had left off. Moreover, subsequent access may be initiated by a different user in an exchange of information between users. Unfortunately, multimedia content is represented in a streaming file format so that a user has to view the file from the beginning in order to look for the exact point where the first user left off.

[0006] In order to save the time involved in browsing the data from the beginning, the concept of a bookmark may be used. A conventional bookmark marks a document such as a static web page for later retrieval by saving a link (address) to the document. For example, Internet browsers support a bookmark facility by saving an address called a Uniform Resource Identifier (URI) to a particular file. Internet Explorer, manufactured by the Microsoft Corporation of Redmond, Wash., uses the term "favorite" to describe a similar concept.

[0007] Conventional bookmarks, however, store only the information related to the location of a file, such as the directory name with a file name, a Uniform Resource Locator (URL), or the URI. The files referred to by conventional bookmarks are treated in the same way regardless of the data formats used for storing the content. Typically, a simple link is used for multimedia content as well. For example, to link to a multimedia content file through the Internet, a URI is used. Each time the file is revisited using the bookmark, the multimedia content associated with the bookmark is always played from the beginning.

[0008] FIG. 1 illustrates a list 108 of conventional bookmarks 110, each comprising positional information 112 and a title 114. The positional information 112 of a conventional bookmark is composed of a URI as well as a bookmarked position 106. The bookmarked position is a relative time or byte position measured from the beginning of the multimedia content. The title 114 can be specified by a user or delivered with the content, and it is typically used to help the user recognize the bookmarked URI in the bookmark list 108. In the case of a conventional bookmark without a bookmarked position, when a user wants to replay the specified multimedia file, the file is played from the beginning each time, regardless of how much of the file the user has already viewed. The user has no choice but to note the last accessed position and to move manually to the last stopped point. If the multimedia file is viewed by streaming, the user must go through repeated buffering to find the last accessed position, thus wasting much time. Even for a conventional bookmark with a bookmarked position, the same problem occurs when the multimedia content is delivered in a live broadcast, since the bookmarked position within the multimedia content is not usually available, and when the user wants to replay one of the variations of the bookmarked multimedia content.
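As an illustration only, the conventional bookmark of FIG. 1 can be sketched as a small record. The class name, field names, and the choice of seconds as the unit are assumptions made for this sketch and do not appear in the application.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ConventionalBookmark:
    """Hypothetical record mirroring FIG. 1: positional information plus a title."""
    uri: str                                   # location of the multimedia file
    title: str                                 # user-specified or delivered with the content
    position_seconds: Optional[float] = None   # relative time from the beginning, if stored

    def start_position(self) -> float:
        # Without a bookmarked position, playback can only begin at the start of the file.
        return 0.0 if self.position_seconds is None else self.position_seconds


# A bookmark list 108 is then simply a list of such records.
bookmarks = [
    ConventionalBookmark("http://example.com/news.mpg", "Evening news"),
    ConventionalBookmark("http://example.com/news.mpg", "Evening news, weather", 95.5),
]
```

Note that neither record carries any information about the content itself, which is the limitation discussed in the following paragraphs.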

[0009] Further, conventional bookmarks do not provide a convenient way of switching between different data formats. Multimedia content may be generated and stored in a variety of formats. For example, video may be stored in formats such as MPEG, ASF, RM, MOV, and AVI. Audio may be stored in formats such as MID, MP3, and WAV. There may be occasions where a user wants to switch the play of content from one format to another. Since different data formats produced from the same multimedia content are often encoded independently, the same segment is stored at different temporal positions within the different formats. Since conventional bookmarks have no facility to store any content information, users have no choice but to review the multimedia content from the beginning and to search manually for the last-accessed segment within the content.

[0010] Time information may be incorporated into a bookmark to return to the last-accessed segment within the multimedia content. The use of time information alone, however, fails to return to exactly the same segment at a later time for the following reasons. If a bookmark incorporating time information was used to save the last-accessed segment during a preview broadcast of multimedia content, the bookmark information would not be valid for returning to the last-accessed segment during a regular full-version broadcast. Similarly, if a bookmark incorporating time information was used to save the last-accessed segment during a real-time broadcast, the bookmark would not be effective during later access because the later available version may have been edited or a time code was not available during the real-time broadcast.

[0011] In many video and audio archiving systems, several differently compressed files, called "variations", may be produced from a single source multimedia content. Many web-casting sites provide multiple streaming files for a single video content, with different bandwidths for each video format. For example, CNN.com provides five different streaming videos for a single video content: two different types of streaming videos with the bandwidths of 28.8 kbps and 80 kbps, both encoded in Microsoft's Advanced Streaming Format (ASF). CNN.com also provides the RM streaming format by RealNetworks, Inc. of Seattle, Wash. (RM), and a streaming video with the smart bandwidth encoded in Apple Computer, Inc.'s QuickTime streaming format (MOV). In this case, the five video files may start and end at different time points from the viewpoint of the source video content, since each variation may be produced by an independent encoding process varying the values chosen for encoding formats, bandwidths, resolutions, etc. This results in mismatches of time points because a specific time point of the source video content may be presented as different media time points in the five video files.

[0012] When a multimedia bookmark is utilized, the mismatches of positions cause a problem of mis-positioned playback. Consider a simple case where one makes a multimedia bookmark on a master file of a multimedia content (for example, video encoded in a given format), and tries to play another variation (for example, video encoded in a different format) from the bookmarked position. If the two variations do not start at the same position of the source content, the playback will not start at the bookmarked position. That is, the playback will start at a position that is temporally shifted by the difference between the start positions of the two variations.
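The shift described above can be made concrete with a short sketch. It assumes, purely for illustration, that the start position of each file relative to the source content is known in seconds; the function and parameter names are hypothetical.

```python
def variation_position(master_position: float,
                       master_start: float,
                       variation_start: float) -> float:
    """Map a bookmarked media time in the master file to a variation file.

    master_start and variation_start are the (assumed known) times, in seconds
    of source content, at which each file begins.
    """
    # Time point in the source content that the bookmark refers to.
    source_time = master_start + master_position
    # Media time of that same source time point inside the variation.
    return max(0.0, source_time - variation_start)


# The master starts at source time 0 s and the variation at source time 3 s.
# Reusing the bookmarked 120 s unchanged in the variation would show source
# time 123 s, a 3 s shift; the corrected media time is 117 s.
corrected = variation_position(120.0, master_start=0.0, variation_start=3.0)
```

The clamp to zero covers the case where the bookmarked point lies before the variation begins, in which case playback can only start at the beginning of the variation.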

[0013] The entire multimedia presentation is often lengthy, and there are frequent occasions when the presentation is interrupted, voluntarily or forcibly, and terminates before finishing. Examples include a user who starts playing a video at work, leaves the office, and desires to continue watching the video at home, or a user who is forced to stop watching the video and log out due to a system shutdown. It is thus necessary to save the termination position of the multimedia file into persistent storage in order to return directly to the point of termination without a time-consuming playback of the multimedia file from the beginning.

[0014] The interrupted presentation of the multimedia file will usually resume exactly at the previously saved terminated position. However, in some cases, it is desirable to begin the playback of the multimedia file a certain time before the terminated point, since such rewinding could help refresh the user's memory.
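A minimal sketch of this resume rule follows, assuming positions are expressed in seconds and that the rewind interval is a user or system setting; both names are hypothetical.

```python
def resume_position(terminated_at: float, rewind: float = 0.0) -> float:
    """Return where playback should resume after an interruption.

    With rewind == 0 playback resumes exactly at the saved termination
    position; a positive rewind starts that many seconds earlier, clamped
    so that the result never falls before the beginning of the file.
    """
    if rewind < 0:
        raise ValueError("rewind must be non-negative")
    return max(0.0, terminated_at - rewind)


exact = resume_position(754.0)                  # resume at the saved position
refreshed = resume_position(754.0, rewind=10.0)  # resume 10 s earlier
```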

[0015] In the prior art, the EPG (Electronic Program Guide) has played a crucial role as a provider of TV programming information. EPG facilitates a user's efforts to search for TV programs that he or she wants to view. However, EPG's two-dimensional presentation (channels vs. time slots) becomes cumbersome as terrestrial, cable, and satellite systems send out thousands of programs through hundreds of channels. Navigation through a large table of rows and columns in order to search for desired programs is frustrating.

[0016] One of the features provided by recent set-top boxes (STBs) is personal video recording (PVR), which allows simultaneous recording and playback. Such an STB usually contains a digital video encoder/decoder based on an international digital video compression standard such as MPEG-1/2, as well as large local storage for the digitally compressed video data. Some recent STBs also allow connection to the Internet. Thus, STB users can experience new services such as time-shifting and web-enhanced television (TV).

[0017] However, there still exist some problems for the PVR-enabled STBs. The first problem is that even the latest STBs alone cannot fully satisfy users' ever-increasing desire for diverse functionalities. The STBs now on the market are very limited in computing power and memory, so it is not easy to execute most CPU- and memory-intensive applications. For example, users who are bored with plain playback of the recorded video may desire more advanced features such as video browsing/summary and search. All of those features require metadata for the recorded video. The metadata are usually the data describing content, such as the title, genre and summary of a television program. The metadata also include audiovisual characteristic data such as raw image data corresponding to a specific frame of the video stream. Some of the description is structured around "segments" that represent spatial, temporal or spatio-temporal components of the audio-visual content. In the case of video content, the segment may be a single frame, a single shot consisting of successive frames, or a group of several successive shots. Each segment may be described by some elementary semantic information using text. The segment is referenced by the metadata using media locators such as frame numbers or time codes. However, the generation of such video metadata usually requires intensive computation and a human operator's help, so practically speaking, it is not feasible to generate the metadata in the current STB. Thus, one possible solution for this problem is to generate the metadata in the server connected to the STB and to deliver it to the STB via a network. However, in this scenario, it is essential to know the start position of the recorded video with respect to the video stream used to generate the metadata in the server/content provider in order to match the temporal position referenced by the metadata to the position of the recorded video.

[0018] The second problem relates to the discrepancy between two time instants: the instant at which the STB starts recording the user-requested TV program, and the instant at which the TV program is actually broadcast. Suppose, for instance, that a user initiates a PVR request for a TV program scheduled to go on the air at 11:30 AM, but the actual broadcasting time is 11:31 AM. In this case, when the user plays the recorded program, the user has to watch an unwanted one-minute segment at the beginning of the recorded video. This time mismatch is an inconvenience to a user who wants to view only the requested program. However, the time mismatch problem can be solved by using metadata delivered from the server, for example, reference frames/segments representing the beginning of the TV program. The exact location of the TV program can then be found by matching the reference frames against all the recorded frames for the program.
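The matching step described above can be illustrated with a minimal sketch. It assumes each frame has already been reduced to a short numeric signature (the tuples below are hypothetical, for example coarse luminance averages) and that the server delivers a short run of reference signatures marking the start of the program. The function names and the sum-of-absolute-differences distance are illustrative choices, not taken from the original text.

```python
def frame_distance(a, b):
    """Sum of absolute differences between two equal-length frame signatures."""
    return sum(abs(x - y) for x, y in zip(a, b))


def find_program_start(recorded, reference, max_distance=0):
    """Return the first index in `recorded` where the `reference` run matches.

    `recorded` and `reference` are lists of frame signatures (hypothetical
    fixed-length tuples). Returns None when no window of recorded frames is
    within `max_distance` of the reference run, frame by frame.
    """
    n, m = len(recorded), len(reference)
    for start in range(n - m + 1):
        window = recorded[start:start + m]
        if all(frame_distance(f, r) <= max_distance
               for f, r in zip(window, reference)):
            return start
    return None


# Toy data: three frames of an earlier program, then the requested one.
recorded = [(9, 9), (9, 8), (8, 8), (1, 2), (2, 3), (3, 4)]
reference = [(1, 2), (2, 3)]
start = find_program_start(recorded, reference)
```

Playback would then begin at index `start` rather than at the first recorded frame. A nonzero `max_distance` tolerates small differences caused by re-encoding.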

[0019] 2. Search

[0020] The rapid expansion of the World Wide Web (WWW) and mobile communications has also brought great interest in efficient multimedia data search, browsing and management. Content-based image retrieval (CBIR) is a powerful concept for finding images based on image contents, and content-based image search and browsing have been tested using many CBIR systems. See, M. Flickner, Harpreet Sawhney, Wayne Niblack, Jonathan Ashley, Q. Huang, Byron Dom, Monika Gorkani, Jim Hafner, Denis Lee, Dragutin Petkovic, David Steele and Peter Yanker, "Query by image and video content: The QBIC system," IEEE Computer, Vol. 28, No. 9, pp. 23-32, September, 1995; Carson, Chad et al., "Region-Based Image Querying [Blobworld]," Workshop on Content-Based Access of Image and Video Libraries, Puerto Rico, June, 1997; J. R. Smith and S. Chang, "Visually searching the web for content," IEEE Multimedia Magazine, Vol. 4, No. 3, pp. 12-20, Summer 1997, also Columbia U. CU/CTR Technical Report 459-96-25; A. Pentland, R. W. Picard and S. Sclaroff, "Photobook: tools for content-based manipulation of image databases," in Proc. Of SPIE Conf. On Storage and Retrieval for Image and Video Databases-II, No. 2185, pp. 34-47, San Jose, Calif., February, 1994; J. R. Bach, C. Fuller, A. Guppy, A. Hampapur, B. Horowitz, R. Humphrey, R. C. Jain and C. Shu, "Virage image search engine: an open framework for image management," Symposium on Electronic Imaging: Science and Technology--Storage & Retrieval for Image and Video Databases IV, IS&T/SPIE'96, February, 1996; J. R. Smith and S. Chang, "VisualSEEk: A Fully Automated Content-Based Image Query System," ACM Multimedia Conference, Boston, Mass., November, 1996; Jing Huang, S. Ravi Kumar, Mandar Mitra, Wei-Jing Zhu and Ramin Zabih, "Image Indexing Using Color Correlograms," in IEEE Conference on Computer Vision and Pattern Recognition, pp. 762-768, June, 1997; and Simone Santini and Ramesh Jain, "The `El Nino` Image Database System," in International Conference on Multimedia Computing and Systems, pp. 524-529, June, 1999.

[0021] Currently, most content-based image search engines rely on low-level image features such as color, texture and shape. While high-level image descriptors are potentially more intuitive for common users, the derivation of high-level descriptors is still in its experimental stages in the field of computer vision and requires complex vision processing. Low-level image features, on the other hand, are efficient and easy to implement, but their main disadvantage is that they are perceptually non-intuitive for both expert and non-expert users and, therefore, do not normally represent users' intent effectively. Furthermore, they are highly sensitive to small image variations in feature shape, size, position, orientation, brightness and color. Perceptually similar images are often highly dissimilar in terms of low-level image features. Searches made by low-level features are often unsuccessful, and it usually takes many trials to find images satisfactory to a user.

[0022] Efforts have been made to overcome the limitations of low-level features. Relevance feedback is a popular idea for incorporating a user's perceptual feedback in the image search. See, Y. Rui, T. Huang, and S. Mehrotra, "A relevance feedback architecture in content-based multimedia information retrieval systems," in IEEE Workshop on Content-based Access of Image and Video Libraries, Puerto Rico, pp. 82-89, June, 1997; Yong Rui, Thomas S. Huang, Michael Ortega, and Sharad Mehrotra, "Relevance Feedback: A Power Tool in Interactive Content-Based Image Retrieval," in IEEE Trans. on Circuits and Systems for Video Technology, Special Issue on Segmentation, Description, and Retrieval of Video Content, pp. 644-655, Vol. 8, No. 5, September, 1998; G. Aggarwal, P. Dubey, S. Ghosal, A. Kulshreshtha, and A. Sarkar, "iPURE: perceptual and user-friendly retrieval of images," in Proc. of IEEE International Conference on Multimedia and Exposition, Vol. 2, pp. 693-696, July, 2000; Ye Lu, Chunhui Hu, Xingquan Zhu, HongJiang Zhang and Qiang Yang, "A unified framework for semantics and feature based relevance feedback in image retrieval systems," in Proc. of ACM International Conference on Multimedia, pp. 31-37, October, 2000; H. Muller, W. Muller, S. Marchand-Maillet, and T. Pun, "Strategies for positive and negative relevance feedback in image retrieval," in Proc. of IEEE Conference on Pattern Recognition, Vol. 1, pp. 1043-1046, September, 2000; S. Aksoy, R. M. Haralick, F. A. Cheikh, and M. Gabbouj, "A weighted distance approach to relevance feedback," in Proc. of IEEE Conference on Pattern Recognition, Vol. 4, pp. 812-815, September, 2000; I. J. Cox, M. L. Miller, T. P. Minka, T. V. Papathomas, and P. N. Yianilos, "The Bayesian image retrieval system, PicHunter: theory, implementation, and psychophysical experiments," in IEEE Transaction on Image Processing, Vol. 9, pp. 20-37, January, 2000; P. Muneesawang and Guan Ling, "Multi-resolution-histogram indexing and relevance feedback learning for image retrieval," in Proc. of IEEE International Conference on Image Processing, Vol. 2, pp. 526-529, January, 2001. A user can manually establish relevance between a query and retrieved images, and the relevant images can be used for refining the query. When the refinement is made by adjusting a set of low-level feature weights, however, the user's intent is still represented by low-level features, and their basic limitations remain.

[0023] Several approaches have been taken to integrate human perceptual responses and low-level features in image retrieval. One notable approach is to adjust the distance attributes of an image's features based on human perceptual input. See, Simone Santini and Ramesh Jain, "The `El Nino` Image Database System," in International Conference on Multimedia Computing and Systems, pp. 524-529, June, 1999. Another approach, called "blob world," combines low-level features to derive slightly higher-level descriptions and presents the "blobs" of grouped features to a user to provide a better understanding of feature characteristics. See, Carson, Chad, et al., "Region-Based Image Querying [Blobworld]," Workshop on Content-Based Access of Image and Video Libraries, Puerto Rico, June, 1997. While those schemes successfully reflect a user's intent to some degree, it remains to be seen how grouping of features or feature distance modification can achieve perceptual relevance in image retrieval. A more traditional computer vision approach, deriving high-level object descriptors based on generic object recognition, has also been presented for image retrieval. See, David A. Forsyth and Margaret Fleck, "Body Plans," in IEEE Conference on Computer Vision and Pattern Recognition, pp. 678-683, June, 1997. Due to its limited feasibility for general image objects and its complex processing, its utility is still restricted.

[0024] With the rapid proliferation of large image/video databases, there has been an increasing demand for effective methods to search the large image/video databases automatically by their content. For a query image/video clip given by a user, these methods search the databases for the images/videos that are most similar to the query. In other words, the goal of the image/video search is to find best matches to the query image/video from the database.

[0025] Several approaches have been taken toward the development of fast, effective multimedia search methods. Milanese et al. utilized hierarchical clustering to organize an image database into visually similar groupings. See, R. Milanese, D. Squire, and T. Pun, "Correspondence analysis and hierarchical indexing for content-based image retrieval," in Proc. IEEE Int. Conf. Image Processing, Vol. 3, Lausanne, Switzerland, pp. 859-862, September, 1996. Zhang and Zhong provided a hierarchical self-organizing map (HSOM) method to organize an image database into a two-dimensional grid. See, H. J. Zhang and D. Zhong, "A scheme for visual feature based image indexing," in Proc. SPIE/IS&T Conf. Storage Retrieval Image Video Database III, Vol. 2420, pp. 36-46, San Jose, Calif., February, 1995. However, a weakness of HSOM is that it is generally too computationally expensive to apply to a large multimedia database.

[0026] In addition, there are other well-known solutions using the Voronoi diagram, Kd-tree, and R-tree. See, J. Bentley, "Multidimensional binary search trees used for associative searching," Comm. of the ACM, Vol. 18, No. 9, pp. 509-517, 1975; S. Brin, "Near neighbor search in large metric spaces," in Proc. 21st Conf. On Very Large Databases (VLDB '95), Zurich, Switzerland, pp. 574-584, 1995. However, it is also known that those approaches are not adequate for high-dimensional feature vector spaces and are thus useful only in low-dimensional feature spaces.

[0027] Peer to Peer Searching

[0028] Peer-to-Peer (P2P) is a class of applications making the most of previously unused resources (for example, storage, content, and/or CPU cycles), which are available on the peers at the edges of networks. P2P computing allows the peers to share the resources and services, or to aggregate CPU cycles, or to chat with each other, by direct exchange. Two of the more popular implementations of P2P computing are Napster and Gnutella. Napster has its peers register files with a broker, and uses the broker to search for files to copy. The broker plays the role of server in a client-server model to facilitate the interaction between the peers. Gnutella has peers register files with network neighbors, and searches the P2P network for files to copy. Since this model does not require a centralized broker, Gnutella is considered to be a true P2P system.

[0029] 3. Editing

[0030] In the prior art, video files were edited through video editing software by copying several segments of the input videos and pasting them to an output video. The prior art method, however, confronts two major problems mentioned below.

[0031] The first problem of the prior art method is that it requires additional storage for the new version of an edited video file. Conventional video editing software generally uses the original input video file to create an edited video. In most cases, editors with a large database of videos edit existing videos to create a new one. In this case, storage is wasted on duplicated portions of the video. The second problem with the prior art method is that entirely new metadata have to be generated for a newly created video. If the metadata are not edited in step with the editing of the video, then even if metadata for a specific segment of the input video have already been constructed, the metadata may not accurately reflect the content. Because considerable effort is required to create the metadata of videos, it is desirable to reuse existing metadata efficiently, if possible.

[0032] Metadata of a video segment contain textual information such as time information (for example, starting frame number and duration, or starting and finishing frame numbers), title, keyword, and annotation, as well as image information such as the key frame of a segment. The metadata of segments can form a hierarchical structure in which a larger segment contains smaller segments. Because it is hard to store both a video and its metadata in a single file, the video metadata are stored separately as a metafile, or in a database management system (DBMS).
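A hierarchical segment record of the kind described above might be sketched as follows. The class name `Segment`, its field names, and the sample program are hypothetical and serve only to make the structure concrete. The sketch uses the "starting frame number and duration" form of time information.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Segment:
    title: str
    start_frame: int
    duration: int  # number of frames
    keywords: List[str] = field(default_factory=list)
    key_frame: Optional[int] = None
    children: List["Segment"] = field(default_factory=list)

    @property
    def end_frame(self) -> int:
        """Finishing frame number derived from start and duration."""
        return self.start_frame + self.duration - 1

    def find(self, keyword: str) -> List["Segment"]:
        """Depth-first search for segments annotated with `keyword`."""
        hits = [self] if keyword in self.keywords else []
        for child in self.children:
            hits.extend(child.find(keyword))
        return hits


news = Segment("Evening news", 0, 3000, ["news"], key_frame=0, children=[
    Segment("Headlines", 0, 900, ["news", "headlines"], key_frame=30),
    Segment("Weather", 900, 600, ["weather"], key_frame=950),
])
weather = news.find("weather")[0]
```

Because each record carries only frame references, the same metadata tree can be kept in a separate metafile or mapped to DBMS rows without touching the video file itself.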

[0033] Hierarchically structured metadata support browsing a whole video, searching for a segment using the keyword and annotation of each segment, and using the key frames of each segment as a visual summary of the video. They also support not only the existing simple playback, but also playback and repeated playback of a specific segment. Therefore, the use of hierarchically structured metadata is becoming popular.

[0034] 4. Transcoding

[0035] With the advance of information technology, such as the popularity of the Internet, multimedia presentation proliferates into ever increasing kinds of media, including wireless media. Multimedia data are accessed by ever increasing kinds of devices such as hand-held computers (HHCs), personal digital assistants (PDAs), and smart cellular phones. There is a need for accessing multimedia content in a universal fashion from a wide variety of devices. See, J. R. Smith, R. Mohan and C. Li, "Transcoding Internet Content for Heterogeneous Client Devices," in Proc. ISCASA, Monterey, Calif., 1998.

[0036] Several approaches have been taken to effectively enable such universal multimedia access (UMA). A data representation, the InfoPyramid, is a framework for aggregating the individual components of multimedia content with content descriptions, and methods and rules for handling the content and content descriptions. See, C. Li, R. Mohan and J. R. Smith, "Multimedia Content Description in the InfoPyramid," in Proc. IEEE Intern. Conf. on Acoustics, Speech and Signal Processing, May, 1998. The InfoPyramid describes content in different modalities, at different resolutions and at multiple abstractions. A transcoding tool then dynamically selects from the InfoPyramid the resolutions or modalities that best meet the client capabilities. J. R. Smith proposed a notion of importance value for each of the regions of an image as a hint to reduce the overall data size in bits of the transcoded image. See, J. R. Smith, R. Mohan and C. Li, "Content-based Transcoding of Images in the Internet," in Proc. IEEE Intern. Conf. on Image Processing, October, 1998; S. Paek and J. R. Smith, "Detecting image Purpose in World-Wide Web Documents," in Proc. SPIE/IS&T Photonics West, Document Recognition, January, 1998. The importance value describes the relative importance of a region/block in the image presentation compared with the other regions. This value ranges from 0 to 1, where 1 stands for the most important region and 0 for the least important. For example, the regions of high importance are compressed with a lower compression factor than the remaining part of the image. The other parts of the image are first blurred and then compressed with a higher compression factor in order to reduce the overall data size of the compressed image.
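The relationship between importance value and compression factor can be illustrated with a small sketch. The linear mapping and the bounds `low` and `high` are assumptions made for illustration. The cited work does not specify them here, and the region names are hypothetical.

```python
def compression_factor(importance, low=2.0, high=20.0):
    """Map an importance value in [0, 1] to a compression factor.

    importance 1 -> `low` (lightest compression), 0 -> `high` (heaviest).
    The linear mapping and the default bounds are illustrative assumptions.
    """
    if not 0.0 <= importance <= 1.0:
        raise ValueError("importance must lie in [0, 1]")
    return high - importance * (high - low)


# Hypothetical regions of one image with their importance values.
regions = {"face": 1.0, "caption": 0.5, "background": 0.0}
factors = {name: compression_factor(v) for name, v in regions.items()}
```

Under these assumptions the face region receives the lowest factor and the background the highest, which matches the ordering described above.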

[0037] When an image is transmitted to a variety of client devices with different display sizes, a scaling mechanism, such as format/resolution change, bit-wise data size reduction, or object dropping, is needed. More specifically, a system should generate a transcoded (e.g., scaled and cropped) image to fit the size of the respective client display. The extent of transcoding depends on the type of objects embedded in the image, such as cards, bridges, faces, and so forth. Consider, for example, an image containing embedded text or a human face. If the display size of a client device is smaller than the size of the image, the spatial resolution of the image must be reduced by sub-sampling and/or cropping to fit the client display. In such a case, users very often have difficulty recognizing the text or the human face due to the excessive resolution reduction. Although the importance value may be used to indicate which part of the image can be cropped, it does not provide a quantified measure of perceptibility indicating the degree of allowable transcoding. For example, the prior art does not provide quantitative information on the allowable compression factor with which the important regions can be compressed while preserving the minimum fidelity that an author or publisher intended. The InfoPyramid neither provides quantitative information about how much the spatial resolution of the image can be reduced nor ensures that the user will perceive the transcoded image as the author or publisher initially intended.

[0038] 5. Visual Rhythm

[0039] Fast construction of visual rhythm

[0040] Once the digital video is indexed, more manageable and efficient forms of retrieval may be developed based on the index that facilitate storage and retrieval. Generally, the first step in indexing and retrieving visual data is to temporally segment the input video, that is, to find shot boundaries due to camera shot transitions. The temporally segmented shots can improve the storing and retrieving of visual data if keywords for the shots are also available. Therefore, a fast and accurate automatic shot detector needs to be developed, as well as an automatic text caption detector to automatically annotate the temporally segmented shots with keywords.

[0041] Although abrupt scene changes are relatively easy to detect, it is more difficult to identify special effects such as dissolves and wipes. Unfortunately, these special effects are normally used to stress the importance of a scene change (from a content point of view), so they are highly relevant and should not be missed. However, wipe detection has received less attention than dissolve detection. For scene change detection, a matching process between two consecutive frames is required. In order to segment a video sequence into shots, a dissimilarity measure between two frames must be defined. This measure must return a high value only when the two frames fall in different shots. Several researchers have used dissimilarity measures based on the luminance or color histogram, correlogram, or other visual features to match two frames. However, these approaches usually produce many false alarms, and it is very hard for humans to locate exactly the various types of shot transitions (especially dissolves and wipes) in a given video even when the dissimilarity measures between frames are plotted, for example in a 1-D graph where the horizontal axis represents the time of the video sequence and the vertical axis represents the dissimilarity between the histograms of consecutive frames. These approaches also require a high computational load to handle the different shapes, directions and patterns of various wipe effects. Therefore, it is important to develop a tool that enables a human operator to efficiently verify the results of automatic shot detection, which usually contain many falsely detected and missed shots. Visual rhythm satisfies many of the above conditions.
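The histogram-based dissimilarity measure mentioned above can be sketched as follows. Frames are represented as flat lists of luminance values in the range 0 to 255. The bin count, the normalization, and the threshold are assumptions chosen for the toy data, and the function names are hypothetical.

```python
def histogram(frame, bins=4, max_value=256):
    """Count luminance values (0..max_value-1) into `bins` equal-width bins."""
    counts = [0] * bins
    for value in frame:
        counts[value * bins // max_value] += 1
    return counts


def dissimilarity(frame_a, frame_b, bins=4):
    """Sum of absolute bin differences, normalized to the range [0, 1]."""
    ha, hb = histogram(frame_a, bins), histogram(frame_b, bins)
    total = len(frame_a) + len(frame_b)
    return sum(abs(a - b) for a, b in zip(ha, hb)) / total


def detect_cuts(frames, threshold=0.5):
    """Return indices i where frame i is judged to start a new shot."""
    return [i for i in range(1, len(frames))
            if dissimilarity(frames[i - 1], frames[i]) > threshold]


dark = [10, 20, 30, 40]        # toy 4-pixel frames of luminance values
bright = [200, 210, 220, 230]
cuts = detect_cuts([dark, dark, bright, bright, dark])
```

A fixed threshold of this kind catches abrupt cuts but, as the text notes, tends to miss or misplace gradual transitions such as dissolves and wipes, where the per-frame dissimilarity stays small.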

[0042] Visual rhythm contains distinctive patterns or visual features for many types of video editing effects. In particular, all wipe-like effects manifest as visually distinguishable lines or curves on the visual rhythm, which can be constructed with very little computational time. This enables a human to easily verify automatically detected shots without actually playing the whole frame sequence, minimizing or possibly eliminating false as well as missing shots. Visual rhythm also contains visual features that can readily be used to detect caption text. See, H. Kim, J. Lee and S. M. Song, "An efficient graphical shot verifier incorporating visual rhythm", in Proceedings of IEEE International Conference on Multimedia Computing and Systems, pp. 827-834, June, 1999.
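This section does not spell out how a visual rhythm is built. The sketch below assumes one common construction, in which pixels are sampled along the main diagonal of each frame and the samples of successive frames are placed side by side as columns. The helper names and the toy wipe sequence are hypothetical.

```python
def diagonal_samples(frame):
    """Sample one pixel per row along the main diagonal of a 2-D frame."""
    height, width = len(frame), len(frame[0])
    return [frame[r][r * (width - 1) // max(height - 1, 1)]
            for r in range(height)]


def visual_rhythm(frames):
    """Build a 2-D image whose column t holds the samples of frame t."""
    columns = [diagonal_samples(f) for f in frames]
    # Transpose so rows follow the sampling position and columns follow time.
    return [list(row) for row in zip(*columns)]


def solid(value, size=3):
    """A size x size frame filled with a single pixel value."""
    return [[value] * size for _ in range(size)]


def wipe(step, size=3):
    """Toy horizontal wipe: columns left of `step` show the new shot (9)."""
    return [[9 if c < step else 0 for c in range(size)] for _ in range(size)]


frames = [solid(0), wipe(1), wipe(2), solid(9)]
rhythm = visual_rhythm(frames)
```

In the resulting image the boundary between the old shot (0) and the new shot (9) runs diagonally across time, which is the kind of visually distinguishable line that a wipe is said to produce.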

[0043] Detecting Text in Video and Graphic Images

[0044] As content becomes readily available on wide area networks such as the Internet, archiving, searching, indexing and locating desired content in large volumes of multimedia containing images and video, in addition to text information, will become even more difficult. One important source of information about images and video is the text contained therein. Video can be easily indexed if access to this textual content is available. The text provides clear semantics for the video and is extremely useful in deducing its contents.

[0045] There are many ways to segment and recognize text in printed documents. Current video research tackles the text caption recognition problem as a series of sub-problems: (a) identifying the existence and location of text captions in a complex background; (b) segmenting text regions; and (c) post-processing the text regions for recognition using standard OCR. Most current research focuses on tackling sub-problems (a) and (b) in the raw spatial domain, with a few methods that can be extended to compressed domain processing.

[0046] A large number of methods have been studied extensively in recent years to detect text frames in uncompressed images and video. Ohya et al. performed character extraction through local thresholding and detected character candidate regions by evaluating gray level differences between adjacent regions. See, J. Ohya, A. Shio and S. Akamatsu, "Recognizing Characters in Scene Image," in IEEE Trans. On Pattern Analysis and Machine Intelligence, Vol. 16, pp. 214-224. Haupmann and Smith used the spatial context of text and the high contrast of text regions in scene images to merge large numbers of horizontal and vertical edges in spatial proximity to detect text. See, A. Haupmann, M. Smith, "Text, Speech, and Vision for Video Segmentation: The Informedia Project," in AAAI Symposium on Computational Models for Integrating Language and Vision, 1995. Shim et al. introduced a generalized region labeling algorithm to find homogeneous regions for text extraction. See, J. Shim, C. Dorai and M. Smith, "Automatic Text Extraction from Video for Content-Based Annotation and Retrieval," in Proc. ICPR, pp. 618-620, 1998. Manmatha presented an algorithm to detect and segment text as regions of distinctive texture, using a pyramid technique to handle text fonts of different sizes. See, W. Manmatha, "Finding Text in Images," in Proc. of ACM Int'l Conf. On Digital Libraries, 3-12. Lienhart and Stuber provided a split-and-merge algorithm based on characteristics of artificial text to segment text. See, R. Lienhart, "Automatic Text Recognition for Video Indexing," in Proc. Of ACM MM, pp. 11-20. Doermann and Kia used wavelet analysis and employed a multi-frame coherence approach to cluster edges into rectangular shapes. See, L. Doermann, O. Kia, "Automatic Text Detection and Tracking in Digital Video," in IEEE Trans. On Image Processing, Vol. 9, pp. 147-156. Sato et al. adopted a multi-frame integration technique to separate static text from a moving background. See, T. Sato, T. Kanade and S. Satoh, "Video OCR: Indexing Digital News Libraries by Recognition of Superimposed Captions," in Multimedia Systems, Vol. 7, pp. 385-394.

[0047] Finally, several compressed domain methods have also been proposed to detect text regions. Yeo and Liu proposed a method for the detection of text caption events in video by modified scene change detection, which cannot handle captions that gradually enter or disappear from frames. See, B. L. Yeo, "Visual Content Highlighting Via Automatic Extraction of Embedded Captions on MPEG Compressed Video," in SPIE/IS&T Symp. on Electronic Imaging Science and Technology, Vol. 2668, 1996. Zhong et al. examined the horizontal variations of AC values in DCT to locate text frames and examined the vertical intensity variation within the text regions to extract the final text frames. See, Y. Zhong, K. Karu and A. Jain, "Automatic captions localization in compressed video," in IEEE Trans. On PAMI, 22(4), pp. 385-392. Zhong derived a binarized gradient energy representation directly from DCT coefficients, which are subject to constraints on text properties and temporal coherence, to locate text. See, Y. Zhong, "Detection of text captions in compressed domain video," in Proc. Of Multimedia Information Retrieval Workshop ACM Multimedia'2000, pp. 201-204, November. However, most of the compressed domain methods restrict the detection of text to the I-frames of a video because it is time-consuming to obtain the AC values in DCT for inter-frame coded frames.

[0048] There is, therefore, a need in the art for a method and system that will enable the tagging of multimedia images for indexing, editing, searching and retrieving. There is also a need in the art to enable the indexing of textual information that is embedded in graphical images or other multimedia data so that the text in the image can also be tagged, indexed, searched and retrieved, as is other textual information. Further, there is also a need in the art for editing multimedia data for display, indexing, and searching in ways the prior art does not provide.

Summary of the Invention

[0049] The invention overcomes the above-identified problems as well as other shortcomings and deficiencies of existing technologies by providing

[0050] 1. Multimedia Bookmark

[0051] The present invention provides a system and method for accessing multimedia content stored in a multimedia file having a beginning and an intermediate point, the content having at least one segment at the intermediate point. At a minimum, the system includes a multimedia bookmark, the multimedia bookmark having content information about the segment at the intermediate point, wherein a user can utilize the multimedia bookmark to access the segment without accessing the beginning of the multimedia file.

[0052] The system of the present invention can include a wide area network such as the Internet. Moreover, the method of the present invention can facilitate the creating, storing, indexing, searching, retrieving and rendering of multimedia information on any device capable of connecting to the network and performing one or more of the aforementioned functions. The video content can be one or more frames of video, audio data, text data such as a string of characters, or any combination or permutation thereof.

[0053] The system of the present invention includes a search mechanism that locates a segment in the multimedia file. An access mechanism is included in the system that reads the multimedia content at the segment designated by the multimedia bookmark. The multimedia content can be partial data that are related to a particular segment.

[0054] The multimedia bookmark used in conjunction with the system of the present invention includes positional information about the segment. The positional information can be a URI, an elapsed time, a time code, or other information. While the multimedia file used in conjunction with the system of the present invention can be contained on local storage, it can also be stored at remote locations.
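As a minimal sketch of the positional information described above, a bookmark record might pair a URI with an elapsed time and derive a time code from it. The class name `MultimediaBookmark`, its fields, the example URI, and the 30 frames-per-second default are hypothetical choices for illustration only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MultimediaBookmark:
    uri: str                        # where the multimedia file lives
    elapsed_seconds: float          # offset of the bookmarked segment
    title: Optional[str] = None     # optional content information

    def time_code(self, fps: int = 30) -> str:
        """Format the offset as HH:MM:SS:FF for a given frame rate."""
        total_frames = round(self.elapsed_seconds * fps)
        frames = total_frames % fps
        seconds = total_frames // fps
        return "%02d:%02d:%02d:%02d" % (
            seconds // 3600, seconds % 3600 // 60, seconds % 60, frames)


mark = MultimediaBookmark("http://example.com/movie.mpg", 3725.5, "Chase scene")
code = mark.time_code()
```

A player holding such a record could seek directly to the stored offset, which is what lets a user reach the segment without starting from the beginning of the file.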

[0055] The system of the present invention can be a computer server that is operably connected to a network that has connected to it one or more client devices. Local storage on the server can optionally include a database and sufficient circuitry and/or logic, in the form of hardware and/or software in any combination that facilitates the storing, indexing, searching, retrieving and/or rendering of multimedia information.

[0056] The present invention further provides a methodology and implementation for adaptive refresh rewinding, as opposed to traditional rewinding, which simply performs a rewind from a particular position by a predetermined length. For simplicity, the exemplary embodiment described below will demonstrate the present invention using video data. Three essential parameters are identified to control the behavior of adaptive refresh rewinding, that is, how far to rewind, how to select certain frames in the rewind interval, and how to present the chosen refresh video frames on a display device.
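Two of the three parameters named above, how far to rewind and how to select frames in the rewind interval, can be illustrated with a short sketch. Even spacing is used here purely as an illustrative selection rule, and the function name and arguments are hypothetical. The third parameter, how the chosen frames are presented, is left out.

```python
def refresh_frames(position, rewind_length, count):
    """Pick `count` evenly spaced frame indices from the rewind interval.

    The interval is [position - rewind_length, position), clipped at frame 0.
    Even spacing is an illustrative selection rule, not the patent's.
    """
    start = max(0, position - rewind_length)
    span = position - start
    if span <= 0 or count <= 0:
        return []
    count = min(count, span)
    return [start + i * span // count for i in range(count)]


frames = refresh_frames(position=900, rewind_length=300, count=4)
```

An adaptive scheme would vary `rewind_length` and `count` with context, for example with how long playback has been paused, rather than fixing them in advance.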

[0057] The present invention also provides a new way to generate and deliver programming information that is customized to the user's viewing preferences. This embodiment of the present invention removes the navigational difficulties associated with EPG. Specifically, data regarding the user's habits of recording, scheduling, and/or accessing TV programs or Internet movies are captured and stored. Over a long period of time, these data can be analyzed and used to determine the user's trends or patterns that can be used to predict future viewing preferences.

[0058] The present invention also relates to techniques for solving the two PVR problems described above by downloading the metadata from a remote metadata server and then synchronizing/matching the content with the received metadata. While this invention is described in the context of video content stored on an STB having a PVR function, it can be extended to other multimedia content such as audio.

[0059] The present invention also allows the reuse of the content prerecorded on the analog VCR videotapes. Using the PVR function of STB, once the content of the VCR tape is converted into digital video and is stored on the hard disk on the STB, the present invention works equally well.

[0060] The present invention also provides a method for searching for relevant multimedia content based on at least one feature saved in a multimedia bookmark. The method preferably includes transmitting at least one feature saved in a multimedia bookmark from a client system to a server system in response to a user's selection of the multimedia bookmark. The server may then generate a query for each feature received and, subsequently, use each query generated to search one or more storage devices. The search results may be presented to the user upon completion.

[0061] In yet another embodiment, the present invention provides a method for verifying inclusion of attachments to electronic mail messages. The method preferably includes scanning the electronic mail message for at least one indicator of an attachment to be included and determining whether at least one attachment to the electronic mail message is present upon detection of the at least one indicator. In the event an indicator is present but an attachment is not, the method preferably also includes displaying a reminder to a user that no attachment is present.
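The attachment check described above can be sketched in a few lines. The list of indicator words is a hypothetical example, and a practical system would need a broader and language-aware list. The function name is likewise an assumption.

```python
import re

# Hypothetical indicator phrases; a real system would tune this list.
INDICATORS = re.compile(r"\b(attached|attachment|enclosed)\b", re.IGNORECASE)


def needs_attachment_reminder(body, attachments):
    """True when the text mentions an attachment but none is present."""
    return bool(INDICATORS.search(body)) and not attachments


remind = needs_attachment_reminder("Please see the attached report.", [])
```

A mail client would call such a check just before sending and display the reminder only when it returns true.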

[0062] In yet another embodiment, the present invention provides a method for searching for multimedia content in a peer to peer environment. The method preferably includes broadcasting a message from a user system to announce its entrance to the peer to peer environment. Active nodes in the peer to peer environment preferably acknowledge receipt of the broadcast message while the user system preferably tracks the active nodes. Upon initiation of a search request at the user system, a query message including multimedia features is preferably broadcast to the peer to peer environment. Upon receipt of the query message, a multimedia search engine on a multimedia database included in a storage device on one or more active nodes is preferably executed. A search results message including a listing of found filenames and network locations is preferably sent to the user system upon completion of the database search.

[0063] The present invention further provides a method for sending a multimedia bookmark between devices over a wireless network. The method preferably includes acknowledging receipt of a multimedia bookmark by a video bookmark message service center upon receipt of the multimedia bookmark from a sending device. After requesting and receiving routing information from a home location register, the video bookmark message service center preferably invokes a send multimedia bookmark operation at a mobile switching center. The mobile switching center then preferably sends the multimedia bookmark and, upon acknowledgement of receipt of the multimedia bookmark by the recipient device, notifies the video bookmark message service center of the completed multimedia bookmark transaction.

[0064] In another embodiment, the present invention provides a method for sending multimedia content over a wireless network for playback on a mobile device. In this embodiment, the mobile device preferably sends a multimedia bookmark and a request for playback to a mobile switching center. The mobile switching center then preferably sends the request and the multimedia bookmark to a video bookmark message service center. The video bookmark message service center then preferably determines a suitable bit rate for transmitting the multimedia content to the mobile device. Based on the bit rate and various characteristics of the mobile device, the video bookmark message service center also preferably calculates a new multimedia bookmark. The new multimedia bookmark is then sent to a multimedia server which streams the multimedia content to the video bookmark message service center before the multimedia content is delivered to the mobile device via the mobile switching center.

[0065] 2. Search

[0066] The present invention further provides a new approach to utilizing user-established relevance between images. Unlike conventional content-based and text-based approaches, the method of the present invention uses only direct links between images without relying on image descriptors such as low-level image features or textual annotations. Users provide relevance information in the form of relevance feedback, and the information is accumulated in each image's queue of links and propagated through linked images in a relevance graph. The collection of direct image links can be effective for the retrieval of subjectively similar images when they are gathered from a large number of users over a considerable period of time. The present invention can be used in conjunction with other content-based and text-based image retrieval methods.
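
A minimal sketch of how relevance feedback could be accumulated in per-image queues of links and followed through a relevance graph is given below. The bounded queue length, the bidirectional links, the hop limit, and all names (`RelevanceGraph`, `add_feedback`, `related`) are assumptions introduced for illustration and are not taken from the disclosure.

```python
from collections import defaultdict, deque


class RelevanceGraph:
    """Each image keeps a bounded queue of links to images that users
    marked as relevant to it (queue length is an assumed parameter)."""

    def __init__(self, queue_size=8):
        self.links = defaultdict(lambda: deque(maxlen=queue_size))

    def add_feedback(self, query_image, relevant_image):
        # Record the user-established relevance in both directions.
        self.links[query_image].append(relevant_image)
        self.links[relevant_image].append(query_image)

    def related(self, image, max_hops=2):
        """Follow direct links outward, breadth first, up to max_hops."""
        seen = {image}
        frontier = [image]
        result = []
        for _ in range(max_hops):
            next_frontier = []
            for node in frontier:
                for neighbor in self.links.get(node, ()):
                    if neighbor not in seen:
                        seen.add(neighbor)
                        result.append(neighbor)
                        next_frontier.append(neighbor)
            frontier = next_frontier
        return result
```

Note that no image descriptor of any kind is consulted here; retrieval depends only on the links gathered from users.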

[0067] The present invention also provides a new method for quickly finding, in a large database of images/frames, the objects that are close enough to a query image/frame under a certain distortion. Using the metric property of the distance function, information on LBG clustering, and a Haar-transform-based fast codebook search algorithm, which is also disclosed herein, the present invention reduces the number of distance evaluations at query time, resulting in fast retrieval of data objects from the database. Specifically, the present invention sorts and stores in advance the distances to a group of predefined distinguished points (called reference points) in the feature space and performs binary searches on the distances so as to speed up the search.

[0068] The present invention introduces an abstract multidimensional structure called a hypershell. More practically, the hypershell can be conceived as the set of all feature vectors in the feature space that lie at a distance of r ± ε from its corresponding reference point, where r is the distance between a query feature point and the reference point, and ε is a real number indicating the fidelity of search results. The intersection of such hypershells yields intersected regions that are often small partitions of the whole feature space. Therefore, instead of searching the whole feature space, the present invention performs the search only on the intersected regions to improve the search speed.
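
The reference-point search can be sketched as follows, under stated assumptions: Euclidean distance serves as the metric, and the class and method names (`HypershellIndex`, `search`) are hypothetical. The LBG clustering and Haar-transform codebook search mentioned in the disclosure are not shown. Distances to each reference point are sorted in advance, a binary search selects the points lying within r ± ε of each reference point, and the intersection of those shells is the only region in which distances to the query are actually evaluated.

```python
import bisect
import math


class HypershellIndex:
    """Illustrative reference-point index (names are assumptions)."""

    def __init__(self, points, reference_points):
        self.points = list(points)
        self.refs = list(reference_points)
        self.dists = []  # per reference point: sorted distances
        self.order = []  # per reference point: point indices in that order
        for ref in self.refs:
            pairs = sorted(
                (math.dist(p, ref), i) for i, p in enumerate(self.points)
            )
            self.dists.append([d for d, _ in pairs])
            self.order.append([i for _, i in pairs])

    def search(self, query, eps):
        """Return indices of points within eps of the query."""
        candidates = set(range(len(self.points)))
        for ref, dists, order in zip(self.refs, self.dists, self.order):
            r = math.dist(query, ref)
            # Binary search for the hypershell r - eps <= d <= r + eps.
            lo = bisect.bisect_left(dists, r - eps)
            hi = bisect.bisect_right(dists, r + eps)
            candidates &= set(order[lo:hi])
        # Evaluate true distances only inside the intersected region.
        return sorted(
            i for i in candidates
            if math.dist(self.points[i], query) <= eps
        )
```

Because the distance function is a metric, the triangle inequality places every point within ε of the query inside every shell, so restricting the final comparison to the intersection does not discard valid results.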

[0069] 3. Editing

[0070] The present invention further provides a new approach to editing video materials, in which it only virtually edits the metadata of input videos to create a new video, instead of actually editing videos stored as computer files. In the present invention, the virtual editing is performed either by copying the metadata of a video segment of interest in an input metafile or copying only the URI of the segment into a newly constructed metafile. The present invention provides a way of playing the newly edited video only with its metadata. The present invention also provides a system for the virtual editing. The present invention can be applied not only to videos stored on CD-ROM, DVD, and hard disk, but also to streaming videos over a network.

[0071] The present invention also provides a method for virtual editing of multimedia files. Specifically, one or more video files are provided. A metadata file is created for each of the video files, each of the metadata files having at least one segment to be edited. Thereafter, a single edited metafile is created that contains the segments that were to be edited from each of the metadata files so that, when the edited metadata file is accessed, the user is able to play the segments to be edited in the edited order.

[0072] The present invention also provides a method for virtual editing of multimedia files. Specifically, one or more video files are provided. A metadata file is created for each of the video files, each of the metadata files having at least one segment to be edited. Thereafter, a single edited metafile is created that contains links to the segments that were to be edited from each of the metadata files so that, when the edited metadata file is accessed, the user is able to play the segments to be edited in the edited order.

[0073] The present invention also includes a method for editing a multimedia file by providing a metafile, the metafile having at least one segment that is selectable; selecting a segment in the metafile; determining if a composing segment should be created, and if the composing segment should be created, then creating a composing segment in a hierarchical structure; specifying the composing segment as a child of a parent composing segment; determining if metadata is to be copied or if a URI is to be used; if the metadata is to be copied, then copying metadata of the selected segment to the component segment; if the URI is to be used, then writing a URI of the selected segment to the component segment; writing a URL of an input video file to the component segment; determining if all URLs of any sibling files are the same; and if the URL is the same as any of the sibling's URLs, then writing the URL to the parent composing segment and deleting the URLs of all sibling segments.

[0074] In a further embodiment, the method for editing a multimedia file includes determining if another segment is to be selected and if another segment is to be selected, then performing the step of selecting a segment in a metafile.

[0075] In yet a further embodiment of the method for editing a multimedia file, the method includes determining if another metafile is to be browsed and if another metafile is to be browsed, then performing the step of providing a metafile. The metafiles may be XML files or some other format.
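
The composing steps of the editing method may be illustrated by the sketch below, which is not the disclosed implementation. The `ComposingSegment` class, its field names, and the helper functions are hypothetical. The sketch also assumes that a URL previously moved to the parent is written back to the children when a later sibling refers to a different video file, a step the disclosure does not describe.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ComposingSegment:
    name: str
    url: Optional[str] = None          # URL of the input video file
    segment_uri: Optional[str] = None  # link to the segment in an input metafile
    metadata: Optional[dict] = None    # copied metadata of the segment
    children: List["ComposingSegment"] = field(default_factory=list)


def hoist_shared_url(parent):
    """If every sibling refers to the same video URL, store it once on the
    parent and delete it from the siblings."""
    effective = [c.url if c.url is not None else parent.url
                 for c in parent.children]
    same = bool(effective) and all(u == effective[0] for u in effective)
    if same and effective[0] is not None:
        parent.url = effective[0]
        for child in parent.children:
            child.url = None
    else:
        # Assumed step: undo an earlier hoist once the siblings differ.
        for child, url in zip(parent.children, effective):
            child.url = url
        parent.url = None


def add_selected_segment(parent, name, video_url, segment_uri,
                         metadata, copy_metadata):
    """Create a composing segment as a child of parent, either copying the
    metadata of the selected segment or writing only its URI."""
    child = ComposingSegment(name=name, url=video_url)
    if copy_metadata:
        child.metadata = dict(metadata)
    else:
        child.segment_uri = segment_uri
    parent.children.append(child)
    hoist_shared_url(parent)
    return child
```

Storing a shared URL once on the parent keeps the edited metafile compact when many segments are drawn from the same input video.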

[0076] The present invention also provides a virtual video editor in one embodiment. The virtual video editor includes a network controller constructed and arranged to access remote metafiles and remote video files and a file controller in operative connection to the network controller and constructed and arranged to access local metafiles and local video files, and to access the remote metafiles and the remote video files via the network controller. A parser constructed and arranged to receive information about the files from the file controller and an input buffer constructed and arranged to receive parser information from the parser are also included in the virtual video editor. Further, a structure manager constructed and arranged to provide structure data to the input buffer, a composing buffer constructed and arranged to receive input information from the input buffer and structure information from the structure manager to generate composing information, and a generator constructed and arranged to receive the composing information from the composing buffer are preferably included, wherein the generator generates output information in a pre-selected format.

[0077] In a further embodiment, the virtual video editor also includes a playlist generator constructed and arranged to receive structure information from the structure manager in order to generate playlist information and a video player constructed and arranged to receive the playlist information from the playlist generator and file information from the file controller in order to generate display information.

[0078] In yet a further embodiment, the virtual video editor also includes a display device constructed and arranged to receive the display information from the video player and to display the display information to a user.

[0079] In a further embodiment, the present invention provides a method for transcoding an image for display at multiple resolutions. Specifically, the method includes providing a multimedia file, designating one or more regions of the multimedia file as focus zones and providing a vector to each of the focus zones. The method continues by reading the multimedia file with a client device, the client device having a maximum display resolution, and determining if the resolution of the multimedia file exceeds the maximum display resolution of the client device. If the multimedia file resolution exceeds the maximum display resolution of the client device, the method determines the maximum number of focus zones that can be displayed on the client device. Finally, the method includes displaying the maximum number of focus zones on the client device.

[0080] 4. Transcoding

[0081] The present invention also provides a novel scheme for generating a transcoded (scaled and cropped) image to fit the size of the respective client display when an image is transmitted to a variety of client devices with different display sizes. The scheme has two key components: 1) a perceptual hint for each image block, and 2) an image transcoding algorithm. For a given semantically important block in an image, the perceptual hint provides information on the minimum allowable spatial resolution. That is, it provides quantitative information on how much the spatial resolution of the image can be reduced while ensuring that the user will perceive the transcoded image as the author or publisher wants to represent it. The image transcoding algorithm, which is basically a content adaptation process, selects the best image representation to meet the client capabilities while delivering the largest content value. The content adaptation algorithm is modeled as a resource allocation problem to maximize the content value.
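
One way to picture the resource allocation formulation is the exhaustive sketch below. It rests on several assumptions that are not part of the disclosure: each block carries a numeric content value and a perceptual hint expressed as a minimum allowable scale factor (`min_scale`), the transcoded image is the bounding box of the selected blocks scaled down to fit the display, and all subsets of blocks are enumerated, which is practical only for a handful of blocks.

```python
from itertools import combinations
from typing import NamedTuple


class Block(NamedTuple):
    x0: int
    y0: int
    x1: int
    y1: int
    value: float      # assumed content value of the block
    min_scale: float  # assumed perceptual hint: smallest allowed scale


def best_crop(blocks, display_w, display_h):
    """Return (value, scale, crop_box) with the largest total content value
    such that no selected block is scaled below its perceptual hint, or
    None if no block can be shown."""
    best = None
    for k in range(1, len(blocks) + 1):
        for subset in combinations(blocks, k):
            x0 = min(b.x0 for b in subset)
            y0 = min(b.y0 for b in subset)
            x1 = max(b.x1 for b in subset)
            y1 = max(b.y1 for b in subset)
            # Scale needed to fit the crop region on the display (never enlarge).
            scale = min(1.0, display_w / (x1 - x0), display_h / (y1 - y0))
            if scale < max(b.min_scale for b in subset):
                continue  # some block would fall below its minimum resolution
            candidate = (sum(b.value for b in subset), scale, (x0, y0, x1, y1))
            if best is None or candidate[:2] > best[:2]:
                best = candidate
    return best
```

On a small display the sketch crops to the most valuable block that remains legible; on a larger display it keeps more blocks at a reduced scale.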

[0082] 5. Visual Rhythm

[0083] One of the embodiments of the method of the present invention provides a fast and efficient approach for constructing visual rhythm. Unlike the conventional approaches, which decode all pixels composing a frame to obtain a certain group of pixel values using conventional video decoders, the present invention provides a method such that only a few of the pixels composing a frame are decoded to obtain the actual group of pixels needed for constructing visual rhythm. Most video compression schemes adopt intraframe and interframe coding to reduce spatial as well as temporal redundancies. Therefore, once the group of pixels is determined for constructing visual rhythm, one decodes only this group of pixels in frames which are not referenced by other frames for interframe coding. For frames referenced by other frames for interframe coding, one decodes the determined group of pixels for constructing visual rhythm as well as the few other pixels needed to decode this group of pixels in the frames that reference those frames. This allows fast generation of visual rhythm for its application to shot detection, caption text detection, or any other possible applications derived from it.
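
For orientation, the sketch below shows how a visual rhythm image can be assembled once the chosen group of pixels is available for each frame. It assumes, for illustration only, that the group of pixels is a diagonal line through the frame and that frames are already available as two-dimensional lists of pixel values. The partial decoding of compressed video, which is the subject of this embodiment, is not shown.

```python
def sample_diagonal(frame):
    """Sample one pixel per row along the main diagonal of a frame
    (the diagonal is an assumed sampling strategy)."""
    height, width = len(frame), len(frame[0])
    samples = []
    for row in range(height):
        col = row * (width - 1) // (height - 1) if height > 1 else 0
        samples.append(frame[row][col])
    return samples


def visual_rhythm(frames):
    """Stack the per-frame samples side by side: each column of the result
    corresponds to one frame, each row to one sampling position."""
    columns = [sample_diagonal(frame) for frame in frames]
    return [list(row) for row in zip(*columns)]
```

Because only the sampled pixels contribute to the result, a decoder that reconstructs just those pixels (and any pixels they depend on) suffices to produce the same image.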

[0084] Another embodiment of the method of the present invention provides an efficient and fast compressed-DCT-domain method to locate caption text regions in intra-coded and inter-coded frames through visual rhythm. First, the method relies on the observation that caption text generally tends to appear in certain areas of the video or in areas that are known a priori. Second, the method employs a combination of contrast and temporal coherence information on the visual rhythm to detect text frames, and uses information obtained through visual rhythm to locate caption text regions in the detected text frames along with their temporal duration within the video.

[0085] In one embodiment of the present invention, a content transcoder for modifying and forwarding multimedia content maintained in one or more multimedia content databases to a wide area network for display on a requesting client device is provided. In this embodiment, the content transcoder preferably includes a policy engine coupled to the multimedia content database and a content analyzer operably coupled to both the policy engine and the multimedia content database. The content transcoder of the present invention also preferably includes a content selection module operably coupled to both the policy engine and the content analyzer and a content manipulation module operably coupled to the content selection module. Finally, the content transcoder preferably includes a content analysis and manipulation library operably coupled to the content analyzer, the content selection module and the content manipulation module. In operation, the policy engine may receive a request for multimedia content from the requesting client device via the wide area network and policy information from the multimedia content database. The content analyzer may retrieve multimedia content from the multimedia content database and forward the multimedia content to the content selection module. The content selection module may select portions of the multimedia content based on the policy information and information from the content analysis and manipulation library and forward the selected portions of multimedia content to the content manipulation module. The content manipulation module may then modify the multimedia content for display on the requesting client device before transmitting the modified multimedia content over the wide area network to the requesting client device.

Features and advantages of the invention will be apparent from the following description of the embodiments, given for the purpose of disclosure and taken in conjunction with the accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

[0086] A more complete understanding of the present invention and advantages thereof may be acquired by referring to the following description taken in conjunction with the accompanying drawings, wherein:

[0087] FIG. 1 is an illustration of a conventional prior art bookmark.

[0088] FIG. 2 is an illustration of a multimedia bookmark in accordance with the present invention.

[0089] FIG. 3 is an illustration of exemplary searching for multimedia content relevant to the content information saved in the multimedia bookmark of the present invention, where both positional and content information are used.

[0090] FIG. 4 is an illustration of an exemplary tree structure used by two exemplary search methods in accordance with the present invention.

[0091] FIG. 5 is an example of five variations encoded by the present invention from the same source video content.

[0092] FIG. 6 is an example of two multimedia contents and their associated metadata of the present invention.

[0093] FIG. 7 is a list of example multimedia bookmarks of the present invention.

[0094] FIG. 8 is an illustration of an exemplary method of adjusting bookmarked positions in the durable bookmark system of the present invention.

[0095] FIG. 9 is an illustration of an exemplary user interface incorporating a multimedia bookmark of the present invention.

[0096] FIG. 10 is a flowchart illustrating an exemplary embodiment of a method of the present invention that is effective to implement the disclosed processing system.

[0097] FIG. 11 is a flowchart illustrating the overall process of saving and retrieving multimedia bookmarks of the present invention.

[0098] FIG. 12 is a flowchart illustrating an exemplary process of playing a multimedia bookmark of the present invention.

[0099] FIG. 13 is a flowchart illustrating an exemplary process of deleting a multimedia bookmark of the present invention.

[0100] FIG. 14 is a flowchart illustrating an exemplary process of adding a title to a multimedia bookmark of the present invention.

[0101] FIG. 15 is a flowchart illustrating an exemplary process of the present invention for searching for the relevant multimedia content based upon content, as well as textual information if available.

[0102] FIG. 16 is a flowchart illustrating an exemplary process of the present invention for sending a bookmark to other people via e-mail.

[0103] FIG. 17 is a flowchart illustrating an exemplary method of the present invention for e-mailing a multimedia bookmark of the present invention.

[0104] FIG. 18 is a block diagram illustrating an exemplary system for transmitting multimedia content to a mobile device using the multimedia bookmark of the present invention.

[0105] FIG. 19 is a block diagram illustrating an exemplary message signal arrangement of the present invention between a personal computer and a mobile device.

[0106] FIG. 20 is a block diagram illustrating an exemplary message signal arrangement of the present invention between two mobile devices.

[0107] FIG. 21 is a block diagram illustrating an exemplary message signal arrangement of the present invention between a video server and a mobile device.

[0108] FIG. 22 is a block diagram illustrating an exemplary data correlation method of the present invention.

[0109] FIG. 23 is a block diagram illustrating an exemplary swiping technique of the present invention.

[0110] FIG. 24 is a block diagram illustrating an alternate exemplary swiping technique of the present invention.

[0111] FIG. 25 is a flowchart illustrating an exemplary peer-to-peer exchange of the multimedia bookmark of the present invention.

[0112] FIG. 26 is a block diagram illustrating different sampling strategies.

[0113] FIG. 27 is a block diagram illustrating an exemplary visual rhythm method of the present invention.

[0114] FIG. 28 is a block diagram illustrating the localization and segmentation of text information according to the present invention.

[0115] FIG. 29 is a block diagram illustrating the use of an exemplary Haar transformation according to the present invention.

[0116] FIG. 30 is a block diagram illustrating an exemplary queue for image links of the present invention.

[0117] FIG. 31 is a block diagram illustrating an alternate exemplary queue for image links of the present invention.

[0118] FIGS. 32 (a) and (b) are block diagrams illustrating a comparison of a prior art video methodology and an exemplary editing method of the present invention.

[0119] FIG. 33 is a block diagram illustrating an exemplary segmentation and reconstruction of a new multimedia video presentation according to the method of the present invention.

[0120] FIG. 34 is a block diagram illustrating an exemplary edited multimedia file according to the present invention.

[0121] FIG. 35 is a flowchart of an exemplary method of the present invention for virtual video editing based on metadata.

[0122] FIG. 36 is an exemplary pseudocode implementation of the method of the present invention.

[0123] FIG. 37 is an exemplary pseudocode implementation of the method of the present invention.

[0124] FIG. 38 is an exemplary pseudocode implementation of the method of the present invention.

[0125] FIG. 39 is an exemplary pseudocode implementation of the method of the present invention.

[0126] FIG. 40 is an exemplary pseudocode implementation of the method of the present invention.

[0127] FIG. 41 is an exemplary pseudocode implementation of the method of the present invention.

[0128] FIG. 42 is a block diagram illustrating an exemplary virtual video editor of the present invention.

[0129] FIG. 43 is a block diagram illustrating an exemplary transcoding method of the present invention without SRR value.

[0130] FIG. 44 is a block diagram illustrating an exemplary transcoding method of the present invention with SRR value.

[0131] FIG. 45 is a block diagram illustrating an exemplary content transcoder of the present invention.

[0132] FIG. 46 is a block diagram illustrating an exemplary adaptive window focusing method of the present invention.

[0133] FIG. 47 is a block diagram and table illustrating image nodes and edges according to an exemplary method of the present invention.

[0134] FIG. 48 is a block diagram illustrating an exemplary hypershell search method of the present invention.

[0135] FIG. 49 is a block diagram illustrating the contents of an embodiment of the video bookmark of the present invention.

[0136] FIG. 50 is a block diagram illustrating the recommendation engine of the present invention.

[0137] FIG. 51 is a block diagram illustrating the video bookmark process of the present invention in conjunction with an EPG channel.

[0138] FIG. 52 is a block diagram illustrating the video bookmark process of the present invention in conjunction with a network.

[0139] FIG. 53 is a block diagram of the system of the present invention.

[0140] FIG. 54 is a block diagram of an exemplary relevance queue of the present invention.

[0141] FIG. 55 is a timeline diagram showing an exemplary embodiment of the rewind method of the present invention.

[0142] FIG. 56 is a timeline diagram showing an exemplary embodiment of the rewind method of the present invention.

[0143] FIG. 57 is a flowchart showing an exemplary embodiment of the retrieval method of the present invention.

[0144] FIG. 58 is a flowchart showing another exemplary embodiment of the retrieval method of the present invention.

[0145] FIG. 59 is a flowchart showing another exemplary embodiment of the retrieval method of the present invention.

[0146] FIG. 60 is a block diagram illustrating a hierarchical arrangement of images that exemplifies a navigation method of the present invention.

[0147] FIG. 61 is an illustration of a web page having an exemplary duration bar of the present invention.

[0148] FIG. 62 is an illustration of a web page having an exemplary duration bar of the present invention.

[0149] FIG. 63 is a diagram illustrating an exemplary hypershell search method of the present invention.

[0150] FIG. 64 is a diagram illustrating another exemplary hypershell search method of the present invention.

[0151] FIG. 65 is a diagram illustrating another exemplary hypershell search method of the present invention.

[0152] FIG. 66 is a diagram illustrating another exemplary hypershell search method of the present invention.

[0153] FIG. 67 is a diagram illustrating another exemplary hypershell search method of the present invention.

[0154] FIG. 68 is a block diagram illustrating an exemplary embodiment of the metadata server and metadata agent of the present invention.

[0155] FIG. 69 is a block diagram illustrating an alternate exemplary embodiment of the metadata server and metadata agent of the present invention.

[0156] FIG. 70 is a timeline comparison illustrating exemplary offset recording capability of the present invention.

[0157] FIG. 71 is a timeline comparison illustrating alternate exemplary offset recording capability of the present invention.

[0158] FIG. 72 is a timeline comparison illustrating exemplary interrupt recording capability of the present invention.

[0159] FIG. 73 is a timeline comparison illustrating the exemplary disparate and sequential recording capabilities of the present invention.

[0160] While the present invention is susceptible to various modifications and alternative forms, specific exemplary embodiments thereof have been shown by way of example in the drawings and are herein described in detail. It should be understood, however, that the description herein of specific embodiments is not intended to limit the invention to the particular forms disclosed, but, on the contrary, the intention is to cover all modifications, equivalents, and alternatives falling within the spirit and scope of the invention as defined by the appended claims.

DETAILED DESCRIPTION OF SPECIFIC EMBODIMENTS

[0161] FIG. 53 illustrates the system of the present invention. At the heart of the system of the present invention is a Wide Area Network 5350, exemplified and most famously embodied in the Internet. The present invention can be contained within the server 5314, as well as a series of clients such as Laptop 5322, Video Camera 5324, Telephone 5326, Digitizing Pad 5328, Personal Digital Assistant (PDA) 5330, Television 5332, Set Top Box 5340 (that is connected to and serves Television 5338), Scanner 5334, Facsimile Machine 5336, Automobile 5302, Truck 5304, Screen 5308, Work Station 5312, Satellite Dish 5310, and Communications Tower 5306, all useful for communications to or from remote devices for use with the system of the present invention. The present invention is particularly useful for set top boxes 5340. The set top boxes 5340 may be used as intermediate video servers for home networking, serving televisions, personal computers, game stations and other appliances. The server 5314 can be connected to an internal local area network via, for example, Ethernet 5316, although any type of communications protocol in a local area network or wide area network is possible for use with the present invention. Preferably, the local area network for the server 5314 has connections for data storage 5318, which can include database storage capability. The local area network connected to Ethernet 5316 may also hold one or more alternate servers 5320 for purposes of load balancing, performance, etc. The multimedia bookmarking scheme of the present invention can utilize the servers and clients of the system of the present invention, as illustrated in FIG. 53, for use in transferring data to or loading data from the servers through the Wide Area Network 5350.

[0162] In general, the present invention is useful for storing, indexing, searching, retrieving, editing, and rendering multimedia content over networks having at least one device capable of storing and/or manipulating an electronic file, and at least one device capable of playing the electronic file. The present invention provides various methodologies for tagging multimedia files to facilitate the indexing, searching, and retrieving of the tagged files. The tags themselves can be embedded in the electronic file, or stored separately in, for example, a search engine database. Other embodiments of the present invention facilitate the e-mailing of multimedia content. Still other embodiments of the present invention employ user preferences and user behavioral history that can be stored in a separate database or queue, or can also be stored in the tag related to the multimedia file in order to further enhance the rich search capabilities of the present invention.

[0163] Other aspects of the present invention include using hypershell and other techniques to read text information embedded in multimedia files for use in indexing, particularly tag indexes. Still more methods of the present invention enable the virtual editing of multimedia files by manipulating metadata and/or tags rather than editing the multimedia files themselves. Then the edited file (with rearranged tags and/or metadata) can be accessed in sequence in order to link seamlessly one or more multimedia files in the new edited arrangement.

[0164] Still other methods of the present invention enable the transcoding of images/videos so that they enable users to display images/videos on devices that do not have the same resolution capabilities as the devices for which the images/videos were originally intended. This allows devices such as, for example, PDA 5330, laptop 5322, and automobile 5302, to retrieve useable portions of the same image/video that can be displayed on, for example, workstation 5312, screen 5308, and television 5332.

[0165] Finally, the indexing methods of the present invention are enhanced by the unique modification of visual rhythm techniques that are part of other methods of the present invention. Modification of prior art visual rhythm techniques enables the system of the present invention to capture text information in the form of captions that are embedded into multimedia information, and even from video streams as they are broadcast, so that text information about the multimedia information can be included in the multimedia bookmarks of the present invention and utilized for storing, indexing, searching, retrieving, editing and rendering of the information.

[0166] 1. Multimedia Bookmark

[0167] The methods of the present invention described in this disclosure can be implemented, for example, in software on a digital computer having a processor that is operable with system memory and a persistent storage device. However, the methods described herein may also be implemented entirely in hardware, entirely in software, or in any combination thereof.

[0168] In general, after a multimedia content is analyzed automatically and/or annotated by a human operator, the results of analysis and annotation are saved as "metadata" with the multimedia content. The metadata usually include information describing the multimedia data content, such as distinctive characteristics of the data and the structure and semantics of the content. Some of the description provides information on the whole content, such as summary, bibliography and media format. However, in general, most of the description is structured around "segments" that represent spatial, temporal or spatial-temporal components of the audio-visual content. In the case of video content, the segment may be a single frame, a single shot consisting of successive frames, or a group of several successive shots. Low-level features and some elementary semantic information may describe each segment. Examples of such descriptions include color, texture, shape, motion, audio features and annotated texts.

[0169] If it is desired to generate metadata for several variations of a multimedia content, it would be natural to generate the metadata only for a single variation, called a master file, and then have the other variations share the same metadata. This sharing of metadata would save a lot of time and effort by skipping the time-consuming and labor-intensive work of generating multiple versions of metadata. In this case, the media positions (in terms of time points or bytes) contained in the metadata obtained with respect to the master file may not be directly applied to the other variations. This is because there may be mismatches of media positions between the master and the other variations if the master and the other variations do not start at the same position of the source content.
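
The mismatch can be made concrete with a short sketch, assuming that positions are time points in seconds and that the start position of each file within the source content is known. The function name and parameters are hypothetical, and the sketch is not the adjustment method of the disclosure.

```python
def to_variation_position(master_position, master_start, variation_start):
    """Map a media position measured in the master file to the matching
    position in another variation of the same source content.

    master_start and variation_start are the positions in the source
    content at which the master file and the variation begin (assumed known).
    Returns None when the position falls before the start of the variation.
    """
    source_position = master_start + master_position
    variation_position = source_position - variation_start
    if variation_position < 0:
        return None
    return variation_position
```

For example, if the master begins 10.0 seconds into the source and a variation begins 12.5 seconds into it, a position of 30.0 seconds in the master corresponds to 27.5 seconds in the variation.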

[0170] The method and system of the present invention include a tag that can contain information about all or a portion of a multimedia file. The tag can come in several varieties, such as text information embedded into the multimedia file itself, appended to the end of the multimedia file, or stored separately from the multimedia file on the same or remote network storage device.

[0171] Alternatively, the multimedia file has embedded within it one or more global unique identifiers (GUIDs). For example, each scene in a movie can be provided with its own GUID. The GUIDs can be indexed by a search engine and the multimedia bookmarks of the present invention can reference the GUID that is in the movie. Thus, multiple multimedia bookmarks of the present invention can reference the same GUID in a multimedia document without impacting the size of the multimedia document, or the performance of servers handling the multimedia document. Furthermore, the GUID references in the multimedia bookmarks of the present invention are themselves indexable. Thus, a search on a given multimedia document can prompt a search for all multimedia bookmarks that reference a GUID embedded within the multimedia file, providing a richer and more extensive resource for the user.
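
A minimal sketch of such an index follows. The class `BookmarkIndex`, its methods, and the use of string identifiers for bookmarks are assumptions for illustration. The point of the sketch is that any number of bookmarks can reference one GUID while the multimedia document itself stores only the GUID.

```python
import uuid
from collections import defaultdict


class BookmarkIndex:
    """Maps a GUID embedded in a multimedia file to the bookmarks
    that reference it (illustrative structure only)."""

    def __init__(self):
        self._by_guid = defaultdict(list)

    def add(self, bookmark_id, guid):
        self._by_guid[guid].append(bookmark_id)

    def bookmarks_for(self, guid):
        return list(self._by_guid.get(guid, []))

    def bookmarks_for_document(self, guids):
        """Collect all bookmarks referencing any GUID of one document."""
        found = []
        for guid in guids:
            found.extend(self.bookmarks_for(guid))
        return found


def new_scene_guid():
    # A random UUID stands in for the globally unique identifier of a scene.
    return str(uuid.uuid4())
```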

[0172] FIG. 2 shows a multimedia bookmark 210 of the present invention comprising positional information 212 and content information 214. The positional information 212 is used for accessing a multimedia content 204 starting from a bookmarked position 206. The content information 214 is used for visually displaying multimedia bookmarks in a bookmark list 208, as well as for searching one or more multimedia content databases for the content that matches the content information 214.

[0173] The positional information 212 may be composed of a URI, a URL, or the like, and a bookmarked position (relative time or byte position) within the content. For the purposes of this disclosure, a URI is synonymous with a position of a file and can be used interchangeably with a URL or other file location identifier. The content information 214 may be composed of audio-visual features and textual features. The audio-visual features are the information, for example, obtained by capturing or sampling the multimedia content 204 at the bookmarked position 206. The textual features are text information specified by the user(s), as well as delivered with the content. Other aspects of the textual features may be obtained by accessing metadata of the multimedia content.

[0174] In one embodiment of the multimedia bookmark 210 of the present invention, the positional information 212 is composed of a URI and a bookmarked position like an elapsed time, time code or frame number. The content information 214 is composed of audio-visual features, such as thumbnail image data of the captured video frame, and visual feature vectors like color histogram for one or more of the frames. The content information 214 of a multimedia bookmark 210 is also composed of such textual features as a title specified by a user as well as delivered with the content, and annotated text of a video segment corresponding to the bookmarked position.

[0175] In the case of an audio bookmark of the present invention, the positional information 212 is composed of a URI, a URL, or the like, and a bookmarked position such as elapsed time. Similarly, the content information 214 is composed of audio-visual features such as the sampled audio signal (typically of short duration) and its visualized image. The content information 214 of an audio bookmark 210 is also composed of such textual features as a title, optionally specified by a user or simply delivered with the content, and annotated text of an audio segment corresponding to the bookmarked position. In the case of a text bookmark 210, the positional information 212 is composed of a URI, URL, or the like, and an offset from the starting point of a text document. The offset can be of any size, but is normally about a byte in size. The content information 214 is composed of a sampled text string present at the bookmarked position, and text information specified by user(s) and/or delivered with the content, such as the title of the text document.
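One possible in-memory layout for the video, audio, and text bookmarks described above is sketched below. The field names, the `unit` label, and the example URIs are assumptions made for illustration; the disclosure specifies only that a bookmark carries positional information and content information.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class PositionalInfo:
    uri: str                 # location of the multimedia file
    position: float          # elapsed time, frame number, or byte offset
    unit: str = "seconds"    # assumed label for how `position` is measured


@dataclass
class ContentInfo:
    # Audio-visual features sampled at the bookmarked position.
    thumbnail: Optional[bytes] = None
    color_histogram: List[float] = field(default_factory=list)
    audio_sample: Optional[bytes] = None
    # Textual features supplied by the user or delivered with the content.
    sampled_text: Optional[str] = None
    title: Optional[str] = None
    annotation: Optional[str] = None


@dataclass
class MultimediaBookmark:
    positional: PositionalInfo
    content: ContentInfo


# A video bookmark: elapsed time plus a captured-frame histogram and a title.
video_bookmark = MultimediaBookmark(
    PositionalInfo("http://example.com/movie.asf", 754.0),
    ContentInfo(color_histogram=[0.5, 0.3, 0.2], title="Example movie"),
)

# A text bookmark: byte offset plus the text string sampled at that offset.
text_bookmark = MultimediaBookmark(
    PositionalInfo("http://example.com/report.txt", 2048, unit="bytes"),
    ContentInfo(sampled_text="sampled text string", title="Example report"),
)
```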

[0176] FIG. 3 shows an illustration of searching for multimedia contents that are relevant to the content information 314 (that correlates to element 214 of FIG. 2) that is stored in the multimedia bookmark 310 (that correlates to element 210 of FIG. 2) of the present invention where both positional and content information are used. The content information 314 comprises audio-visual features 320, such as a captured frame 322 and sampled audio data 324, and textual features 326, such as annotated text 328 and a title 330. There are many cases where a bookmark that relies only on positional information, such as a URI and an elapsed time, as conventional bookmarks do, may not be valid. For example, if a bookmark were generated during the preview of a multimedia content broadcast, the bookmark would not be valid for viewing a full version of the broadcast. If a bookmark were saved during a live Internet broadcast, the bookmark would not be valid for viewing an edited version of the live broadcast. Further, if a user wanted to access the bookmarked multimedia content from another site that also provides the content, even positional information such as the URI would not be valid.

[0177] To solve the problems described in the background section, the present invention uses content information 314 (element 214 of FIG. 2) that is saved in the multimedia bookmark to obtain the actual positional information of the last-visited segment by searching the multimedia database 310 using the content information 314 as a query input. Content information characteristics such as captured frame 322, sampled audio data 324, annotated text of the segment corresponding to a bookmarked position 328, and the title delivered with the content 330 can be used as query input to a multimedia search engine 332. The multimedia search engine searches its multimedia database 310 by performing content-based and/or text-based multimedia searches, and finds the relevant positions of multimedia contents. The search engine then retrieves a list of relevant segments 334 with their positional information such as URI, URL and the like, and the relative position. With a multimedia player 336, a user can start playing from the retrieved segments of the contents. The retrieved segments 334 are usually those segments having contents relevant or similar to the content information saved in the multimedia bookmark.
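The recovery of positional information from content information can be illustrated with a minimal sketch. It assumes, purely for illustration, that each database segment is described by a color histogram, that the database is scanned linearly, and that relevance is judged by L1 distance against a hypothetical `max_distance` cutoff. The names `Segment` and `find_relevant_segments` and the URIs are not part of this disclosure.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Segment:
    uri: str
    start_seconds: float
    histogram: Sequence[float]


def l1_distance(a, b):
    """Sum of absolute differences between two feature vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def find_relevant_segments(query_histogram, database, max_distance):
    """Return segments whose histogram is close to the query, best match first."""
    scored = [(l1_distance(query_histogram, s.histogram), s) for s in database]
    matches = [(d, s) for d, s in scored if d <= max_distance]
    matches.sort(key=lambda pair: pair[0])
    return [s for _, s in matches]


# Histogram of the frame captured when the bookmark was saved.
bookmark_histogram = [0.5, 0.3, 0.2]

database = [
    Segment("http://example.com/full.asf", 95.0, [0.48, 0.32, 0.2]),
    Segment("http://example.org/mirror.rm", 12.0, [0.5, 0.3, 0.2]),
    Segment("http://example.com/full.asf", 300.0, [0.1, 0.1, 0.8]),
]

results = find_relevant_segments(bookmark_histogram, database, max_distance=0.1)
```

Each returned segment carries its own URI and relative position, so a player can start from the retrieved position even when the URI and elapsed time stored in the bookmark no longer apply.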

[0178] FIG. 4 illustrates an embodiment of a key frame hierarchy used by a search method of the multimedia search engine 332 (see FIG. 3) in accordance with the present invention. The method arranges key frames in a hierarchical fashion to enable fast and accurate searching of frames similar to a query image.

[0179] The key frame hierarchy illustrated in FIG. 4 is a tree-structured representation for multi-level abstraction of a video by key frames, where each node denotes a key frame. A number Df is associated with each node and represents the maximum distance between the low-level feature vector of the node 414 and those of the descendant nodes in its subtree (for example, nodes 416 and 418). An example of such a feature vector is the color histogram of a frame. If a video database composed of one or more key frame hierarchies, which correspond to different video sequences, must be searched to find a specific query image fq, the dissimilarity between fq and a subtree rooted at the key frame fm is measured by testing d(fq, fm) > Df + e, where d(fq, fm) is a distance metric measuring dissimilarity, such as the L1 norm between feature vectors, and e is a threshold value set by a user. If the condition is satisfied, searching of the subtree rooted at the node fm is skipped (i.e., the subtree is "pruned" from the search). This method of the present invention reduces the search time substantially by pruning out the unnecessary comparison steps.
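The pruning test can be sketched as a recursive tree search. The sketch below assumes integer histogram bins as feature vectors, the L1 distance as the metric d, and a rule that a key frame matches when its distance to the query is at most e; that match rule, the node names, and the feature values are illustrative assumptions. The Df values are set by hand to the maximum L1 distance from each node to its descendants.

```python
from dataclasses import dataclass, field
from typing import List, Sequence


@dataclass
class KeyFrameNode:
    name: str
    feature: Sequence[int]
    max_descendant_distance: int = 0   # Df; zero for a leaf
    children: List["KeyFrameNode"] = field(default_factory=list)


def l1_distance(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def search(node, query, e, matches=None, visited=None):
    """Collect key frames within distance e of the query, pruning subtrees."""
    if matches is None:
        matches = []
    if visited is None:
        visited = []
    visited.append(node.name)
    d = l1_distance(query, node.feature)
    if d > node.max_descendant_distance + e:
        # By the triangle inequality no frame in this subtree can be within e.
        return matches, visited
    if d <= e:
        matches.append(node.name)
    for child in node.children:
        search(child, query, e, matches, visited)
    return matches, visited


tree = KeyFrameNode("root", [5, 5], 10, [
    KeyFrameNode("A", [9, 1], 2, [
        KeyFrameNode("A1", [10, 0]),
        KeyFrameNode("A2", [8, 2]),
    ]),
    KeyFrameNode("B", [1, 9], 2, [
        KeyFrameNode("B1", [0, 10]),
        KeyFrameNode("B2", [2, 8]),
    ]),
])

matches, visited = search(tree, query=[0, 10], e=1)
```

In this example the subtree rooted at A is skipped after a single comparison, because d = 18 exceeds Df + e = 3, so A1 and A2 are never compared with the query.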

[0180] Durable Multimedia Bookmark using Offset and Time Scale

[0181] FIG. 5 shows an example of five variations encoded from the same source video content 502. FIG. 5 shows two ASF format files 504, 506 with the bandwidths of 28.8 and 80 kbps that start and end exactly at the same time points. FIG. 5 also shows the first RM format file 508 with the bandwidth of 80 kbps. In the RM file 508, encoding of the source content starts at the time interval o.sub.1 before the start time point of the ASF files 504, 506, and ends at the time interval o.sub.4 before the end time point of the ASF files 504 and 506. The RM file 508 thus has an extra video segment with the duration of o.sub.1 at the beginning. Consequently, compared with a start time point of a specific video segment 514 in the ASF files, the start time point of the video segment in the RM file is temporally shifted right by the time interval o.sub.1. The start time point of the video segment in the RM file can be computed by adding the time interval o.sub.1 to the start time point of the video segment in the ASF files. Similarly, the second RM file 510 with the bandwidth of 28.8 kbps does not have a leading video segment with the duration of o.sub.2. The start time point of the video segment 514 in the second RM file can be computed by subtracting the time interval o.sub.2 from the start time point of the video segment in the ASF files. Also, the MOV file 512 with the smart bandwidth of 56 kbps has two extra segments with the duration of o.sub.3 and o.sub.6, respectively.

[0182] In another example, designate one of the different variations encoded with the same source multimedia content as the master file, and the other variations as slave files. In the example illustrated in FIG. 5, the ASF file encoded at the bandwidth of 80 kbps 504 is to be the master file, and the other four files are slave files. In this example, an offset of a slave file will be the difference of positions, in time duration or byte offset, between a start position of a master file and a start position of the slave file. In this example, the position differences o.sub.1, o.sub.2, and o.sub.3 are offsets. The offset of a slave file is computed by subtracting the start position of the slave file from the start position of the master file. In this formula, the two start positions are measured with respect to the source content. Thus, the offset will have a positive value if the start position of a slave occurred before the start position of a master with reference to the source content. Conversely, the offset will have a negative value if the start position of a slave occurred after the start position of a master. For the example shown in FIG. 5, the offsets o.sub.1 and o.sub.3 are positive values, and o.sub.2 is negative. Although not specifically required, by convention an offset of a master file is set to zero.
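The offset rule can be checked with a short worked example. The start times below are hypothetical values in seconds on the source timeline and do not come from FIG. 5; only the sign convention (master start minus slave start) and the shift of a position by the offset follow the description above.

```python
def offset_of_slave(master_start, slave_start):
    """Offset = master start - slave start, both measured on the source timeline."""
    return master_start - slave_start


def master_to_slave(master_position, offset):
    """Shift a position in the master file to the same content in a slave file."""
    return master_position + offset


MASTER_START = 10.0  # hypothetical: master begins 10 s into the source

o1 = offset_of_slave(MASTER_START, 7.0)    # slave starts earlier -> positive
o2 = offset_of_slave(MASTER_START, 12.0)   # slave starts later   -> negative
o3 = offset_of_slave(MASTER_START, 5.0)    # slave starts earlier -> positive

# A segment that begins 40 s into the master file:
in_first_slave = master_to_slave(40.0, o1)    # shifted right by the extra lead-in
in_second_slave = master_to_slave(40.0, o2)   # shifted left by the missing lead-in
```

With these numbers the segment lies 50 s into the source content, which is 43 s into a slave that starts at 7 s and 38 s into a slave that starts at 12 s.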

[0183] Consider the different variations encoded from the same source multimedia content. A user generates a multimedia bookmark with respect to one of the variations, which is called a bookmarked file. Then, the multimedia bookmark is used at a later time to play one of the variations, which is called a playback file. In other words, the bookmarked file pointed to by the multimedia bookmark, and the playback file selected by the user, may not be the same variation, but refer to the same multimedia content.

[0184] If there is only one variation encoded from the original content, both the bookmarked and the playback files should be the same. However, if there are multiple variations, a user can store a multimedia bookmark for one variation and later play another variation by using the saved bookmark. The playback may not start at the last accessed position because there may be mismatches of positions between the bookmarked and the playback files.

[0185] Associated with a multimedia content are metadata containing the offsets of the master and slave variations of the multimedia content in the form of media profiles. Each media profile corresponds to a different variation that can be produced from a single source content depending on the values chosen for the encoding formats, bandwidths, resolutions, etc. Each media profile of a variation contains at least a URI and an offset of the variation. Each media profile of a variation optionally contains a time scale factor of the media time of the variation encoded in different temporal data rates with respect to its master variation. The time scale factor is specified on a zero to one scale where a value of one indicates the same temporal data rate, and 0.5 indicates that the temporal data rate of the variation is reduced by half with respect to the master variation.
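A media profile and its use for moving a bookmarked position from one variation to another might be sketched as follows. The `MediaProfile` field names, the URIs, and the offset values are hypothetical. The sketch handles offsets only: how the time scale factor enters the conversion is not modeled here, so variations with a factor other than one are rejected rather than guessed at.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MediaProfile:
    uri: str
    offset: float            # seconds; zero for the master by convention
    time_scale: float = 1.0  # optional factor relative to the master


def convert_position(position, bookmarked, playback):
    """Map a position in the bookmarked file to the playback file via the master."""
    if bookmarked.time_scale != 1.0 or playback.time_scale != 1.0:
        raise ValueError("this sketch only handles a time scale factor of 1.0")
    master_position = position - bookmarked.offset
    return master_position + playback.offset


profiles = {
    "asf_master": MediaProfile("http://example.com/clip_80k.asf", 0.0),
    "rm_first": MediaProfile("http://example.com/clip_80k.rm", 3.0),
    "rm_second": MediaProfile("http://example.com/clip_28k.rm", -2.0),
}

# Bookmark saved 43 s into the first RM file, later played from the second RM file.
resume_at = convert_position(43.0, profiles["rm_first"], profiles["rm_second"])
```

Routing the conversion through the master position means each profile needs only its own offset, rather than one offset for every pair of variations.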

[0186] Table 1 is an example metadata for the five variations in FIG. 5. The metadata is written according t