United States Patent 5850352
Moezzi, et al. December 15, 1998

Title

Immersive video, including video hypermosaicing to generate from multiple video views of a scene a three-dimensional video mosaic from which diverse virtual video scene images are synthesized, including panoramic, scene interactive and stereoscopic images

Abstract

Immersive video, or television, images of a real-world scene are synthesized, including on demand and/or in real time, as are linked to any of a particular perspective on the scene, or an object or event in the scene. Synthesis is in accordance with user-specified parameters of presentation, including presentations that are any of panoramic, magnified, stereoscopic, or possessed of motional parallax. The image synthesis is based on computerized video processing--called "hypermosaicing"--of multiple video perspectives on the scene. In hypermosaicing a knowledge database contains information about the scene; for example scene geometry, shapes and behaviors of objects in the scene, and/or internal and/or external camera calibration models. Multiple video cameras each at a different spatial location produce multiple two-dimensional video images of the scene. A viewer/user specifies viewing criterion (ia) at a viewer interface. A computer, typically one or more engineering work station class computers or better, includes in software and/or hardware (i) a video data analyzer for detecting and for tracking scene objects and their locations, (ii) an environmental model builder combining multiple scene images to build a 3D dynamic model recording scene objects and their instant spatial locations, (iii) a viewer criterion interpreter, and (iv) a visualizer for generating from the 3D model in accordance with the viewing criterion one or more selectively synthesized 2D video image(s) of the scene.


Inventors: Moezzi; Saied (San Diego, CA), Katkere; Arun (La Jolla, CA), Jain; Ramesh (San Diego, CA)
Assignee: The Regents of the University of California (Alameda, CA)
Appl. No.: 554848
Filed: November 6, 1995

Current U.S. Class: 345/419 725/34
Field of Search: 364/514A,410 348/13,19,42,51,48 273/433,441 395/119,125,155,129,162

U.S. Patent Documents
5490239    February 1996    Myers
5495576    February 1996    Ritchey
Primary Examiner: Voeltz; Emanuel T.
Assistant Examiner: Peeso; Thomas
Attorney, Agent or Firm: Fuess & Davidenas

Parent Case Text



REFERENCE TO A RELATED PATENT APPLICATION

The present patent application is a continuation-in-part of U.S. patent application Ser. No. 08/414,437 filed on Mar. 31, 1995 to inventors Ramesh Jain and Koji Wakimoto for MACHINE DYNAMIC SELECTION OF ONE VIDEO CAMERA/IMAGE OF A SCENE FROM MULTIPLE VIDEO CAMERAS/IMAGES OF THE SCENE IN ACCORDANCE WITH A PARTICULAR PERSPECTIVE ON THE SCENE, AN OBJECT IN THE SCENE, OR AN EVENT IN THE SCENE. The contents of the related predecessor patent application are incorporated herein by reference.

Claims


What is claimed is:
1. A method of telepresence, being a video representation of being at real-world scene that is other than the instant scene of the viewer, the method comprising:
capturing video of a real-world scene from each of a multiplicity of different spatial perspectives on the scene;
predetermining a fixed framework of the scene as to the boundaries of the scene and selected fixed points of reference within the scene, the fixed framework and fixed reference points potentially but not necessarily coinciding with landmark objects in the scene if, indeed, any such landmark objects even exist;
creating from the captured video in consideration of the predetermined fixed framework a full three-dimensional model of the scene, the three-dimensional model being distinguished in that three-dimension occurrences in the scene are incorporated into the model regardless of that they should not have been pre-identified to the model;
producing from the three-dimensional model a video representation on the scene that is in accordance with the desired perspective on the scene of a viewer of the scene, thus immersive telepresence because the viewer can view the scene as if immersed therein, and as if present at the scene, all in accordance with his/her desires;
wherein the representation is called immersive telepresence because it appears to the viewer that, since the scene is presented as the viewer desires, the viewer is immersed in the scene;
wherein the viewer-desired perspective on the scene, and the video representation in accordance with this viewer-desired perspective, need not be in accordance with any of the captured video.

2. The method of immersive telepresence according to claim 1
wherein the video representation is stereoscopic;
wherein stereoscopy is, normally and conventionally, a three-dimensional effect where each of the viewer's two eyes sees a slightly different view on the scene, making the viewer's brain to comprehend that the viewed scene is three-dimensional even should the viewer not move his/her head or eyes in spatial position.
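The stereoscopic effect recited above can be illustrated with a minimal sketch. It assumes an idealized pinhole model with both eyes on the x axis looking along +z; the names FOCAL_LENGTH, EYE_SEPARATION, project, stereo_pair and disparity are hypothetical and do not appear in the patent.

```python
# Sketch only: pinhole projection of one scene point for a left and a right eye.
# FOCAL_LENGTH and EYE_SEPARATION are illustrative assumptions, not patent values.

FOCAL_LENGTH = 1.0      # assumed focal length, arbitrary units
EYE_SEPARATION = 0.065  # assumed interocular distance, in metres


def project(point, eye_x):
    """Project a 3D point (x, y, z) for an eye at (eye_x, 0, 0) looking along +z."""
    x, y, z = point
    if z <= 0:
        raise ValueError("point must lie in front of the eye")
    return (FOCAL_LENGTH * (x - eye_x) / z, FOCAL_LENGTH * y / z)


def stereo_pair(point):
    """Return (left_image_xy, right_image_xy) for one scene point."""
    half = EYE_SEPARATION / 2.0
    return project(point, -half), project(point, +half)


def disparity(point):
    """Horizontal offset between the left-eye and right-eye image of the point."""
    left, right = stereo_pair(point)
    return left[0] - right[0]
```

Under these assumptions the disparity equals FOCAL_LENGTH * EYE_SEPARATION / z, so nearer points show a larger left/right offset than distant ones, which is the cue that lets the two slightly different views be perceived as depth.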

3. A method of immersive telepresence, being a video representation of being at real-world scene that is other than the instant scene of the viewer, the method comprising:
capturing video of a real-world scene from each of a multiplicity of different spatial perspectives on the scene;
creating from the captured video a full three-dimensional model of the scene;
producing from the three-dimensional model a video representation on the scene that is in accordance with the desired perspective on the scene of a viewer of the scene, thus immersive telepresence because the viewer can view the scene as if immersed therein, and as if present at the scene, all in accordance with his/her desires;
wherein the representation is called immersive telepresence because it appears to the viewer that, since the scene is presented as the viewer desires, the viewer is immersed in the scene;
wherein the viewer-desired perspective on the scene, and the video representation in accordance with this viewer-desired perspective, need not be in accordance with any of the captured video;
wherein the video representation is in accordance with the position and direction of the viewer's eyes and head, and exhibits motional parallax;
wherein motional parallax is, normally and conventionally, a three-dimensional effect where different views on the scene are produced as the viewer moves position even should the viewer have but one eye, making the viewer's brain to comprehend that the viewed scene is three-dimensional.
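Motional parallax as recited above can be sketched for a single moving eye. This is an illustration under an assumed pinhole model in a top-down (x, z) plane; the function names image_x and parallax_shift and the default focal length are hypothetical.

```python
# Sketch only: how far the image of a point moves when one eye translates sideways.

def image_x(point_xz, eye_x, focal_length=1.0):
    """Horizontal image coordinate of a point (x, z) seen from an eye at x = eye_x."""
    x, z = point_xz
    if z <= 0:
        raise ValueError("point must lie in front of the eye")
    return focal_length * (x - eye_x) / z


def parallax_shift(point_xz, eye_x_before, eye_x_after, focal_length=1.0):
    """Change in image position caused by moving the eye from one x to another."""
    return (image_x(point_xz, eye_x_after, focal_length)
            - image_x(point_xz, eye_x_before, focal_length))
```

For example, with the eye moving 0.1 units to the right, a point at depth 2 shifts by -0.05 in the image while a point at depth 20 shifts by only -0.005. The differing shifts of near and far points are what convey depth even to a one-eyed viewer.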

4. A method of telepresence, being a video representation of being at real-world scene that is other than the instant scene of the viewer, the method comprising:
capturing video of a real-world scene from a multiplicity of different spatial perspectives on the scene;
creating from the captured video a full three-dimensional model of the scene;
producing from the three-dimensional model a video representation on the scene responsively to a predetermined criterion selected from among criteria including an object in the scene and an event in the scene, thus interactive telepresence because the presentation to the viewer is interactive in response to the criterion;
wherein the video presentation of the scene in accordance with the criterion need not be in accordance with any of the captured video.

5. The method of viewer-interactive telepresence according to claim 4
wherein the video representation is in response to a criterion selected by the viewer, thus viewer-interactive telepresence.

6. The method of viewer-interactive telepresence according to claim 5 wherein the presentation is in response to the position and direction of the viewer's eyes and head, and exhibits motional parallax.

7. The method of viewer-interactive telepresence according to claim 5 wherein the presentation exhibits stereoscopy.

8. An immersive video system for presenting video images of a real-world scene in accordance with a predetermined criterion, the system comprising:
a knowledge database containing information about the spatial framework of the real-world scene;
multiple video sources each at a different spatial location for producing multiple two-dimensional video images of a real-world scene each at a different spatial perspective;
a viewer interface at which a prospective viewer of the scene may specify a criterion relative to which criterion the viewer wishes to view the scene;
a computer, receiving the multiple two-dimensional video images of the scene from the multiple video cameras and the viewer-specified criterion from the viewer interface, the computer for calculating in accordance with the spatial framework of the knowledge database as
a video data analyzer for detecting and for tracking objects of potential interest and their locations in the scene,
an environmental model builder for combining multiple individual video images of the scene to build a three-dimensional dynamic model of the environment of the scene within which three-dimensional dynamic environmental model potential objects of interest in the scene are recorded along with their instant spatial locations, and
a viewer criterion interpreter for correlating the viewer-specified criterion with the objects of interest in the scene, and with the spatial locations of these objects, as recorded in the dynamic environmental model in order to produce parameters of perspective on the scene, and
a visualizer for generating, from the three-dimensional dynamic environmental model in accordance with the parameters of perspective, a particular two-dimensional video image of the scene; and
a video display, receiving the particular two-dimensional video image of the scene from the computer, for displaying this particular two-dimensional video image of the real-world scene to the viewer as that particular view of the scene which is in satisfaction of the viewer-specified criterion.
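The role of the viewer criterion interpreter in the system above can be suggested by a small sketch. It is not the patented implementation: the classes TrackedObject and Perspective, the function interpret_criterion, and the "stand back and above" placement rule are all hypothetical, chosen only to show a criterion being correlated with a recorded object location to yield parameters of perspective.

```python
# Sketch only: turn a viewer-specified object name into viewing parameters.
from dataclasses import dataclass


@dataclass
class TrackedObject:
    name: str
    position: tuple  # (x, y, z) in world coordinates, as recorded in the model


@dataclass
class Perspective:
    eye: tuple      # where the virtual camera is placed
    look_at: tuple  # the point the virtual camera is aimed at


def interpret_criterion(model, criterion, stand_off=5.0):
    """Look up the named object in the model and return a Perspective on it.

    `model` maps object names to TrackedObject records. The eye placement
    (stand_off units back along y and up along z) is an assumed rule.
    """
    obj = model[criterion]  # raises KeyError if the object is not in the model
    x, y, z = obj.position
    return Perspective(eye=(x, y - stand_off, z + stand_off), look_at=obj.position)
```

A visualizer would then render the three-dimensional model from the returned eye position toward the look-at point; that rendering step is outside this sketch.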

9. An immersive video system for presenting video images of a real-world scene in accordance with a predetermined criterion, the system comprising:
multiple video sources each at a different spatial location for producing multiple two-dimensional video images of a real-world scene each at a different spatial perspective;
a knowledge database containing information about the real-world scene regarding at least two of
the geometry of the real-world scene,
potential shapes of objects in the real-world scene,
dynamic behaviors of objects in the real-world scene, and
a camera calibration model;
a viewer interface at which a prospective viewer of the scene may specify a criterion relative to which criterion the viewer wishes to view the scene;
a computer, receiving the multiple two-dimensional video images of the scene from the multiple video cameras and the viewer-specified criterion from the viewer interface, the computer operating in consideration of the knowledge database and including
a video data analyzer for detecting and for tracking objects of potential interest and their locations in the scene,
an environmental model builder for combining multiple individual video images of the scene to build a three-dimensional dynamic model of the environment of the scene within which three-dimensional dynamic environmental model potential objects of interest in the scene are recorded along with their instant spatial locations, and
a viewer criterion interpreter for correlating the viewer-specified criterion with the objects of interest in the scene, and with the spatial locations of these objects, as recorded in the dynamic environmental model in order to produce parameters of perspective on the scene, and
a visualizer for generating, from the three-dimensional dynamic environmental model in accordance with the parameters of perspective, a particular two-dimensional video image of the scene; and
a video display, receiving the particular two-dimensional video image of the scene from the computer, for displaying this particular two-dimensional video image of the real-world scene to the viewer as that particular view of the scene which is in satisfaction of the viewer-specified criterion.

10. The immersive video system according to claim 9 wherein the knowledge database contains data regarding each of
the geometry of the real-world scene,
potential shapes of objects in the real-world scene,
dynamic behaviors of objects in the real-world scene, and
a camera calibration model.

11. The immersive video system according to claim 9 wherein the camera calibration model of the knowledge database includes at least one of
an internal camera calibration model, and
an external camera calibration model.

12. An improvement to the method of video mosaicing, which video mosaicing method uses video frames from a video stream of a single video camera panning a scene, or, equivalently, the video frames from each of multiple video cameras each of which images only a part of the scene, in order to produce a larger video scene image than any single video frame from any single video camera,
the improved method being directed to generating a spatial-temporally coherent and consistent three-dimensional video mosaic from multiple individual video streams arising from each of multiple video cameras each of which is imaging at least a part of the scene from a perspective that is at least in part different from other ones of the multiple video cameras,
the improved method being called video hypermosaicing,
the video hypermosaicing method being applied to scenes where at least a portion of the scene from the perspective of at least one camera is static, which limitation is only to say that absolutely everything in every part of the scene as is imaged to each of the multiple video cameras cannot be simultaneously in dynamic motion, the video hypermosaicing comprising:
accumulating and storing as a priori information the static portion of the scene as a CSG/CAD model of the scene; and
processing, in consideration of the CSG/CAD model of the scene, dynamic portions of the scene, only, from the multiple video streams of the multiple video cameras so as to develop a spatial-temporally coherent and consistent three-dimensional video mosaic of the scene;
wherein the processing of static portions of the scene is bypassed;
wherein bypassing of processing the static portions of the scene reduces the complexity of processing the scene.
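The saving obtained by bypassing static portions can be illustrated as follows. This sketch substitutes a stored background image for the CSG/CAD model of the claim, which is a simplifying assumption; the function names, the grayscale nested-list frame format and the threshold are likewise hypothetical.

```python
# Sketch only: process just the pixels that differ from a stored static background.
# A background image stands in for the a priori CSG/CAD model (an assumption).

def dynamic_pixels(frame, static_background, threshold=10):
    """Return the set of (row, col) whose value differs from the background."""
    return {
        (r, c)
        for r, row in enumerate(frame)
        for c, value in enumerate(row)
        if abs(value - static_background[r][c]) > threshold
    }


def process_frame(frame, static_background, process_pixel, threshold=10):
    """Apply process_pixel to dynamic pixels only; static pixels are skipped."""
    return {
        (r, c): process_pixel(frame[r][c])
        for (r, c) in dynamic_pixels(frame, static_background, threshold)
    }
```

In a 3x3 frame where only the centre pixel has changed, process_pixel is called once rather than nine times, which is the reduction in processing complexity the claim refers to.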

13. The video hypermosaicing according to claim 12 wherein the accumulating and storing is of
the geometry of the real-world scene,
potential shapes of objects in the real-world scene,
dynamic behaviors of objects in the real-world scene,
an internal camera calibration model and parameters, and
an external camera calibration model and parameters, as the a priori information regarding the static portion of the scene, and as the CSG/CAD model of the scene.

14. The video hypermosaicing according to claim 13 wherein the processing comprises:
building and maintaining a comprehensive three-dimensional video model of the scene by steps including
calibrating the multiple cameras in three-dimensional space by use of the internal and external camera calibration models and parameters,
extracting all dynamic objects in the multiple video streams of the scene,
localizing each extracted dynamical object in the three-dimensional model, updating positions of existing objects and creating new objects in the model as required, and
mosaicing pixels from the multiple video streams by steps.

15. A method of composing arbitrary new video vistas on a scene from multiple video streams of the scene derived from different spatial perspectives on the scene, the method called video hypermosaicing because it transcends the generation of a two-dimensional video mosaic by video mosaicing and instead generates a spatial-temporally coherent and consistent three-dimensional video mosaic from multiple individual video streams arising from each of multiple video cameras each of which is imaging at least a part of the scene from a perspective that is at least in part different from other ones of the multiple video cameras, the video hypermosaicing composing method comprising:
receiving multiple video streams on a scene each of which streams comprises multiple pixels in a vista coordinate system $V: \{(x_v, y_v, z_v)\}$;
finding for each pixel $(x_v, y_v, d_v(x_v, y_v))$ on the vista the corresponding pixel point $(x_\omega, y_\omega, z_\omega)$ in a model, or world, coordinate system $W: \{(x_\omega, y_\omega, z_\omega)\}$ by using the depth value of the pixel, to wit $[x_\omega \; y_\omega \; z_\omega \; 1]^T = M_v \cdot [x_v \; y_v \; z_v \; 1]^T$;
projecting the found corresponding pixel point onto each of a plurality of camera image planes c of a camera coordinate system $C: \{(x_c, y_c, z_c)\}$ by $[x_c \; y_c \; z_c \; 1]^T = M_c^{-1} \cdot [x_\omega \; y_\omega \; z_\omega \; 1]^T$ where $M_c$ is the $4 \times 4$ homogeneous transformation matrix representing transformation between c and the world coordinate system, in order to produce camera coordinate pixel points $(x_c, y_c, z_c) \; \forall c$;
testing said camera coordinate pixel points $(x_c, y_c, z_c) \; \forall c$ for occlusion from view by comparing $z_c$ with the depth value for the found corresponding pixel point so as to produce several candidates that could be used for the pixel $(x_c, y_c)$ for the vista;
evaluating each candidate view cv by a criteria, to wit, first computing an angle A subtended by a line between a candidate camera and a vista position with the object point $(x_\omega, y_\omega, z_\omega)$ by use of the cosine formula $A = \arccos \surd (b^2 + c^2 - a^2)/(2bc)$, and then computing the distance of the object point $(x_\omega, y_\omega, z_\omega)$ from camera window coordinate $(x_c, y_c)$, which is the depth value $d_c(x_c, y_c)$;
evaluating each candidate view by an evaluation criterion $e_{cv} = f(A, B \cdot d_c(x_c, y_c))$, where B is a small number; and
repeating the receiving, the finding, the projecting, the testing and the evaluating for an instance of time of each video frame assuming a stationary viewpoint.
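The steps of this claim can be sketched in a few small functions. This is an illustration under stated assumptions, not the patented implementation: matrices are plain nested lists, the caller supplies the already-inverted camera matrix, the occlusion test treats a point as visible when its camera-space depth matches the camera's depth value within a tolerance, the angle uses the standard law of cosines without the radical sign printed in the claim text, and the unspecified function f is assumed to be a weighted sum in which lower scores are better. All function names and default values are hypothetical.

```python
# Sketch only: per-pixel candidate selection for a synthesized vista.
import math


def mat_vec(m, v):
    """Multiply a 4x4 matrix (nested lists) by a length-4 vector."""
    return [sum(m[i][j] * v[j] for j in range(4)) for i in range(4)]


def vista_to_world(m_v, x_v, y_v, depth):
    """World point for a vista pixel, using its depth value as z_v."""
    xw, yw, zw, _ = mat_vec(m_v, [x_v, y_v, depth, 1.0])
    return (xw, yw, zw)


def world_to_camera(m_c_inv, world_point):
    """Camera coordinates of a world point, given the inverse camera matrix."""
    xc, yc, zc, _ = mat_vec(m_c_inv, [*world_point, 1.0])
    return (xc, yc, zc)


def is_visible(z_c, camera_depth, tolerance=1e-3):
    """Assumed occlusion test: the point is seen if the depths agree."""
    return abs(z_c - camera_depth) <= tolerance


def subtended_angle(camera_pos, vista_pos, object_point):
    """Angle at the object point between camera and vista (law of cosines).

    Assumes the object point coincides with neither position (b, c nonzero).
    """
    a = math.dist(camera_pos, vista_pos)    # side opposite the angle
    b = math.dist(object_point, camera_pos)
    c = math.dist(object_point, vista_pos)
    cos_a = (b * b + c * c - a * a) / (2.0 * b * c)
    return math.acos(max(-1.0, min(1.0, cos_a)))  # clamp against rounding


def score(angle, depth, b_weight=0.01):
    """Assumed form of f: angle plus a small multiple of depth; lower is better."""
    return angle + b_weight * depth


def best_candidate(candidates, b_weight=0.01):
    """Pick the camera id with the lowest score from (id, angle, depth) tuples."""
    return min(candidates, key=lambda cand: score(cand[1], cand[2], b_weight))[0]
```

Under these assumptions a camera whose line of sight to the object point is nearly aligned with the vista's, and which is not much farther from the point, is preferred as the source of the pixel. Repeating the selection per pixel and per frame corresponds to the final repeating step of the claim.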

16. The hypermosaicing composing method according to claim 15 extended to produce a fly-by sequence of view in the world coordinate system, the extended method comprising:
repeating the receiving, the finding, the projecting, the testing and the evaluating for every point of a view port in the world coordinate system.

17. A method of presenting a particular stereoscopic two-dimensional video image of a real-world three dimensional scene to a viewer in accordance with a criterion supplied by the viewer, the method comprising:
imaging in multiple video cameras each at a different spatial location multiple two-dimensional video images of a real-world scene each at a different spatial perspective;
combining in a computer the multiple two-dimensional images of the scene into a three-dimensional model of the scene;
receiving in the computer from a prospective viewer of the scene a viewer-specified criterion relative to which criterion the viewer wishes to view the scene;
synthesizing, in a computer from the three-dimensional model in accordance with the received viewer criterion, a stereoscopic two-dimensional image that is without exact correspondence to any of the images of the real-world scene that are imaged by any of the multiple video cameras; and
displaying in a video display the particular stereoscopic two-dimensional image of the real-world scene to the viewer.

18. The method according to claim 17
wherein the receiving is of the viewer-specified criterion of a particular spatial perspective, relative to which particular spatial perspective the viewer wishes to view the scene; and
wherein the synthesizing in the computer from the three-dimensional model is of a particular two-dimensional image of the scene in accordance with the particular spatial perspective criterion received from the viewer; and
wherein the displaying in the video display is of the particular stereoscopic two-dimensional image of the scene that is in accordance with the particular spatial perspective received from the viewer.

19. The method according to claim 17 performed in real time as television presented to a viewer interactively in accordance with the viewer-specified criterion.

20. A method of presenting a particular stereoscopic two-dimensional video image of a real-world three dimensional scene to a viewer in accordance with a criterion supplied by the viewer, the method comprising:
imaging in multiple video cameras each at a different spatial location multiple two-dimensional video images of a real-world scene each at a different spatial perspective;
combining in a computer the multiple two-dimensional images of the scene into a three-dimensional model of the scene so as to generate a three-dimensional model of the scene in which model objects in the scene are identified;
receiving in the computer from a prospective viewer of the scene a viewer-specified criterion of a selected object in the scene that the viewer wishes to particularly view;
synthesizing, in a computer from the three-dimensional model in accordance with the received viewer criterion, a particular stereoscopic two-dimensional image of the selected object in the scene; and
displaying to the viewer in the video display the particular stereoscopic image of the scene showing the viewer-selected object.

21. The method according to claim 20 wherein the viewer-selected object in the scene is static, and unmoving, in the scene.

22. The method according to claim 20 wherein the viewer-selected object in the scene is dynamic, and moving, in the scene.

23. The method according to claim 20 wherein selection of the object relative to which the viewer wishes to particularly view in the scene transpires by
viewer positioning of a device of a type that is suitably used with an artificial reality system to sense viewer position and viewer movement and viewer direction of focus;
sensing with the device the viewer position and movement and direction of focus;
unambiguously interpreting in three dimensions an association between, on the one hand, the object position and, on the other hand, the viewer position and movement and direction of focus, so as to specify the object relative to which the viewer wishes to particularly view in the scene;
wherein the association transpires, as the three-dimensional model of the scene supports, in three and not just in two dimensions.

24. A method of presenting a particular stereoscopic two-dimensional video image of a real-world three dimensional scene to a viewer in accordance with a criterion supplied by the viewer, the method comprising:
imaging in multiple video cameras each at a different spatial location multiple two-dimensional video images of a real-world scene each at a different spatial perspective;
combining in a computer the multiple two-dimensional images of the scene into a three-dimensional model of the scene so as to generate a three-dimensional model of the scene in which model events in the scene are identified;
receiving in the computer from a prospective viewer of the scene a viewer-specified criterion of a selected event in the scene that the viewer wishes to particularly view;
synthesizing, in a computer from the three-dimensional model in accordance with the received viewer criterion, a particular stereoscopic two-dimensional image of the selected event in the scene; and
displaying to the viewer in the video display the particular stereoscopic image of the scene showing the viewer-selected event.

25. The method according to claim 24 wherein selection of the object relative to which the viewer wishes to particularly view in the scene transpires by
viewer positioning of a device of a type that is suitably used with an artificial reality system to sense viewer position and viewer movement and viewer direction of focus;
sensing with the device the viewer position and movement and direction of focus;
unambiguously interpreting in three dimensions an association between, on the one hand, the object position and, on the other hand, the viewer position and movement and direction of focus, so as to specify the object relative to which the viewer wishes to particularly view in the scene;
wherein the association transpires, as the three-dimensional model of the scene supports, in three and not just in two dimensions.

26. A method of synthesizing a stereoscopic virtual video image from real video images obtained by multiple real video cameras, the method comprising:
storing in a video image database the real two-dimensional video images of a scene from each of a multiplicity of real video cameras;
creating in a computer from the multiplicity of stored two-dimensional video images a three-dimensional video database containing a three-dimensional video image of the scene, the three-dimensional video database being characterized in that the three-dimensional location of objects in the scene is within the database; and
synthesizing a two-dimensional stereoscopic virtual video image of the scene from the three-dimensional video database;
wherein the synthesizing is facilitated because the three-dimensional spatial positions of all objects depicted in the stereoscopic virtual video image are known because of their positions within the three-dimensional video database, it being a mathematical transform to present a two-dimensional stereoscopic video image when the three-dimensional positions of objects depicted in the image are known.

27. The method according to claim 26
wherein the synthesizing from the three-dimensional video database is of a two-dimensional stereoscopic virtual video image of the scene having two, a left stereo and a right stereo, image components each of which image components is without correspondence to any real two-dimensional video image of a scene;
wherein the synthesizing is of a 100% synthetic two-dimensional stereoscopic virtual video image, meaning that although the objects within the image are of the scene as it is seen by real video cameras, no camera sees the scene as either the left or the right stereo components;
wherein it may be fairly said that the two-dimensional stereoscopic virtual video image results not from stereoscopically imaging, or videotaping, the scene but rather from synthesizing a stereoscopic view of the scene.

28. The method according to claim 26 that, between the creating and the synthesizing, further comprises:
selecting a spatial perspective, which spatial perspective is not that of any of the multiplicity of real video cameras, on the scene as the scene is imaged within the three-dimensional video model;
wherein the generating of the two-dimensional stereoscopic virtual video image is so as to show the scene from the selected spatial perspective.

29. The method according to claim 28 wherein the selected spatial perspective is static, and fixed, during the video of the scene.

30. The method according to claim 28 wherein the selected spatial perspective is dynamic, and variable, during the video of the scene.

31. The method according to claim 28 wherein the selected spatial perspective is so dynamic and variable dependent upon occurrences in the scene.

32. The method according to claim 26 that, between the creating and the generating, further comprises:
locating a selected object in the scene as is imaged within the three-dimensional video model;
wherein the generating of the two-dimensional stereoscopic virtual video image is so as to best show the selected object.

33. The method according to claim 26 that, between the creating and the generating, further comprises:
dynamically tracking the scene as is imaged within the three-dimensional video model in order to recognize any occurrence of a predetermined event in the scene;
wherein the generating of the two-dimensional stereoscopic virtual video image is so as to best show the predetermined event.

34. The method according to claim 26 wherein the generating is of a selected two-dimensional stereoscopic virtual video image, on demand.

35. The method according to claim 26 wherein the generating of the selected two-dimensional stereoscopic virtual video image is in real time on demand, thus interactive virtual television.

36. A computerized method for presenting video images including a real-world scene, the method comprising:
constructing a three-dimensional environmental model containing both static and dynamic elements of the real world scene;
producing multiple video streams showing two-dimensional images on the real-world scene from differing spatial positions;
identifying static and dynamic portions of each of the multiple video streams;
first warping at least some of corresponding portions of the multiple video streams onto the three-dimensional environmental model as reconstructed three-dimensional objects, wherein at least some image portions that are represented two-dimensionally in a single video stream assume a three-dimensional representation; and
synthesizing a two-dimensional video image that is without equivalence to any of the two-dimensional images that are within the multiple video streams from the three-dimensional environmental model containing the three-dimensional objects.

37. The method according to claim 36 wherein the first warping is of at least some dynamic elements.

38. The method according to claim 37 wherein the first warping is also of at least some static scene elements.

39. The method according to claim 36 further comprising:
second warping at least some of corresponding portions of the multiple video streams onto the three-dimensional environmental model as two-dimensional representations, wherein at least some image portions that are represented two-dimensionally in a single video stream are still represented two-dimensionally even when warped onto an environmental model that is itself three-dimensional; and
wherein the synthesizing of the two-dimensional video image is from the two-dimensional representations, as well as the reconstructed three-dimensional objects, that were both warped onto the three-dimensional environmental model.

40. The method according to claim 36 wherein the identifying of the static and dynamic portions of each of the multiple video streams transpires by tracking changes in scene element representations in the multiple video streams over time.
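Identifying static and dynamic portions by tracking changes over time can be sketched with a simple per-pixel test. The criterion used here (the spread between the largest and smallest value a pixel takes across frames) is an assumed stand-in for whatever tracking the system actually applies; the function name, grayscale nested-list frames and threshold are hypothetical.

```python
# Sketch only: label each pixel static or dynamic from its variation over time.

def classify_pixels(frames, threshold=10):
    """Return a grid of 'static'/'dynamic' labels for equally sized frames."""
    rows, cols = len(frames[0]), len(frames[0][0])
    labels = []
    for r in range(rows):
        row_labels = []
        for c in range(cols):
            values = [frame[r][c] for frame in frames]
            spread = max(values) - min(values)
            row_labels.append("dynamic" if spread > threshold else "static")
        labels.append(row_labels)
    return labels
```

Pixels labelled static could then be treated as part of the fixed environment, while only those labelled dynamic need to be followed from frame to frame.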

41. The method according to claim 36 wherein the environmental model determines whether any scene portion or scene element is to be warped onto itself as either a two-dimensional representation or as a reconstructed three-dimensional object.

42. The method according to claim 41 wherein the synthesizing in accordance with a viewer specified criterion is dynamic in accordance with such criterion, and, although the criterion does not change, the scene selection responsive thereto is of a varying, and not a same and consistent, view on the scene from time to time.

43. The method according to claim 36 wherein the synthesizing is in accordance with a viewer specified criterion.

44. The method according to claim 43 wherein the dynamic synthesizing is in accordance with a viewer specified criterion of any of
an object in the scene, and
an event in the scene.

45. The method according to claim 43 wherein the dynamic synthesizing is of a stereoscopic image.

46. A computer system, receiving multiple video images of views on a real world scene, for synthesizing a video image of the scene which synthesized image is not identical to any of the multiple received video images, the system comprising:
an information base containing a geometry of the real-world scene, shapes and dynamic behaviors expected from moving objects in the scene, plus internal and external camera calibration models on the scene;
a video data analyzer means for detecting and for tracking objects of potential interest in the scene, and locations of these objects;
a three-dimensional environmental model builder means for recording the detected and tracked objects at their proper locations in a three-dimensional model of the scene, the recording being in consideration of the information base;
a viewer interface means responsive to a viewer of the scene to receive a viewer selection of a desired view on the scene, which desired view need not be identical to any views that are within any of the multiple received video images; and
a visualizer means for generating from the three-dimensional model of the scene in accordance with the received desired view a video image on the scene that so shows the scene from the desired view.

Description

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention generally concerns (i) multimedia, (ii) video, including video-on-demand and interactive video, and (iii) television, including television-on-demand and interactive television.

The present invention particularly concerns synthesizing diverse spatially and temporally coherent and consistent virtual video cameras, and virtual video images, from multiple real video images that are obtained by multiple real video cameras.

The present invention still further concerns the creation of three-dimensional video image models, and the location and dynamical tracking of video images of selected objects depicted in the models for, among other purposes, the selection of a real camera or image, or the synthesis of a virtual camera or image, best showing the object selected.

The present invention still further concerns (i) interactive synthesis of video, or television, images of a real-world scene on demand, (ii) the synthesis of virtual video images of a real-world scene in real time, or virtual television, (iii) the synthesis of virtual video images/virtual television pictures of a real-world scene which video images/virtual television are linked to any of a particular perspective on the video/television scene, an object in the video/television scene, or an event in the video/television scene, (iv) the synthesis of virtual video images/virtual television pictures of a real-world scene wherein the pictures are so synthesized to user-specified parameters of presentation, e.g. panoramic, or at magnified scale if so desired by the user, and (v) the synthesis of 3D stereoscopic virtual video images/virtual television.

2. Description of the Prior Art

2.1 Limitations in the Viewing of Video and Television Dealt with by the Predecessor MPI Video System and Method, and the Relationship of the Present Invention

The traditional model of television and video is based on a single video stream transmitted to a passive viewer. A viewer has the option to watch the particular video stream, and to rewatch should the video be recorded, but little else. Due to the emergence of the information highways and other related information infrastructure circa 1995, there has been considerable interest in concepts like video-on-demand, interactive movies, interactive TV, and virtual presence. Some of these concepts are exciting, and suggest many dramatic changes in society due to the continuing dawning of the information age.

The related predecessor patent applications teach that a novel form of video, and television, is possible--and has, indeed, already been reduced to operative practice--where a viewer of video, or television, may select a particular perspective from which perspective a real-world scene will henceforth be presented. The viewer may alternatively select a particular object--which may be a dynamically moving object--or even an event in the real world scene that is of particular interest. As the scene develops its presentation to the viewer will prominently feature the selected object or the selected event (if occurring).

Accordingly, video presentation of a real-world scene in accordance with the related predecessor inventions is interactive with both (i) a viewer of the scene and, in the case of a selected dynamically moving object, or an event, in the scene, (ii) the scene itself. True interactive video or television is thus presented to a viewer. The image presented to the viewer may be a full virtual image that is not mandated to correspond to any real camera nor to any real camera image. A viewer may thus view a video or television of a real-world scene from a vantage point (i.e., a perspective on the video scene), and/or dynamically in response to objects moving in the scene and/or events transpiring in the scene, in a manner that is not possible in reality. The viewer may, for example, view the scene from a point in the air above the scene, or from the vantage point of an object in the scene, where no real camera exists or even, in some cases, can exist.

The predecessor video system, and approach, is called Multiple Perspective Interactive ("MPI") video. MPI video is the basis, and the core, of the "immersive video" (non-real-time) and "immersive telepresence" or "Visualized Reality" (VisR) (real-time) systems of the present invention. The MPI Video system itself overcomes several limitations of conventional video. See, for example, 1) Wendy E. Mackay and Glorianna Davenport; "Virtual video editing in interactive multimedia applications" appearing in Communications of the ACM, 32(7): 802-810, July 1989; 2) Eitetsu Oomoto and Katsumi Tanaka; "Ovid: Design and implementation of a video-object database system" submitted in Spring 1995 to IEEE Transactions on Knowledge and Data Engineering; 3) Glorianna Davenport, Thomas Aguirre Smith, and Natalio Pincever; "Cinematic primitives for multimedia" appearing in IEEE Computer Graphics & Applications, pages 67-74, July 1991; and 4) Anderson H. Gary; Video Editing and Post Production: A Professional Guide, Knowledge Industry Publications, 1988.

MPI video supports the editing of, and viewer interaction with, video and television in a manner that is useful in viewing activities ranging from education to entertainment. In particular, in conventional video, viewers are substantially passive; all they can do is to control the flow of video by pressing buttons such as play, pause, fast forward or fast reverse. These controls essentially provide the viewer only one choice for a particular segment of video: the viewer can either see the video (albeit at a controllable rate), or skip it.

In the case of live television broadcast, viewers have essentially no control at all. A viewer must either see exactly what a broadcaster chooses to show, or else change away from that broadcaster and station. Even in sports and other broadcast events where multiple cameras are used, a viewer has no choice except the obvious one of either viewing the image presented or else using a remote control so as to "surf" multiple channels.

Interactive video and television systems such as MPI video make good use of the availability of increased video bandwidth due to new satellite and fiber optic video links, and due to advances in several areas of video technology. Author George Gilder argues that because the viewers really have no choice in the current form of television, it is destined to be replaced by a more viewer-driven system or device. See George Gilder; Life After Television: The coming transformation of Media and American Life, W. W. Norton & Co., 1994.

The related invention of MPI video makes considerable progress--even by use of currently existing technology--towards "liberating" video and TV from the traditional single-source, broadcast, model, and towards placing each viewer in his or her own "director's seat".

A three-dimensional (3D) video model, or database, is used in MPI video. The immersive video and immersive telepresence systems of the present invention preserve, expand, and build upon this 3D model. This three-dimensional model, and the functions that it performs, are well and completely understood, and will be completely taught within this specification. However, the considerable computational power required if a full custom virtual video image for each viewer is to be synthesized in real time and on demand requires that the model should be constructed and maintained in consideration of (i) powerful organizing principles, (ii) efficient algorithms, and (iii) effective and judicious simplifying assumptions. This then, and more, is what the present invention will be seen to concern.

2.2 Related MPI-Video

For the sake of completeness, the purposes of the Multiple Perspective Interactive Video, or MPI-video, that is the subject of the related predecessor application are recapitulated in this application.

MPI video presents requirements that are both daunting and expensive, but realizable in terms of the communications and computer hardware available circa 1995. About 10^3 times more video data than is within a modern television channel may usefully be transmitted to each viewer. Each viewer may usefully benefit from the computational power equivalent to several powerful engineering work station computers (circa 1995). Once this is done, however, then the "bounds of earth" are shed, and a viewer may interact with any three-dimensional real-world scene much as if he/she were an omnipotent, prescient, being whose vantage point on the scene is unfettered save only that it must be through a two-dimensional video "window" of the viewer's choice.

These functions performed by MPI video prospectively serve to make MPI video a revolutionary new media. Even rudimentary, presently realized, embodiments of MPI video do many useful things. For example, in the particular context of the video (and television) presentation of American football (in which environment the model is exercised), some few football players, and the football itself, will be seen to be susceptible of being automatically "tracked" during play in order that a video image presented to a viewer by the system may be selectively "keyed" to the action of the game.

A "next step" in MPI video beyond this rudimentary implementation is as a non-real-time pre-processed "game video". Such a "game video" may be recorded on the now-emerging new-format Video CD. Some twenty-three different "tracks", for example, may be recorded to profile each player on the field from both teams, and also the football.

A "next step" in MPI video beyond even this will be to send the same information on twenty-three channels live, and in real time, on game day. Subscriber/viewer voting may permit a limited interaction. For example, the "fans" around a particular television might select a camera, or synthesis of a virtual camera, profiling the "defensive backs".

Finally, and what will undoubtedly transpire only after the lapse of some years from the present time (1995), it should be possible for each fan to be his or her own "game director", and to watch in real time substantially exactly what he or she wants.

Accordingly, to exercise even the MPI video system at its maximum capability, some advancement of technology will be useful, and is confidently expected, in the fields of computer vision, multimedia database and human interface.

See, for example, Swanberg: 1) Deborah Swanberg, Terry Weymouth, and Ramesh Jain, "Domain information model: an extended data model for insertions and query", appearing in Proceedings of the Multimedia Information Systems, pages 39-51, Intelligent Information Systems Laboratory, Arizona State University, Feb. 1, 1992; and 2) Deborah Swanberg, Chiao-Fe Shu, and Ramesh Jain, "Architecture of a multimedia information system for content-based retrieval", appearing in Audio Video Workshop, San Diego, Calif., November 1992.

See also, for example, Hampapur: 1) Arun Hampapur, Ramesh Jain, and Terry Weymouth, "Digital video segmentation", appearing in Proceedings of the ACM conference on MultiMedia, Association of Computing Machinery, October 1994; and 2) Arun Hampapur, Ramesh Jain, and Terry Weymouth, "Digital video indexing in multimedia systems", appearing in Proceedings of the Workshop on Indexing and Reuse in Multimedia Systems, American Association of Artificial Intelligence, August 1994.

See further, for example, Zhang: 1) H. J. Zhang, A. Kankanhalli, and S. W. Smoliar, "Automatic partitioning of video", appearing in Multimedia Systems, 1(1):10-28, 1993; and 2) Hong Jiang Zhang, Yihong Gong, Stephen W. Smoliar, and Shuang Yeo Tan, "Automatic parsing of news video", appearing in Proceedings of the IEEE Conference on Multimedia Computing Systems, May 1994.

See also, for example, 1) Akio Nagasaka and Yuzuru Tanaka, "Automatic video indexing and full-video search for object appearances", appearing in 2nd Working Conference on Visual Database Systems, pages 119-133, Budapest, Hungary, October 1991, 2) Farshid Arman, Arding Hsu, and Ming-Yee Chiu, "Image processing on compressed data for large video databases", appearing in Proceedings of the ACM MultiMedia, pages 267-272, California, USA, June 1993, Association of Computing Machinery, 3) Glorianna Davenport, Thomas Aguirre Smith, and Natalio Pincever; op cit; 4) Eitetsu Oomoto and Katsumi Tanaka, op cit.; and 5) Akihito Akutsu, Yoshinobu Tonomura, Hideo Hashimoto, and Yuji Ohba, "Video indexing using motion vectors", appearing in Proceedings of SPIE: Visual Communications and Image Processing 92, November 1992.

When considering these references, it should be recalled that MPI video is already operative. Actual results obtained in the immersive video and visual telepresence expansions and applications of the MPI video system will be presented in this specification.

2.3 Previous Scene-Interactive Video and Television

Scene-interactive video and television is nothing so grandiose as permitting a user/viewer to interact with the objects and/or events of a scene--as will be seen to be the subject of the present and related inventions. Rather, the interaction with the scene is simply that of a machine--a computer--that must recognize, classify and, normally, adapt its responses to what it "sees" in the scene. Scene-interactive video and television is thus simply an extension of machine vision so as to permit a computer to make decisions, sound alarms, etc., based on what it detects in, and detects to be transpiring in, a video scene. Two classic problems in this area (which problems are not commensurate in difficulty) are (i) security cameras, which must detect contraband, and (ii) an autonomous computer-guided automated battlefield tank, which must sense and respond to its environment.

U.S. Pat. No. 5,109,425 to Lawton for a METHOD AND APPARATUS FOR PREDICTING THE DIRECTION OF MOVEMENT IN MACHINE VISION concerns the detection of motion in and by a computer-simulated cortical network, particularly for the motion of a mobile rover. Interestingly, a subsystem of the present invention will be seen to capture the image of a moving mobile rover within a scene, and to classify the image captured to the rover and to its movement. However, the video and television systems of the present invention, and their MPI-video subsystem, will be seen to function quite differently than the method and apparatus of Lawton in the detection of motion. An MPI video system avails itself of multiple two-dimensional video images from each of multiple stationary cameras as are assembled into a three-dimensional video image model, or database. Once these multiple images of the MPI video system are available for object, and for object track (i.e., motion), correlation(s), then it proves a somewhat simpler matter to detect motion in the MPI video system than in prior art single-perspective systems such as that of Lawton.

U.S. Pat. No. 5,170,440 to Cox for PERCEPTUAL GROUPING BY MULTIPLE HYPOTHESIS PROBABILISTIC DATA ASSOCIATION is a concept of a computer vision algorithm. Again, the video and television systems of the present invention are privileged to start with much more information than any single-point machine vision system. Recall that an MPI video system avails itself of multiple two-dimensional video images from each of multiple stationary cameras, and that these multiple two-dimensional images are, moreover, assembled into a three-dimensional video image model, or database.

The general concepts, and voluminous prior art, concerning "machine vision", "(target) classification", and "(target) tracking" are all relevant to the present invention. However, the video and television systems of the present invention--while doing very, very well in each of viewing, classifying and tracking--will be seen to come to these problems from a very different perspective than does the prior art. Namely, the prior art considers platforms--whether they are rovers or warships--that are "located in the world", and that must make sense of their view thereof from essentially but a single perspective centered on present location.

The present invention functions oppositely. It "defines the world", or at least so much of the world as is "on stage" and in view to (each of) multiple video cameras. The video and television systems of the present invention have at their command a plethora of correlatable and correlated, simultaneous, positional information. Once it is known where each of multiple cameras is located, and is pointing, it is a straightforward matter for computer processes to fix, and to track, items in the scene.
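As an illustration only of why known camera positions and pointing directions make it straightforward to fix an item in the scene, the following minimal sketch locates a point from two sight rays. It is not taken from this specification: the function name `triangulate`, the ray representation, and the midpoint-of-closest-approach technique are all assumptions chosen for brevity.

```python
from typing import Tuple

Vec = Tuple[float, float, float]


def _dot(a: Vec, b: Vec) -> float:
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _along(origin: Vec, direction: Vec, t: float) -> Vec:
    # Point reached by travelling t units of `direction` from `origin`.
    return (origin[0] + t * direction[0],
            origin[1] + t * direction[1],
            origin[2] + t * direction[2])


def triangulate(c1: Vec, d1: Vec, c2: Vec, d2: Vec) -> Vec:
    """Estimate a scene point from two camera sight rays (hypothetical helper).

    c1, c2 are known camera positions; d1, d2 are the directions in which
    each camera sees the same item. The result is the midpoint of the
    shortest segment joining the two rays.
    """
    w = (c1[0] - c2[0], c1[1] - c2[1], c1[2] - c2[2])
    a, b, c = _dot(d1, d1), _dot(d1, d2), _dot(d2, d2)
    d, e = _dot(d1, w), _dot(d2, w)
    denom = a * c - b * b
    if abs(denom) < 1e-12:
        raise ValueError("sight rays are parallel; the point cannot be fixed")
    t1 = (b * e - c * d) / denom  # parameter along the first ray
    t2 = (a * e - b * d) / denom  # parameter along the second ray
    p1, p2 = _along(c1, d1, t1), _along(c2, d2, t2)
    return ((p1[0] + p2[0]) / 2, (p1[1] + p2[1]) / 2, (p1[2] + p2[2]) / 2)


# Two cameras 10 units apart, both sighting an item at (5, 5, 2).
point = triangulate((0.0, 0.0, 0.0), (5.0, 5.0, 2.0),
                    (10.0, 0.0, 0.0), (-5.0, 5.0, 2.0))
```

With more than two cameras the same idea would typically be posed as a least-squares problem over all rays; two rays are used here only to keep the sketch short.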

The systems, including the MPI-video subsystem, of the present invention will be seen to perform co-ordinate transformation of (video) image data (i.e., pixels), and to do this during a generation of two- and three-dimensional image models, or databases. U.S. Pat. No. 5,259,037 to Plunk for AUTOMATED VIDEO IMAGERY DATABASE GENERATION USING PHOTOGRAMMETRY discusses the conversion of forward-looking video or motion picture imagery into a database particularly to support image generation of a "top down" view. The present invention does not require any method so sophisticated as that of Plunk, who uses a Kalman filter to compensate for the roll, pitch and yaw of the airborne imaging platform: an airplane. In general the necessary image transformations of the present invention will be seen not to be plagued by dynamic considerations (other than camera pan and zoom)--the multiple cameras remaining fixed in position imaging the scene (in which scene the objects, however, may be dynamic).

Finally, U.S. Pat. No. 5,237,648 to Cohen for an APPARATUS AND METHOD FOR EDITING A VIDEO RECORDING BY SELECTING AND DISPLAYING VIDEO CLIPS shows and discusses some of the concerns, and desired displays, presented to a human video editor. In the systems of the present invention much of this function will be seen to be assumed by hardware.

The system of the present invention will be seen to, in its rudimentary embodiment, perform a spatial positional calibration of each of multiple video cameras from the images produced by such cameras because, quite simply, in the initial test data the spatial locations of the cameras were neither controlled by, nor even known to, the inventors. This is not normally the case, and the multi-perspective video of the present invention normally originates from multiple cameras for which (i) the positions, and (ii) the zoom in/zoom out parameters, are well known, and fully predetermined, to the system. However, and notably, prior knowledge of camera position(s) may be "reverse engineered" by a system from a camera(s') image(s). Two prior art articles so discussing this process are "A Camera Calibration Technique using Three Sets of Parallel Lines", by Tomino Echigo appearing in Machine Visions and Applications, 3;139-167 (1990); and "A theory of Self-Calibration of a Moving Camera", by S. J. Maybank and O. D. Faugeras appearing in International Journal of Computer Vision 8:2;123-151 (1992).

In general, many computer processes performed in the present invention are no more sophisticated than are the computer processes of the prior art, but they are, in very many ways, often greatly more audacious. The present invention will be seen to manage a very great amount of video data. A three-dimensional video model, or database, is constructed. For any sizable duration of video (and a sizable length thereof may perhaps not have to be retained at all, or at least retained long), this model (this database) is huge. More problematical, it takes very considerable computer "horsepower" to construct this model--howsoever long its video data should be held and used.

However, the inventors have already taken a major multi-media laboratory at a major university and "rushed in where angels fear to tread" in developing MPI video--a form of video presentation that is believed to be wholly new. Having found the "ground" under their invention to be firmer, the expected problems more tractable, the results better, and the images of greater practical usefulness than might have been expected, the inventors continue with expansion and adaptation of the MPI video system to realize untrammeled video views--including stereoscopic views. In non-real-time applications this realization, and the special processes of so realizing, are called "immersive video". In real-time applications the realization, and the processes, are "immersive telepresence", or "visual reality", or "VisR". In particular the inventors continue to find--a few strategic simplifications being made--that presently-available computer and computer systems resources can produce results of probable practical value. Such is the subject of the following specification sections.

2.4 Previous Composite Video and Television

The present invention of immersive video will be seen to involve the manipulation, processing and compositing of video data in order to synthesize video images. (Video compositing is the amalgamation of video data from separate video streams.) It is known to produce video images that--by virtue of view angle, size, magnification, etc.--are generally without exact correspondence to any single "real-world" video image. The previous process of so doing is called "video mosaicing".

The present general interest in, and techniques for, generating a video mosaic from an underlying video sequence are explained, inter alia, by M. Hansen, P. Anandan, K. Dana, G. Van der Wal and P. Burt in Real-time scene stabilization and mosaic construction, appearing in ARPA Image Understanding Workshop, Monterey, Calif., Nov. 13-16, 1994; and also by H. Sawhney, S. Ayer, and M. Gorkani in Model-based 2D and 3D dominant motion estimation for mosaicing and video representation, appearing in Technical Report, IBM Almaden Research Center, 1994.

Video mosaicing has numerous applications including (1) data compression and (2) video enhancement. See M. Irani and S. Peleg, Motion Analysis for image enhancement: resolution, occlusion, and transparency, appearing in Journal of Visual Communication and Image Representation, 4(4):324-35, December 1993. Another application of video mosaicing is (3) the generation of panoramic views. See R. Szeliski, Image mosaicing for tele-reality applications, appearing in Proceedings of Workshop on Applications of Computer Vision, pages 44-53, Sarasota, Fla., December 1994, IEEE, IEEE Computer Society Press; L. McMillan, Acquiring immersive virtual environments with an uncalibrated camera, Technical Report TR95-006, Computer Science Department, University of North Carolina, Chapel Hill, N.C., April 1995; and S. Mann and R. W. Picard, Virtual Bellows: constructing high quality stills from video, Technical Report TR#259, Media Lab, MIT, Cambridge, Mass., November 1994. Still further applications of video mosaicing are (4) high-definition television and (5) video libraries.

The underlying task in video mosaicing is to create larger images from frames obtained from one or more single cameras, typically one single camera producing a panning video stream. To generate seamless video mosaics, registration and alignment of the frames from a sequence are critical issues.
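To make the registration and alignment step concrete, the sketch below registers two overlapping frames from a horizontally panning camera by exhaustive search over integer shifts, then pastes them into a wider mosaic. It is a deliberately simplified illustration and not one of the cited techniques: the names `estimate_shift` and `build_mosaic`, the pure horizontal-translation model, and the mean-squared-difference score are all assumptions.

```python
from typing import List

Frame = List[List[int]]  # rows of grey-level pixel values


def estimate_shift(ref: Frame, frame: Frame, max_shift: int) -> int:
    """Return the shift s that best satisfies frame[y][x] == ref[y][x + s].

    Every candidate shift from 0 to max_shift is scored by the mean squared
    difference over the columns where the two frames overlap.
    """
    width = len(ref[0])
    best_shift, best_err = 0, float("inf")
    for s in range(max_shift + 1):
        overlap = width - s
        if overlap <= 0:
            break
        err = 0.0
        for row_ref, row_new in zip(ref, frame):
            for x in range(overlap):
                diff = row_ref[x + s] - row_new[x]
                err += diff * diff
        err /= overlap * len(ref)
        if err < best_err:
            best_shift, best_err = s, err
    return best_shift


def build_mosaic(ref: Frame, frame: Frame, shift: int) -> Frame:
    """Keep the first `shift` columns of ref, then append the new frame."""
    return [row_ref[:shift] + row_new for row_ref, row_new in zip(ref, frame)]


# Synthetic scene 11 pixels wide; the camera pans 3 pixels between frames.
scene = [[x * x + 3 * y for x in range(11)] for y in range(4)]
first = [row[0:8] for row in scene]
second = [row[3:11] for row in scene]
shift = estimate_shift(first, second, max_shift=6)
mosaic = build_mosaic(first, second, shift)
```

Practical mosaicing must also handle rotation, zoom, sub-pixel motion and blending at the seams, which this sketch omits.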

Simple, yet robust, techniques have been advanced to solve the registration and alignment challenges. See, for example, the multi-resolution area-based scheme described in M. Hansen, P. Anandan, K. Dana, G. Van der Wal and P. Burt, op cit. For scenes containing dynamic objects, parallax has been used to extract dominant 2D and 3D motions, which were then used to register the frames and generate the mosaic. See H. Sawhney, Motion video annotation and analysis: An overview, appearing in Proceedings 27 Asilomar Conference on Signals, Systems and Computers, pages 85-89, IEEE, November 1993.

For multiple dynamic objects in a scene, "motion layers" have been introduced. In these layers each dynamic object is assumed to move in a plane parallel to the camera. This permits segmentation of the video into different components each containing a dynamic object, which can then be interpreted or re-synthesized as a video stream. See J. Wang and E. Adelson, Representing Moving Images with Layers, IEEE Transactions on Image Processing, 3(4):625-38, September 1994.

In general, previous activities in video mosaicing might be characterized as piecewise, and "from the bottom up" in developing new techniques. In contrast, the perspective of the present invention might be characterized as "top down". The immersive video system of the present invention will be seen to assimilate, and manipulate, a relatively large amount of video data. In particular, multiple independent streams of video data of the same scene at the same time will be seen to be input to the system. Because the system of the present invention has a plethora of information, it may well be relatively more straightforward for it to accomplish things like recognizing and classifying moving objects in a scene, or even to do exotic things like displaying stereoscopic scene images, than it is for previous systems handling less information. Video data from the real-world may well be a lot simpler for a machine (a computer) to interact with when, as derived from multiple perspectives, it is so abundant as to permit objects and occurrences in the video scene to be interpreted without substantial ambiguity.

Notably, this concept is outside normal human ken: although we see with two eyes, we do not see things from all directions at the same time. Humans have, of course, highly evolved brains, and perception. However, at least one situation of limited analogy exists. At the IBM "people mover" pavilion at the 1957 World's Fair a multi-media, multi-screen presentation of the then-existing processes for the manufacturing of computers was shown to an audience inside an egg-shaped theater. A single process was shown in a lively way from as many as a dozen different views with abundant, choreographed, changes in perspective, magnification, relationship, etc., between each and all simultaneous views. The audience retention, and comprehension, of the relatively new, and complex, information presented was considered exceptional when measured, thus showing that humans, as well as (the inventors would argue) computers, can benefit by having a "good look".

Next, the immersive video system of the present invention will be seen to use its several streams of 2D video data to build and maintain a 3D video model, or database. The utility of such a 3D model, or database, in the synthesis of virtual video images seems clear. For example, an arbitrary planar view of the scene will contain the data of a 2D planar slice "through" the 3D model.
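The notion of a planar slice "through" a 3D model can be illustrated with a small sketch. It assumes, purely for illustration, that the model is stored as a dense voxel grid (nested lists indexed `model[i][j][k]`) and that the slice is sampled by nearest-neighbour lookup; the function name `slice_model` and its parameters are hypothetical and are not drawn from this specification.

```python
from typing import List, Sequence

Volume = List[List[List[int]]]  # model[i][j][k]: value stored at voxel (i, j, k)


def slice_model(model: Volume, origin: Sequence[float], u: Sequence[float],
                v: Sequence[float], width: int, height: int,
                empty: int = 0) -> List[List[int]]:
    """Sample a planar 2D image out of a 3D voxel model.

    The plane passes through `origin`; moving one image column adds `u` and
    moving one image row adds `v` (both given in voxel units). Samples that
    fall outside the model are filled with `empty`.
    """
    nx, ny, nz = len(model), len(model[0]), len(model[0][0])
    image = []
    for r in range(height):
        row = []
        for c in range(width):
            i = round(origin[0] + c * u[0] + r * v[0])
            j = round(origin[1] + c * u[1] + r * v[1])
            k = round(origin[2] + c * u[2] + r * v[2])
            inside = 0 <= i < nx and 0 <= j < ny and 0 <= k < nz
            row.append(model[i][j][k] if inside else empty)
        image.append(row)
    return image


# A 4 x 4 x 4 model whose voxel values encode their own coordinates.
model = [[[100 * i + 10 * j + k for k in range(4)] for j in range(4)]
         for i in range(4)]
horizontal = slice_model(model, (0, 0, 2), (1, 0, 0), (0, 1, 0), 4, 4)
diagonal = slice_model(model, (0, 0, 0), (1, 1, 0), (0, 0, 1), 4, 4)
```

A perspective (rather than planar) view would instead cast rays from a viewpoint through the model; the planar case is shown because it is the example given in the text.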

The limitation on such a scheme of an information-intensive representation, and manipulation, of the video data of a real-world scene is that a purely "brute force" approach is impossible with presently available technology. The "trade-off" in handling a lot of video data is that (i) certain scene (or at least scene video) constraints must be imposed, (ii) certain simplifying assumptions must be made (regarding the content of the video information), (iii) certain expediencies must be embraced (regarding the manipulations of the video data), and/or (iv) certain limitations must be put on what images can, or cannot, be synthesized from such data. (The present invention will be seen to involve essentially no (iv) limitations on presentation.) Insofar as the necessary choices and trade-offs are astutely made, then it may well be possible to synthesize useful and aesthetically pleasing video, and even television, images by the use of tractable numbers of affordable computers and other equipment running software programs of reasonable size.

The immersive video system of the present invention will so show that--(i) certain scene constraints being made, (ii) certain simplifying assumptions being made regarding scene objects and object dynamical motions, and (iii) certain computational efficiencies in the manipulations of video data being embraced--it is indeed possible, and even practical, to so synthesize useful and aesthetically pleasing video, and even television, images.

SUMMARY OF THE INVENTION

The present invention contemplates telepresence and immersive video, being the non-real-time creation of a synthesized, virtual, camera/video image of a real-world scene, typically in accordance with one or more viewing criteria that are chosen by a viewer of the scene. The creation of the virtual image is based on a computerized video processing--in a process called hypermosaicing--of multiple video views of the scene, each from a different spatial perspective on the scene.

When the synthesis and the presentation of the virtual image transpires as the viewer desires--and particularly as the viewer indicates his or her viewing desires simply by action of moving and/or orienting any of his or her body, head and eyes--then the process is called "immersive telepresence", or simply "telepresence". Alternatively, the process is sometimes called "visual reality", or simply "VisR".

(The proliferation of descriptive terms has more to do with the apparent reality(ies) of the synthesized views drawn from the real-world scene than it does with the system and processes of the present invention for synthesizing such views. For example, a quite reasonable ground level view of a football quarterback as may be synthesized by the system and method of the present invention may appear to a viewer to have been derived from a hand-held television camera, although in fact no such camera exists and the view was not so derived. These views of common experience are preliminarily called "telepresence". Contrast a magnified, eye-to-eye, view with an ant. This magnified view is also of the real-world, although it is clearly a view that is neither directly visible to the naked eye, nor of common experience. Although derived by entirely the same processes, synthesized views of this latter type are preliminarily called "visual reality", or "VisR", by juxtaposition of such views with the similar sensory effects engendered by "virtual reality", or "VR".)

1. Telepresence, Both Immersive and Interactive

In one of its aspects, the present invention is embodied in a method of telepresence, being a video representation of being at a real-world scene that is other than the instant scene of the viewer. The method includes (i) capturing video of a real-world scene from each of a multiplicity of different spatial perspectives on the scene, (ii) creating from the captured video a full three-dimensional model of the scene, and (iii) producing, or synthesizing, from the three-dimensional model a video representation on the scene that is in accordance with the desired perspective on the scene of a viewer of the scene.

This method is thus called "immersive telepresence" because the viewer can view the scene as if immersed therein, and as if present at the scene, all in accordance with his or her desires. Namely, it appears to the viewer that, since the scene is presented as the viewer desires, the viewer is immersed in the scene. Notably, the viewer-desired perspective on the scene, and the video representation synthesized in accordance with this viewer-desired perspective, need not be in accordance with any of the video captured from any scene perspective.

The video representation can be in accordance with the position and direction of the viewer's eyes and head, and can exhibit "motional parallax". "Motional parallax" is normally and conventionally defined as a three-dimensional effect where different views on the scene are produced as the viewer moves position, making the viewer's brain comprehend that the viewed scene is three-dimensional. Motional parallax is observable even if the viewer has but one eye.

Still further, and additionally, the video representation can be stereoscopic. "Stereoscopy" is normally and conventionally defined as a three-dimensional effect where each of the viewer's two eyes sees a slightly different view on the scene, thus making the viewer's brain comprehend that the viewed scene is three-dimensional. Stereoscopy is detectable even should the viewer not move his or her head or eyes in spatial position, as is required for motional parallax.

In another of its aspects, the present invention is embodied in a method of telepresence where, again, video of a real-world scene is obtained from a multiplicity of different spatial perspectives on the scene. Again, a full three-dimensional model of the scene is created from the captured video. From this three-dimensional model a video representation on the scene that is in accordance with a predetermined criterion--selected from among criteria including a perspective on the scene, an object in the scene and an event in the scene--is produced, or synthesized.

This embodiment of the invention is thus called "interactive telepresence" because the presentation to the viewer is interactive in accordance with the criterion. Again, the synthesized video presentation of the scene in accordance with the criterion need not be, and normally is not, equivalent to any of the video captured from any scene perspective.

In this method of viewer-interactive telepresence the video representation can be in accordance with a criterion selected by the viewer, thus viewer-interactive telepresence. Furthermore, the presentation can be in accordance with the position and direction of the viewer's eyes and head, and will thus exhibit motional parallax; and/or the presentation can exhibit stereoscopy.

2. A System for Generating Immersive Video

A huge range of heretofore unobtainable, and quite remarkable, video views may be synthesized in accordance with the present invention. Nonetheless that an early consideration of exemplary video views of diverse types would likely provide significant motivation to understanding the construction, and the operation, of the immersive video system described in this section 2, discussion of these views is delayed until the next section 3. This is so that the reader, having gained some appreciation and understanding in this section 2 of the immersive video system, and process, by which the video views are synthesized, may later better place these diverse views in context.

An immersive video, or telepresence, system serves to synthesize and to present diverse video images of a real-world scene in accordance with a predetermined criterion or criteria. The criterion or criteria of presentation is (are) normally specified by, and may be changed at times and from time to time by, a viewer/user of the system. Because the criterion (criteria) is (are) changeable, the system is viewer/user-interactive, presenting (primarily) those particular video images (of a real-world scene) that the viewer/user desires to see.

The immersive video system includes a knowledge database containing information about the scene. Existence of this "knowledge database" immediately means that something about the scene is both (i) fixed and (ii) known; for example that the scene is of "a football stadium", or of "a stage", or even, despite the considerable randomness of waves, of "a surface of an ocean that lies generally in a level plane". For many reasons--including the reason that a knowledge database is required--the antithesis of a real-world scene upon which the immersive video system of the present invention may successfully operate is a scene of windswept foliage in a deep jungle.

The knowledge database may contain, for example, data regarding any of (i) the geometry of the real-world scene, (ii) potential shapes of objects in the real-world scene, (iii) dynamic behaviors of objects in the real-world scene, (iv) an internal camera calibration model, and/or (v) an external camera calibration model. For example, the knowledge base of an American football game would be something to the effect that (i) the game is played essentially in a thick plane lying flat upon the surface of the earth, this plane being marked with both (yard) lines and hash marks; (ii) humans appear in the scene, substantially at ground level; (iii) a football moves in the thick plane both in association with (e.g., running plays) and detached from (e.g., passing and kicking plays) the humans; and (iv) the locations of each of several video cameras on the football game are a priori known, or are determined by geometrical analysis of the video view received from each.

The system further includes multiple video cameras each at a different spatial location. Each of these multiple video cameras serves to produce a two-dimensional video image of the real-world scene at a different spatial perspective. Each of these multiple cameras can typically change the direction from which it observes the scene, and can typically pan and zoom, but, at least in the more rudimentary versions of the immersive video system, remains fixed in location. A classic example of multiple stationary video cameras on a real-world scene are the cameras at a sporting event, for example at an American football game.

The system also includes a viewer/user interface. A prospective viewer/user of the scene uses this interface to specify a criterion, or several criteria, relative to which he or she wishes to view the scene. This viewer/user interface may commonly be anything from head gear mounted to a boom to a computer joy stick to a simple keyboard. In ultimate applications of the immersive video system of the present invention, the viewer/user who establishes (and re-establishes) the criterion (criteria) by which an image on the scene is synthesized is the final consumer of the video images so synthesized and presented by the system. However, for more rudimentary present versions of the immersive video system, the control input(s) arising at the viewer/user interface typically arise from a human video sports director (in the case of an athletic event), from a human stage director (in the case of a stage play), or even from a computer (performing the function of a sports director or stage director). In other words, the viewing desires of the ultimate viewer/user may sometimes be translated to the immersive video system through an intermediary agent that may be either animate or inanimate.

The immersive video system includes a computer running a software program. This computer receives the multiple two-dimensional video images of the scene from the multiple video cameras, and also the viewer-specified criterion (criteria) from the viewer interface. At the present time, circa 1995, the typical computer functioning in an immersive video system is fairly powerful. It is typically an engineering work station class computer, or several such computers that are linked together if video must be processed in real time--i.e., as television. Especially if the immersive video is real time--i.e., as television--then some or all of the computers normally incorporate hardware graphics accelerators, a well-known but expensive part for this class of computer. Accordingly, the computer(s) and other hardware elements of an immersive video system are both general purpose and conventional but are, at the present time (circa 1995) typically "state-of-the-art", and of considerable cost ranging to tens, and even hundreds, of thousands of American dollars.

The system computer includes (in software and/or in hardware) (i) a video data analyzer for detecting and for tracking objects of potential interest and their locations in the scene, (ii) an environmental model builder for combining multiple individual video images of the scene to build a three-dimensional dynamic model of the environment of the scene within which three-dimensional dynamic environmental model potential objects of interest in the scene are recorded along with their instant spatial locations, (iii) a viewer criterion interpreter for correlating the viewer-specified criterion with the objects of interest in the scene, and with the spatial locations of these objects, as recorded in the dynamic environmental model in order to produce parameters of perspective on the scene, and (iv) a visualizer for generating, from the three-dimensional dynamic environmental model in accordance with the parameters of perspective, a particular two-dimensional video image of the scene.
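The four computer functions just enumerated can be pictured as a simple processing chain. The following Python sketch is illustrative only: every class, function, and field name in it is hypothetical and does not appear in this specification, and the "analysis" and "rendering" steps are trivial stand-ins for real machine vision and image synthesis.

```python
from dataclasses import dataclass, field


@dataclass
class Detection:
    """One object reported by the video data analyzer (hypothetical structure)."""
    label: str
    position: tuple  # (x, y, z) location in scene coordinates


@dataclass
class EnvironmentModel:
    """3D dynamic model: objects of interest keyed by label, with instant locations."""
    objects: dict = field(default_factory=dict)


def analyze(frames):
    """(i) Video data analyzer. Each 'frame' here is already a list of
    (label, position) pairs, which stands in for real detection and tracking."""
    return [Detection(label, pos) for frame in frames for label, pos in frame]


def build_model(detections):
    """(ii) Environmental model builder. Merges the per-camera detections of the
    same label by averaging their reported positions (an assumed fusion rule)."""
    grouped = {}
    for d in detections:
        grouped.setdefault(d.label, []).append(d.position)
    model = EnvironmentModel()
    for label, positions in grouped.items():
        n = len(positions)
        model.objects[label] = tuple(sum(p[i] for p in positions) / n for i in range(3))
    return model


def interpret(criterion, model):
    """(iii) Viewer criterion interpreter. Turns 'follow this object from this
    distance' into parameters of perspective."""
    target = model.objects[criterion["follow"]]
    eye = (target[0], target[1] - criterion["distance"], target[2])
    return {"eye": eye, "look_at": target}


def visualize(view, model):
    """(iv) Visualizer. A placeholder that only reports what would be rendered."""
    return f"render from {view['eye']} toward {view['look_at']}"


frames = [
    [("football", (10.0, 20.0, 1.0))],  # as seen by camera 1
    [("football", (12.0, 20.0, 1.0))],  # as seen by camera 2
]
model = build_model(analyze(frames))
view = interpret({"follow": "football", "distance": 5.0}, model)
print(visualize(view, model))
```

The point of the sketch is the ordering of the stages: the viewer's criterion is applied to the model, not to any single camera's image, which is why the synthesized view need not coincide with any captured view.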

The computer function (i)--the video data analyzer--is a machine vision function. The function can presently be performed quite well and quickly, especially if (i) specialized video digitalizing hardware is used, and/or (ii) simplifying assumptions about the scene objects are made. Primarily because of the scene model builder next discussed, abundant simplifying assumptions are both well and easily made in the immersive video system of the present invention. For example, it is assumed that, in a video scene of an American football game, the players remain essentially in and upon the thick plane of the football field, and do not "fly" into the airspace above the field.

The views provided by an immersive video system in accordance with the present invention not yet having been discussed, it is somewhat premature to explain how a scene object that is not in accordance with the model may suffer degradation in presentation. More particularly, the scene model is not overly particular as to what appears within the scene, but it is particular as to where within (the volume of) the scene an object to be modeled appears. Consider, for example, that the immersive video system can fully handle a scene-intrusive object that is not in accordance with prior simplifications--for example, a spectator or many spectators or a dog or even an elephant walking onto a football field during or after a football game--and can process these unexpected objects, and object movements quite as well as any other. However, it is necessary that the modeled object should appear within a volume of the real-world scene whereat the scene model is operational--basically that volume portion of the scene where the field of view of multiple cameras overlap. For example, a parachutist parachuting into a football stadium may not be "well-modeled" by the system when he/she is high above the field, and outside the thick plane, but will be modeled quite well when finally near, or on, ground level. By modeling "quite well", it is meant that, while the immersive video system will readily permit a viewer to examine, for example, the dentition of the quarterback if he or she is interested in staring the quarterback "in the teeth", it is very difficult for the system (especially initially, and in real time as television), to process through a discordant scene occurrence, such as the stadium parachutist, so well so as to permit the examination of his or her teeth also when the parachutist is still many meters above the field.
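The requirement that an object lie where the fields of view of multiple cameras overlap can be illustrated with a small geometric test. The Python sketch below is a simplified, top-down (two-dimensional) illustration under assumed values; the camera tuples, the two-view minimum, and the function names are hypothetical and are not taken from this specification.

```python
import math


def sees(camera, point):
    """Return True if 'point' lies inside the camera's horizontal field of view.
    camera = (x, y, heading_degrees, half_fov_degrees); all values are assumed."""
    cx, cy, heading, half_fov = camera
    bearing = math.degrees(math.atan2(point[1] - cy, point[0] - cx))
    # Smallest signed angle between the bearing to the point and the camera heading.
    diff = (bearing - heading + 180.0) % 360.0 - 180.0
    return abs(diff) <= half_fov


def is_modelable(cameras, point, minimum_views=2):
    """Treat a point as modelable when at least 'minimum_views' cameras see it."""
    return sum(sees(c, point) for c in cameras) >= minimum_views


cameras = [
    (0.0, 0.0, 45.0, 30.0),     # corner camera looking diagonally across the field
    (100.0, 0.0, 135.0, 30.0),  # opposite corner camera looking back across
]
midfield = (50.0, 50.0)
far_corner = (100.0, 100.0)
print(is_modelable(cameras, midfield), is_modelable(cameras, far_corner))
```

In this arrangement the midfield point is covered by both cameras, while the far corner is covered by only one and would therefore fall outside the volume in which the model is operational.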

The computer function (ii)--the environmental model builder--is likely the "backbone" of the present invention. It incorporates important assumptions that, while scene specific, are generally of a common nature throughout all scenes that are of interest for viewing with the present invention.

In the first place, the environmental model is three-dimensional (3D), having both (i) static and (ii) dynamic components. The scene environmental model is not the scene image, nor is it the scene images rendered three-dimensionally. The current scene image, such as of the play action on a football field, may be, and typically is, considerably smaller than the scene environmental model which may be, for example, the entire football stadium and the objects and actors expected to be present therein. Within this three-dimensional dynamic environmental model both (i) the scene and (ii) all potential objects of interest in the scene are dynamically recorded as associated with, or "in", their proper instant spatial locations. (It should be remembered that the computer memory in which this 3D model is recorded is actually one-dimensional (1D), being but memory locations each of which is addressed by but a single one-dimensional (1D) address.) Understanding that the scene environmental model, and the representation of scene video information, in the present invention is full 3D will much simplify understanding of how the remarkable views discussed in the next section are derived.
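The parenthetical remark about one-dimensional memory can be made concrete. The Python sketch below shows one common convention for mapping a three-dimensional grid location to a single linear address and back; the function names and the grid dimensions are assumptions for illustration, and this is not asserted to be the addressing scheme actually used by the invention.

```python
def voxel_to_address(x, y, z, nx, ny):
    """Map 3D grid coordinates to a single linear memory address (x varies fastest)."""
    return x + nx * (y + ny * z)


def address_to_voxel(address, nx, ny):
    """Invert voxel_to_address, recovering (x, y, z) from a linear address."""
    z, remainder = divmod(address, nx * ny)
    y, x = divmod(remainder, nx)
    return x, y, z


nx, ny, nz = 4, 3, 2  # a tiny 4 x 3 x 2 grid, chosen only for the example
address = voxel_to_address(1, 2, 1, nx, ny)
print(address, address_to_voxel(address, nx, ny))
```

Every location in the 4 x 3 x 2 grid receives a distinct address in the range 0 to 23, so the 3D organization is a matter of interpretation laid over ordinary linear memory.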

At present there is not enough computer "horsepower" to process a completely amorphous unstructured video scene--the windy jungle--into 3D, especially in real time (i.e., as television). It is, however, eminently possible to process many scenes of great practical interest and importance into 3D if and when appropriate simplifying assumptions are made. In accordance with the present invention, these necessary simplifying assumptions are very effective, so that production of the three-dimensional video model (in accordance with the 3D environmental model) is very efficient.

First, the static "underlayment" or "background" of any scene is pre-processed into the three-dimensional video model. For example, the video model of an (empty) sports stadium--the field, field markings, goal posts, stands, etc.--is pre-processed (as the environmental model) into the three-dimensional video model. From this point on only the dynamic elements in the scene--i.e., the players, the officials, the football and the like--need be, and are, dealt with. The typically greater portion of any scene that is (at any one time) static is neither processed nor re-processed from moment to moment, and from frame to frame. It need not be so processed or re-processed because nothing has changed, nor is changing. (In some embodiments of the immersive video system, the static background is not inflexible, and may be a "rolling" static background based on the past history of elements within the video scene.)
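The separation of a pre-processed static background from the dynamic elements can be sketched with simple frame differencing. The Python example below is a minimal illustration under stated assumptions: images are small grids of intensities, the threshold and blending rate are arbitrary tuning values, and the slow blending used for the "rolling" background is an assumed technique, not one described in this specification.

```python
def changed_pixels(background, frame, threshold=10):
    """Return the set of (row, col) positions whose intensity differs from the
    stored static background by more than 'threshold' (an assumed tuning value)."""
    return {
        (r, c)
        for r, row in enumerate(frame)
        for c, value in enumerate(row)
        if abs(value - background[r][c]) > threshold
    }


def roll_background(background, frame, dynamic, rate=0.1):
    """Blend pixels that are not dynamic slowly into the background; leave the
    background untouched wherever a dynamic element currently covers it."""
    return [
        [
            bg if (r, c) in dynamic else (1 - rate) * bg + rate * value
            for c, (bg, value) in enumerate(zip(bg_row, row))
        ]
        for r, (bg_row, row) in enumerate(zip(background, frame))
    ]


background = [[100, 100, 100], [100, 100, 100], [100, 100, 100]]
frame = [[102, 100, 100], [100, 200, 100], [100, 100, 100]]

dynamic = changed_pixels(background, frame)         # only these need re-processing
rolled = roll_background(background, frame, dynamic)
print(sorted(dynamic))
```

Only the single pixel that changed substantially is flagged for further processing; the small drift at the top-left pixel is absorbed into the rolling background instead.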

Meanwhile, dynamical objects in the scene--which objects typically appear only in a minority of the scene (e.g. the football players) but which may appear in the entire scene (e.g., the crowd)--are preferably processed in one of two ways. If the computer recognition and classification algorithm can recognize--in consideration of a priori model knowledge of objects appearing in the scene (such as the football, and the football players) and where such objects will appear (in the "thick plane" of the football field)--an item in the scene, then that item will be isolated, and will be processed/re-processed into the three-dimensional video model as a multiple voxel representation. (A voxel is a three-dimensional pixel.)
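A multiple voxel representation of isolated objects might be held in a sparse structure that stores only occupied voxels. The Python sketch below is a hypothetical illustration; the class, its methods, and the sample coordinates and colors are invented for the example and are not drawn from this specification.

```python
class SparseVoxelModel:
    """Stores only occupied voxels, keyed by integer (x, y, z) coordinates."""

    def __init__(self):
        self.voxels = {}  # (x, y, z) -> (object label, color)

    def insert_object(self, label, occupied, color):
        """Record each occupied (x, y, z) cell as belonging to 'label'."""
        for xyz in occupied:
            self.voxels[xyz] = (label, color)

    def remove_object(self, label):
        """Drop every voxel of an object, e.g. before re-processing it elsewhere."""
        self.voxels = {k: v for k, v in self.voxels.items() if v[0] != label}

    def label_at(self, xyz):
        """Return the label occupying a cell, or None if the cell is empty."""
        return self.voxels.get(xyz, (None, None))[0]


model = SparseVoxelModel()
model.insert_object("football", [(5, 5, 1), (6, 5, 1)], color=(139, 69, 19))
model.insert_object("player", [(2, 3, 0), (2, 3, 1), (2, 3, 2)], color=(200, 0, 0))

# Re-process the football after it has moved: remove the old voxels, insert the new.
model.remove_object("football")
model.insert_object("football", [(9, 7, 3), (10, 7, 3)], color=(139, 69, 19))
print(len(model.voxels), model.label_at((9, 7, 3)), model.label_at((5, 5, 1)))
```

Because empty space costs nothing in this structure, storage grows with the number of isolated objects rather than with the volume of the stadium.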

Other dynamic elements of the scene that--primarily for lack of suitably different, and suitably numerous, view perspectives from multiple cameras--cannot be classified or isolated into the three-dimensional environmental model are swept up into the three-dimensional model mostly in their raw, two-dimensional, video data form. Such a dynamic, but un-isolated, video element could be, for example, the movement of a crowd doing a "wave" motion at a sports stadium, or the surface of the sea.

As will be seen, those recognized and classified objects in the three-dimensional video model--such as, for example, a football or a football player--can later be viewed (to the limits of being obscured in all two-dimensional video data streams from which the three-dimensional video scene is composed) from any desired perspective. But it is not possible to view those unclassified and un-isolated dynamic elements of the scene that are stored in the 3D video model in their 2D video data form from any random perspective. The 2D dynamic objects can indeed be dynamically viewed, but it is impossible for the system to, for example, go "behind" the moving crowd, or "under" the undulating surface of the sea.

The system and method does not truly know, of course, whether it is inserting, or warping, into the instant three-dimensional video model (that is based upon the scene environmental model) an instant video image of a football quarterback taking a drink, an instant video image of a football fan taking the same drink, or an instant video image of an elephant taking a drink. Moreover, dynamic objects can both enter (e.g. as in coming onto the imaged field of play) and exit (e.g. as in leaving the imaged field of play) the scene.

The system and method of the present invention for constructing a 3D video scene deals only with (i) the scene environmental model, and (ii) the mathematics of the pixel dynamics. What must be recognized is that, in so doing, the system and method serve to discriminate between and among raw video image data in processing such image data into the three-dimensional video model.

These assumptions that the real-world scene contains both static and dynamic elements (indeed, preferably two kinds of dynamic elements), this organization, and these expediencies of video data processing are very important. They are collectively estimated to reduce the computational requirements for the maintenance of a 3D video model of a typical real-world scene of genuine interest by a factor of from fifty to one hundred times (×50 to ×100).

However, these simplifications have a price; thankfully normally one that is so small so as to be all but unnoticeable. Portions of the scene "where the action is, or has been" are entered into the three-dimensional video model quite splendidly. Viewers normally associate such "action areas" with the center of their video or television presentation. When action spontaneously erupts at the periphery of a scene, it takes even our human brains--whose attention has been focused elsewhere (i.e., at the scene center)--several hundred milliseconds or so to recognize what has happened. So also, but in a different sense, it is possible to "sandbag" the system and method of the present invention by a spontaneous eruption of action, or dynamism, in an insufficiently multiply viewed (and thus previously unclassified) scene area. The system and method of the present invention finds it hard to discriminate, and hard to process for entrance into the three-dimensional model, a three-dimensional scene object (or actor) outside of the boundaries where it expects scene objects (or actors). Without a priori knowledge in the scene environmental model that a spectator may throw a bottle in a high arc into a sporting arena, it is hard for the system of the present invention to classify and to process either portions of the throw or the thrower--both of which are imaged outside the volume where image classification and 3D modeling transpires and both poorly covered by multiple video cameras--into the three-dimensional model so completely that the facial features of the thrower and/or the label on the bottle may--either upon an "instant replay" of the scene focusing on the area of the perpetrator or for that rare viewer who had been focusing his view to watch the crowd instead of the athletes all along--immediately be recognized. (If the original raw video data streams still exist, then it is always possible to process them better.)

It will further be understood when the synthesized scene images are finally discussed and viewed, that the 3D modeling cannot successfully transpire even on expected objects (e.g., football players) in expected volumes (e.g., on the football field) if the necessary views are obscured. For example, the immersive video system in accordance with the present invention does not make it possible to see to the bottom of a pile of tacklers (where no camera image exists, let alone do multiple camera images exist). The immersive video system in accordance with the present invention will, however, certainly permit observation of the same pile from the vantage point of a referee in order to assess, for example, an occurrence of a "piling on" penalty.

Finally, the algorithms themselves that are used to produce the three-dimensional video model are efficient.

Lastly, the system includes a video display that receives the particular two-dimensional video image of the scene from the computer, and that displays this particular two-dimensional video image of the real-world scene to the viewer/user as that particular view of the scene which is in satisfaction of the viewer/user-specified criterion (criteria).

3. Scene Views Obtainable With Immersive Video

To immediately note that a viewer/user of an immersive video system in accordance with the present invention may view the scene from any static or dynamic viewpoint--regardless that a real camera/video does not exist at the chosen viewpoint--only but starts to describe the experience of immersive video.

Literally any video image(s) can be generated. The immersive video image(s) that is (are) actually displayed to the viewer/user are ultimately, in one sense, a function of the display devices, or the arrayed display devices--i.e., the television(s) or monitor(s)--that are available for the viewer/user to view. Because, at present (circa 1995), the most ubiquitous form of these display devices--televisions and monitors--have substantially rectangular screens, most of the following explanations of the various experiences of immersive video will be couched in terms of the planar presentations of these devices. However, when in the future new display devices such as volumetric three-dimensional televisions are built--see, for example, U.S. Pat. Nos. 5,268,862 and 5,325,324 each for a THREE-DIMENSIONAL OPTICAL MEMORY--then the system of the present invention will stand ready to provide the information displayed by these devices.

3.1 Planar Video Views on a Scene

First, consider the generation of two-dimensional, planar and curved surface, video views on a scene.

Any "planar" view on the scene may be derived as the information which is present on any (straight or curved) plane (or other closed surface, such as a saddle) that is "cut" through the three-dimensional model of the scene. This "planar" surface may, of course, be positioned anywhere within the three-dimensional volume of the scene model. Literally any interior or exterior virtual video view on the scene may be derived and displayed. Video views may be presented in any aspect ratio, and in any geometric form that is supported by the particular video display, or arrayed video displays (e.g., televisions, and video projectors), by which the video imagery is presented to the viewer/user.

Next, recall that a plane is but the surface of a sphere or cylinder of infinite radius. In accordance with the present invention, a cylindrical, hemispherical, or spherical panoramic view of a video scene may be generated from any point inside or outside the cylinder, hemisphere, or sphere. For example, successive views on the scene may appear as the scene is circumnavigated from a position outside the scene. An observer at the video horizon of the scene will look into the scene as if through a window, with the scene in plan view, or, if foreshortened, as if viewing the interior surface of a cylinder or a sphere from a peephole in the surface of the cylinder or sphere. In the example of an American football game, the viewer/user could view the game in progress as if he or she "walked" at ground level, or even as if he or she "flew at low altitude", around or across the field, or throughout the volume of the entire stadium.

A much more unusual panoramic cylindrical, or spherical "surround" view of the scene may be generated from a point inside the scene. The views presented greatly surpass the crude, but commonly experienced, example of "you are there" home video where the viewer sees a real-world scene unfold as a walking video cameraman shoots video of only a limited angular, and solid angular, perspective on the scene. Instead, the scene can be made to appear--especially when the display presentation is made so as to surround the user as do the four walls of a room or as does the dome of a planetarium--to completely encompass the viewer. In the example of an American football game, the viewer/user could view the game in progress as if he or she was a player "inside" the game, even to the extent of looking "outward" at the stadium spectators.

It should be understood that where the immersive video system has no information--normally because the view is obscured to the several cameras--then no image can be presented of such a scene portion, which portion normally shows black upon presentation. This is usually not objectionable; the viewer/user does not really expect to be able to see "under" the pile of football players, or from a camera view "within" the earth. (Note, however, that when the 3D video model does contain more than just surface imagery such as, for example, the complete 3D human physiology (the "visible man"), then "navigation" "inside" solid objects, into areas that have never been "seen" by eye or by camera, and at non-normal scales of view is totally permissible.)

Notably, previous forms of displaying multi-perspective, and/or surround, video presently (circa 1995) suffer from distortion. Insofar as the view caught at the focal plane of the camera, or each camera (whether film or video) is not identical to the view recreated for the viewer, the (often composite) views suffer from distortion, and to that extent a composite view lacks "reality"--even to the point of being disconcerting. However--and considering again that each and all views presented by an immersive video system in accordance with the present invention are drawn from the volume of a three-dimensional model--there is absolutely no reason that each and every view produced by an immersive video system should not be of absolute fidelity and correct spatial relationship to all other views.

For example, consider first the well known, but complex, pincushion correction circuitry of a common television. This circuitry serves to match the information modulation of the display-generating electron beam to the slightly non-planar, pincushion-like, surface of a common cathode ray tube. If the information extracted from a three-dimensional video model in accordance with the present invention is so extracted in the contour of a common pincushion, then no correction of the information is required in presenting it on an equivalent pincushion surface of a cathode ray tube.

Taking this analogy to the next level, if a scene is to be presented on some selected panels of a Liquid Crystal Display (LCD), or walls of a room, then the pertinent video information as would constitute a perspective on the scene at each such panel or wall is simply withdrawn from the three-dimensional model. Because they are correctly spatially derived from a seamless 3D model, the video presentations on each panel or wall fit together seamlessly, and perfectly.

By now, this capability of the immersive video of the present invention should be modestly interesting. As well as commonly lacking stereoscopy, the attenuation effects of intervening atmosphere, true color fidelity, and other assorted shortcomings, two-dimensional screen views of three-dimensional real world scenes suffer in realism because of subtle systematic dimensional distortion. The surface of the two-dimensional display screen (e.g., a television) is seldom so (optically) flat as is the surface of the Charge Coupled Device (CCD) of a camera providing a scene image. The immersive video system of the present invention straightens all this out, exactly matching (in dedicated embodiments) the image presented to the particular screen upon which the image is so presented. This is, of course, a product of the 3D video model which was itself constructed from multiple video streams from multiple video cameras. It might thus be said that the immersive video system of the present invention is using the image of one (or more) cameras to "correct" the presentation (not the imaging, the presentation) of an image derived (actually synthesized in part) from another camera!

3.2 Interactive Video Views on a Scene

Second, consider that immersive video in accordance with the present invention permits machine dynamic generation of views on a scene. Images of a real-world scene may be linked at the discretion of the viewer to any of a particular perspective on the scene, an object in the scene, or an event in the scene.

For example, consider again the example of the real-world event of an American football game. A viewer/user may interactively choose to view a field goal attempt from the location of the goalpost crossbars (a perspective on the scene), watching a successful place kick sail overhead. The viewer/user may choose to have the football (an object in the scene) centered in a field of view that is 90.degree. to the field of play (i.e., a perfect "sideline seat") at all times. Finally, the viewer/user may choose to view the scene from the position of the left shoulder of the defensive center linebacker unless the football is launched airborne (as a pass) (an event in the scene) from the offensive quarterback, in which case presentation reverts to broad angle aerial coverage of the secondary defensive backs.
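The three kinds of viewing criterion in this example (a perspective, an object, and an event) can be sketched as a small dispatch routine. The Python code below is hypothetical throughout: the dictionary layout, the key names, and the sample coordinates are assumptions made for illustration and do not come from this specification.

```python
def select_view(criterion, scene):
    """Return view parameters for one frame, given a viewer criterion and the
    current scene state (object positions and the set of active events)."""
    kind = criterion["kind"]
    if kind == "perspective":
        # A fixed vantage point, e.g. the goalpost crossbar.
        return {"eye": criterion["eye"], "look_at": criterion["look_at"]}
    if kind == "object":
        # Keep a named object centered, viewed from a fixed offset.
        x, y, z = scene["objects"][criterion["target"]]
        ox, oy, oz = criterion["offset"]
        return {"eye": (x + ox, y + oy, z + oz), "look_at": (x, y, z)}
    if kind == "event":
        # Use one criterion while the named event is active, another otherwise.
        active = criterion["event"] in scene["events"]
        chosen = criterion["then"] if active else criterion["otherwise"]
        return select_view(chosen, scene)
    raise ValueError(f"unknown criterion kind: {kind}")


aerial = {"kind": "perspective", "eye": (50.0, 25.0, 40.0), "look_at": (50.0, 25.0, 0.0)}
sideline = {"kind": "object", "target": "football", "offset": (0.0, -30.0, 2.0)}
criterion = {"kind": "event", "event": "pass", "then": aerial, "otherwise": sideline}

scene = {"objects": {"football": (30.0, 20.0, 1.0)}, "events": set()}
before_pass = select_view(criterion, scene)

scene["events"].add("pass")
during_pass = select_view(criterion, scene)
print(before_pass["eye"], during_pass["eye"])
```

Before the pass the view tracks the football from a sideline offset; once the event is flagged, the same criterion yields the aerial view without further input from the viewer.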

The present and related inventions serve to make each and any viewer of a video or a television depicting a real-world scene to be his or her own proactive editor of the scene. The viewer as "editor" has the ability to interactively dictate and select--in advance of the unfolding of the scene, and by high-level command--any reasonable parameter or perspective by which the scene will be depicted, as and when the scene unfolds.

3.3 Stereoscopic Video Views on a Scene

Third, consider that (i) presentations in consideration of motion parallax, and (ii) stereoscopy, are inherent in immersive video in accordance with the present invention.

Scene views are constantly generated by reference to the content of a dynamic three-dimensional model--which model is sort of a three-dimensional video memory without the storage requirement of a one-to-one correspondence between voxels (solid pixels) and memory storage addresses. Consider stereoscopy. It is "no effort at all" for an immersive video system to present, as a selected stream of video data containing a selected view, first scan time video data and second scan time video data that is displaced, each relative to the other, in accordance with the location of each object depicted along the line of view.

This is, of course, the basis of stereoscopy. When one video stream is presented in a one color, or, more commonly at present, at a one time or in a one polarization, while the other video stream is presented in a separate color, or at a separate time, or in an orthogonal polarization, and each stream is separately gated to the eye (at greater than the eye flicker fusion frequency=70 Hz) by action of colored glasses, or time-gated filters, or polarizing filters, then the image presented to the eyes will appear to be stereoscopic, and three-dimensional. The immersive video of the present invention, with its superior knowledge of the three-dimensional spatial positions of all objects in a scene, excels in such stereoscopic presentations (which stereoscopic presentations are, alas, impossible to show on the two-dimensional pages of the drawings).
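The displacement between the two eye views, which depends on each object's location along the line of view, can be illustrated with the standard pinhole stereo relation d = f * B / Z (disparity equals focal length times eye separation divided by depth). The Python sketch below assumes that relation; the function names and the default eye separation and focal length are illustrative assumptions, not values given in this specification.

```python
def disparity_pixels(depth_m, eye_separation_m=0.065, focal_length_px=1000.0):
    """Horizontal shift, in pixels, between the left-eye and right-eye images of a
    point at the given depth, using the pinhole relation d = f * B / Z."""
    if depth_m <= 0:
        raise ValueError("depth must be positive")
    return focal_length_px * eye_separation_m / depth_m


def stereo_pair(point_x_px, depth_m):
    """Return (left_x, right_x) image columns for one point, splitting the
    disparity evenly about the point's single-eye (monocular) column."""
    half = disparity_pixels(depth_m) / 2.0
    return point_x_px + half, point_x_px - half


near = disparity_pixels(2.0)    # an object 2 m from the viewer
far = disparity_pixels(50.0)    # an object 50 m from the viewer
left_x, right_x = stereo_pair(320.0, 10.0)
print(near > far, left_x > right_x)
```

Near objects are displaced more than far ones, which is why a model that records the spatial position of every object can produce both eye views directly.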

Presentations in consideration of motion parallax require feedback to the immersive video system of the position and orientation of the viewer's head and eyes. Once this is known, however, as from a helmet worn by the viewer, the system can easily synthesize and present the views appropriate to viewer eye position and orientation, even to the extent of exhibiting motion parallax.

3.4 A Combination of Visual Reality and Virtual Reality

Fourth, the immersive video presentations of the present invention are clearly susceptible of combination with the objects, characters and environments of artificial reality.

Computer models and techniques for the generation and presentation of artificial reality commonly involve three-dimensional organization and processing, even if only for tracing light rays for both perspective and illumination. The central, "cartoon", characters and objects are often "finely wrought", and commonly appear visually pleasing. Alas, equal attention cannot be paid to each and every element of a scene, and the scene background to the focus characters and objects is often either stark, or unrealistic, or both.

Immersive video in accordance with the present invention provides the vast, relatively inexpensive, "database" of the real world (at all scales, time compressions/expansions, etc.) as a suitable "field of operation" (or "playground") for the characters of virtual reality.

When it is considered that immersive video permits viewer/user interactive viewing of a scene, then it is straightforward to understand that a viewer/user may "move" in and through a scene in response to what he/she "sees" in a composite scene of both a real, and an artificial virtual, nature. It is therefore possible, for example, to interactively flee from a "dinosaur" (a virtual animal) appearing in the scene of a real world city. It is therefore possible, for example, to strike a virtual "baseball" (a virtual object) appearing in the scene of a real world baseball park. It is therefore possible, for example, to watch a "tiger", or a "human actor" (both real) appearing in the scene of a virtual landscape (which landscape has been laid out in consideration of the movements of the tiger or the actor).

Note that (i) visual reality and (ii) virtual reality can, in accordance with the present invention, be combined with (1) a synthesis of real/virtual video images/television pictures of a combination real-world/virtual scene wherein the synthesized pictures are to user-specified parameters of presentation, e.g. panoramic or at magnification if so desired by the user, and/or (2) the synthesis of said real/virtual video images/television pictures can be 3D stereoscopic.

4. The Method of the Present Invention, In Brief

In brief, the present invention assumes, and uses, a three-dimensional model of the (i) static, and (ii) dynamic, environment of a real-world scene--a three-dimensional, environmental, model.

Portions of each of multiple video streams showing a single scene, each from a different spatial perspective, that are identified to be (then, at the instant) static by a running comparison are "warped" onto the three-dimensional environmental model. This "warping" may be into 2D (static) representations within the 3D model--e.g., a football field as is permanently static or even a football bench as is only normally static--or, alternatively, as a reconstructed 3D (static) object--e.g., the goal posts.

The dynamic part of each video stream (that arises from a particular perspective) is likewise "warped" onto the three-dimensional environmental model. Normally the "warping" of dynamic objects is into reconstructed three-dimensional (dynamic) objects--e.g., a football player. This is for the simple reason that dynamic objects in the scene are of primary interest, and it is they that will later likely be important in synthesized views of the scene. However, the "warping" of a dynamic object may also be into a two-dimensional representation--e.g., the stadium crowd producing a wave motion.
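One way to obtain such a three-dimensional reconstruction, noted in the description of FIG. 4, is to intersect the viewing frustums of an object as seen from several cameras. The sketch below illustrates the idea in a deliberately simplified form. It assumes three orthographic views along the coordinate axes in place of calibrated perspective cameras, and the function name and the silhouette layout are hypothetical.

```python
def carve(size, silhouettes):
    """Keep the voxels of a size^3 grid whose projection falls inside the
    object silhouette in every view.

    silhouettes maps 'xy', 'xz' and 'yz' to sets of 2D cells that are
    foreground when the grid is viewed along the remaining axis
    (an assumed, simplified stand-in for real camera frustums)."""
    kept = set()
    for x in range(size):
        for y in range(size):
            for z in range(size):
                if ((x, y) in silhouettes["xy"]
                        and (x, z) in silhouettes["xz"]
                        and (y, z) in silhouettes["yz"]):
                    kept.add((x, y, z))
    return kept
```

A voxel survives only if no view contradicts it, so each additional camera can only shrink the reconstructed volume toward the true shape.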

Simple changes in video data determine whether an object is (then) static or dynamic.
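A running comparison of this kind can be sketched as a per-pixel test against a slowly updated reference image. This is a minimal illustration only. The running-average reference, the threshold of 25 gray levels and the function names are assumptions, not details taken from the specification.

```python
def update_background(background, frame, alpha=0.05):
    """Blend the newest frame into the running reference image.
    Frames are lists of rows of gray levels; alpha is an assumed rate."""
    return [[(1 - alpha) * b + alpha * p for b, p in zip(brow, prow)]
            for brow, prow in zip(background, frame)]


def dynamic_mask(background, frame, threshold=25):
    """True where a pixel differs from the reference by more than the
    threshold (treated as dynamic), False where it is (then) static."""
    return [[abs(p - b) > threshold for b, p in zip(brow, prow)]
            for brow, prow in zip(background, frame)]
```

Pixels marked False would be candidates for the static representations, and pixels marked True would belong to dynamic objects.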

The environmental model itself determines whether any scene portion or scene object is to be warped onto itself as a two-dimensional representation or as a reconstructed three-dimensional object. The reasons no attempt is made to reconstruct everything in three dimensions are twofold. First, video data is lacking to model everything in and about the scene in three dimensions--e.g., the underside of the field or the back of the crowd are not within any video stream. Second, and more importantly, there is insufficient computational power to reconstruct a three-dimensional video representation of everything that is within a scene, especially in real time (i.e., as television).

Any desired scene view is then synthesized (alternatively, "extracted") from the representations and reconstituted objects that are (both) within the three-dimensional model, and is displayed to a viewer/user.

The synthesis/extraction may be in accordance with a viewer specified criterion, and may be dynamic in accordance with such criterion. For example, the viewer of a football game may request a consistent view from the "fifty yard line", or may alternatively ask to see all plays from a stadium view at the line of scrimmage. The views presented may be dynamically selected in accordance with an object in the scene, or an event in the scene.
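The difference between a fixed view and a view that is dynamic in accordance with an object can be sketched as a small function mapping a viewing criterion to a virtual camera position at each instant. The criterion dictionary, its keys and the function name below are hypothetical and serve only to illustrate the idea.

```python
def viewpoint_for(criterion, object_positions):
    """Return the (x, y, z) position of the virtual camera for this instant.

    criterion: {'kind': 'fixed', 'position': (x, y, z)} or
               {'kind': 'follow', 'object': name, 'offset': (dx, dy, dz)}
    object_positions: current object name -> (x, y, z) from the 3D model.
    """
    kind = criterion["kind"]
    if kind == "fixed":
        # E.g. a consistent view from the fifty yard line.
        return criterion["position"]
    if kind == "follow":
        # E.g. a view that tracks the line of scrimmage from play to play.
        ox, oy, oz = object_positions[criterion["object"]]
        dx, dy, dz = criterion["offset"]
        return (ox + dx, oy + dy, oz + dz)
    raise ValueError("unknown criterion kind: %r" % (kind,))
```

Re-evaluating the function whenever the model updates makes the "follow" view move with its object, while the "fixed" view stays where it is.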

Any interior or exterior perspectives on the scene may be presented. For example, the viewer may request a view looking into a football game from the sideline position of a coach, or may request a view looking out of the football game at the coach from the then position of the quarterback on the football field. Any requested view may be panoramic, or at any aspect ratio, in presentation. Views may also be magnified, or reduced in size.

Finally, any and all views can be rendered stereoscopically, as desired.

The synthesized/extracted video views may be processed in real time, as television.

Any and all synthesized/extracted video views contain only as much information as is within any of the multiple video streams; no video view can contain information that is not within any video stream, and will simply show black (or white) in this area.
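This limitation can be made concrete with a small compositing sketch. Each camera's contribution to the synthesized view is assumed to have been warped already into the virtual image, with None wherever that camera sees nothing. The function name and the data layout are assumptions made for illustration.

```python
BLACK = (0, 0, 0)


def composite_view(warped_views, fill=BLACK):
    """Combine per-camera contributions into one synthesized image.

    warped_views: non-empty list of equally sized images (lists of rows),
    ordered by preference; each pixel is an (r, g, b) tuple or None.
    Pixels that no camera covers are set to the fill colour."""
    height = len(warped_views[0])
    width = len(warped_views[0][0])
    out = [[fill] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            for view in warped_views:
                if view[y][x] is not None:
                    out[y][x] = view[y][x]  # first camera that sees it wins
                    break
    return out
```

No pixel is invented: every output value is either copied from some video stream or is the fill colour.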

5. The System of the Present Invention, In Brief

In brief, the computer system of the present invention receives multiple video images of views on a real-world scene, and serves to synthesize a video image of the scene which synthesized image is not identical to any of the multiple received video images.

The computer system includes an information base containing a geometry of the real-world scene, shapes and dynamic behaviors expected from moving objects in the scene, plus, additionally, internal and external camera calibration models on the scene.

A video data analyzer means detects and tracks objects of potential interest in the scene, and the locations of these objects.

A three-dimensional environmental model builder records the detected and tracked objects at their proper locations in a three-dimensional model of the scene. This recording is in consideration of the information base.

A viewer interface is responsive to a viewer of the scene to receive a viewer selection of a desired view on the scene. This selected and desired view need not be identical to any views that are within any of the multiple received video images.

Finally, a visualizer generates (alternatively, "synthesizes") (alternatively "extracts") from the three-dimensional model of the scene, and in accordance with the received desired view, a video image on the scene that so shows the scene from the desired view.
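The flow of data among these components can be sketched as follows. This is a structural illustration only. Every class, field and function name is hypothetical, detection is replaced by ready-made (name, position) pairs, and the "visualizer" returns a far-to-near drawing order instead of rendering pixels.

```python
from dataclasses import dataclass, field


@dataclass
class InformationBase:
    """Prior knowledge of the scene; here only its bounds (assumed)."""
    bounds: tuple = ((0.0, 100.0), (0.0, 50.0), (0.0, 10.0))

    def contains(self, position):
        return all(lo <= c <= hi for c, (lo, hi) in zip(position, self.bounds))


@dataclass
class EnvironmentModel:
    """Three-dimensional model: object name -> (x, y, z) location."""
    objects: dict = field(default_factory=dict)


def analyze_video(detections):
    """Video data analyzer stand-in: passes through (name, position) pairs
    that a real detector and tracker would derive from the video."""
    return list(detections)


def build_model(model, tracked, info):
    """Model builder: record tracked objects, consulting the information
    base to reject locations outside the known scene geometry."""
    for name, position in tracked:
        if info.contains(position):
            model.objects[name] = position
    return model


def visualize(model, viewpoint):
    """Visualizer stand-in: object names ordered far-to-near as seen from
    the desired viewpoint (a drawing order, not an image)."""
    def dist2(item):
        return sum((a - b) ** 2 for a, b in zip(item[1], viewpoint))
    ordered = sorted(model.objects.items(), key=dist2, reverse=True)
    return [name for name, _ in ordered]
```

The viewer interface would supply the viewpoint argument. The visualizer consults only the model, never the cameras directly, which is why the desired view need not coincide with any received video image.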

These and other aspects and attributes of the present invention will become increasingly clear upon reference to the following drawings and accompanying specification.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1, consisting of FIGS. 1a through 1c, is a diagrammatic view showing how immersive video in accordance with the present invention uses video streams from multiple strategically-located cameras that monitor a real-world scene from different spatial perspectives.

FIG. 2 is a schematic block diagram of the software architecture of the immersive video system in accordance with the present invention.

FIG. 3 is a pictorial view showing how the video data analyzer portion of the immersive video system of the present invention detects and tracks objects of potential interest and their locations in the scene.

FIG. 4 is a diagrammatic view showing how, in an immersive video system in accordance with the present invention, the three-dimensional (3D) shapes of all moving objects are found by intersecting the viewing frustums of objects found by the video data analyzer; two views of a full three-dimensional model generated by the environmental model builder of the immersive video system of the present invention for an indoor karate demonstration being particularly shown.

FIG. 5 is a pictorial view showing how, in the immersive video system in accordance with the present invention, a remote viewer is able to walk through, and observe a scene from anywhere using virtual reality control devices such as the boom shown here.

FIG. 6, consisting of FIGS. 6a through 6d, is original video frames showing video views from four cameras simultaneously recording the scene of a campus courtyard at a particular instant of time.

FIG. 7 is four selected virtual camera, or synthetic video, images taken from a 116-frame "walk through" sequence generated by the immersive video system in accordance with the present invention (color differences in the original color video are lost in monochrome illustration).

FIG. 8, consisting of FIGS. 8a through 8c, are synthetic video images generated from original video by the immersive video system in accordance with the present invention, the synthetic images respectively showing a "bird's eye view", a ground level view, and a panoramic view of the same courtyard previously seen in FIG. 6 at the same instant of time.

FIG. 9a is a graphical rendition of the 3D environment model generated for the same time instant shown in FIG. 6b, the volume of voxels in the model intentionally being at a scale sufficiently coarse so that the 3D environmental model of two humans appearing in the scene may be recognized without being so fine that it cannot be recognized that it is only a 3D model, and not an image, that is depicted.

FIG. 9b is a graphical rendition of the full 3D environment model generated by the environmental model builder of the immersive video system of the present invention for an indoor karate demonstration as was previously shown in FIG. 4, the two human participants being clothed in karate clothing with a kick in progress, the scale and the resolution of the model being clearly observable.

FIG. 9c is another graphical rendition of the full 3D environment model generated by the environmental model builder of the immersive video system of the present invention, this time for an outdoor karate demonstration, this time the environmental model being further shown to be located in the static scene, particularly of an outdoor courtyard.

FIG. 10, consisting of FIGS. 10a through 10h, are successive synthetic video frames created by the immersive video system of the present invention at various user-specified viewpoints during an entire performance of an outdoor karate exercise by an actor in the scene, the 3D environmental model of which outdoor karate exercise was previously seen in FIG. 9c.

FIG. 11 is a listing of Algorithm 1, the Vista "Compositing" or "Hypermosaicing" Algorithm, in accompaniment to a diagrammatic representation of the terms of the algorithm, of the present invention where, at each time instant, multiple vistas are computed using the current dynamic model and video streams from multiple perspectives; for stereoscopic presentations vistas are created from left and from right cameras.

FIG. 12 is a listing of Algorithm 2, the Voxel Construction and Visualization for Moving Objects Algorithm in accordance with the present invention.

FIG. 13, consisting of FIGS. 13a through 13c, are successive synthetic video frames, similar to the frames of FIG. 10, created by the immersive video system of the present invention at various user-specified viewpoints during an entire performance of an indoor karate exercise by an actor in the scene, the virtual views of an indoor karate exercise of FIG. 13 being rendered at a higher resolution than were the virtual views of the outdoor karate exercise of FIG. 10.

FIG. 14, consisting of FIGS. 14 and 14a, respectively show left eye image and right eye image synthetic video frames of the indoor karate exercise previously seen in FIG. 13.

FIG. 15, consisting of FIGS. 15a and 15b, are views respectively similar to FIGS. 14 and 14a again respectively showing a left eye image and a right eye image synthetic video frames of the indoor karate exercise previously seen in FIG. 13.

FIG. 16, consisting of FIGS. 16a through 16b, are synthetic video frames, similar to the frames of FIGS. 10 and 13, created by the immersive video system of the present invention at various user-specified viewpoints during an entire performance of a basketball game, the virtual views of the basketball game of FIG. 16 being rendered at a still higher resolution than were the virtual views of the outdoor karate exercise of FIG. 10 or the indoor karate exercise of FIG. 13.

FIG. 17 is a block diagram of the preferred hardware system for realizing immersive video in accordance with the present invention.

DESCRIPTION OF THE PREFERRED EMBODIMENT

Because it provides a comprehensive visual record of environment activity, video data is an attractive source of information for the creation of "virtual worlds" which, notwithstanding their being virtual, incorporate some "real world" fidelity. The present invention concerns the use of multiple streams of video data for the creation of immersive, "visual reality", environments.

The immersive video system of the present invention for so synthesizing "visual reality" from multiple streams of video data is based on, and is a continuance of the inventors' previous work directed to Multiple Perspective Interactive Video (MPI-Video), which work is the subject of the related predecessor patent application. An immersive video system incorporates the MPI-Video architecture, which architecture provides the infrastructure for the processing and the analysis of multiple streams of video data.

The MPI-Video portion of the immersive video system (i) performs automated analysis of the raw video and (ii) constructs a model of the environment and object activity within the environment. This model, together with the raw video data, can be used to create immersive video environments. This is the most important, and most difficult, functional portion of the immersive video system. Accordingly, this MPI-Video portion of the immersive video system is first discussed, and actual results from an immersive "virtual" walk through as processed by the MPI-Video portion of the immersive video system are presented.

1. The Motivation for Immersive Video

As computer applications that model and interact with the real-world increase in numbers and types, the term "virtual world" is becoming a misnomer. These applications, which require accurate and real-time modeling of actions and events in the "real world" (e.g., gravity), interact with a world model either directly (e.g., "telepresence") or in a modified form (e.g., augmented reality). A variety of mechanisms can be employed to acquire data about the "real world" which is then used to construct a model of the world for use in a "virtual" representation.

Long established as a predominant medium in entertainment and sports, video is now emerging as a medium of great utility in science and engineering as well. It thus comes as little surprise that video should find application as a "sensor" in the area of "virtual worlds." Video is especially useful in cases where such "virtual worlds" might usefully incorporate a significant "real world" component. These cases turn out to be both abundant and important; basically because we all live in, and interact with, the real world, and not inside a computer video game. Therefore, those sensations and experiences that are most valuable, entertaining and pleasing to most people most of the time are sensations and experiences of the real world, or at least sensations and experiences that have a strong real-world component. Man cannot thrive on fantasy alone (which state is called insanity); a good measure of reality is required.

In one such use of video as a "sensor", multiple video cameras cover a dynamic, real-world, environment. These multiple video data streams are a useful source of information for building, first, accurate three-dimensional models of the events occurring in the real world, and, then, completely immersive environments. Note that the immersive environment does not, in accordance with the present invention, come straight from the real world environment. The present invention is not simply a linear, brute-force, processing of two-dimensional (video) data into a three-dimensional (video) model (and the subsequent uses thereof). Instead, in accordance with the present invention, the immersive environment comes to exist through a three-dimensional model, particularly a model of real-world dynamic events. This will later become clearer such as in, inter alia, the discussion of FIG. 16a.

In the immersive video system of the present invention, visu