The emerging MPEG-4 standard specifies an object-based audiovisual representation framework that integrates both natural and synthetic content. Tools supporting 3D facial animation will be standardized for the first time. To support facial animation decoders with different degrees of complexity, MPEG-4 uses a profiling strategy, which foresees the specification of object types, profiles and levels suited to the various relevant application classes. This paper first gives an overview of the MPEG-4 facial animation technology. It then describes the IST implementation of an MPEG-4 facial animation system and briefly evaluates the performance of the various standardized tools, using the MPEG-4 test material.
The fast evolution of digital technology in the last decade has deeply transformed the way in which information, notably visual information, is generated, processed, transmitted, stored and consumed. More and more applications now include visual information, and users are becoming increasingly interactive in their relationship with it. With the digital revolution, it became possible to exploit more deeply a well-known concept: the more is known about the content, the better its representation, processing, etc. can be in terms of efficiency, efficacy and allowed functionalities. In fact, in the world of natural visual data (video), strong limitations result from the way in which video data is acquired and subsequently represented, the so-called frame-based representation. This video data model is the basis of all the analogue and digital video representation standards available today, namely PAL, SECAM, NTSC, H.261, H.263, MPEG-1 and MPEG-2.
Recognizing that audiovisual content should be represented using a framework that gives the user as many real-world-like capabilities as possible, the Moving Picture Experts Group (MPEG) decided, in 1993, to launch a new project, known as MPEG-4. MPEG-4 is the first audiovisual representation standard to model an audiovisual scene as a composition of audiovisual objects with specific characteristics and behavior, notably in space and time 1,2. The object composition approach allows MPEG-4 to support new functionalities, such as content-based interaction and manipulation, as well as improvements to already available functionalities, such as coding efficiency and error resilience, simply by using the most adequate coding technology for each type of data 2.
One of the most exciting and powerful consequences of the object-based approach is the integration of natural and synthetic content. Until now, the natural and synthetic audiovisual worlds have evolved largely in parallel. The MPEG-4 representation approach allows the composition of natural and synthetic data in the same scene, unifying these two separate but complementary worlds. This unification allows MPEG-4 to efficiently represent natural as well as synthetic visual data without undue translations, such as the conversion of synthetic models to pixel-based representations. Another powerful consequence of this strategy is that the conversion of a composition of various natural and synthetic objects to the pixel and audio sample domains is deferred to the receiving terminal, where locally specified user controls and viewing/listening conditions may determine the final content experience.
To fulfill the proposed objectives, notably in the area of synthetic content, MPEG created a new sub-group, called Synthetic and Natural Hybrid Coding (SNHC), whose task in practice was to address the issues related to synthetic data, notably representation and synchronization. After a long collaborative process, MPEG-4 Version 1 reached, in October 1998, the Final Draft International Standard (FDIS) status, which is the very last stage before an ISO International Standard (IS) and includes technology that is fully mature and deeply tested. The SNHC topics which found their way into the MPEG-4 Systems 3, Visual 4 and Audio 5 Final Draft International Standards are: 3D facial animation, wavelet texture coding, mesh coding with texture mapping, media integration of text and graphics, text-to-speech synthesis (TTS) and structured audio. The SNHC technology standardized in MPEG-4 Version 1 will support applications based on or including text and 2D/3D graphics capabilities, such as multimedia broadcasting and presentations, virtual talking humans, advanced inter-personal communication systems, games, story-telling, language teaching, speech rehabilitation, tele-shopping, tele-learning, etc. More ambitious goals will be pursued with MPEG-4 Version 2 6, which will complement MPEG-4 Version 1 with new tools providing additional functionalities, such as body animation. Each stage of MPEG-4 Version 2 is foreseen to happen about one year after the corresponding stage for Version 1.
Among the technologies developed in the SNHC sub-group and to be standardized in MPEG-4 Version 1, 3D facial animation assumes a special relevance, since the use of 3D model-based coding applied to human facial models may bring significant advantages, notably under critical bandwidth conditions. The standardization of those parts of this technology that are essential to guarantee interoperability may significantly accelerate the deployment of applications using synthetic human heads, which represent real or fictitious humans. The animation of 3D facial models requires animation data, which may be synthetically generated or extracted by analysis from real faces, depending on the application. Analysis is usually a complex task with precise constraints, which strongly depend on the application conditions. As a consequence, analysis may be real-time or off-line, fully automatic or human-guided. While videotelephony-like applications typically require real-time and fully automatic facial analysis, story-telling applications may allow off-line, human-guided analysis. Other applications, such as tele-shopping and gaming, may not need any analysis at all, since the intention may be not to reproduce a real face but just to create an entertaining one, which may be accomplished by means of a facial animation editor.
Besides giving an overview of the MPEG-4 facial animation technology, this paper describes the implementation of the MPEG-4 facial animation system developed at Instituto Superior Técnico (IST). The system follows the MPEG-4 Visual and Systems FDIS, issued in October 1998 3,4. Section 2 describes the facial animation technology included in the MPEG-4 Version 1 FDIS, as well as the decisions regarding facial animation profiling. Section 3 describes the MPEG-4 facial animation system implemented at IST, and section 4 presents results, allowing a first evaluation of MPEG-4 facial animation technology.