To Infinity And Beyond

Capturing the world one voxel at a time


About

We have all grown up watching our superheroes come to life in the Marvel cinematic universe. Don't you think it is time to wear a suit and join them? With the rise of the metaverse and the reality of virtual worlds, we could realize our dreams of being vigilantes of the virtual universe. Our volumetric performance capture pipeline aims to simplify the entire capture process, enabling everyone to live their dreams. Unlike traditional capture systems, which range from high-end light-stage rigs to multi-DSLR camera setups, we aim to create our relightable virtual replicas from single iPhone-based captures. We use 3D deep learning technology to drive our volumetric performance capture and rendering pipeline. Our virtual counterparts are captured with PBR texture components, enabling them to be rendered in different environments, from your 3D-scanned home to the depths of Mordor in LOTR. And with digital fashion assets, your options are boundless: become the next Batman or Superman, or the next diva to walk the ramp in France. Try out high-end fashion without shelling out big bucks. Party with your heroes in their favorite costumes. The models would be compatible with standard software stacks like Unity3D, Unreal Engine, or Blender.

Further, the models would be animated using markerless motion capture (mocap) technology. At a high level, our pipeline can be broken down into four major blocks: (a) Full Body Capture, (b) Animation Performance Capture, (c) Relightable Environment Renders, and (d) Digital Fashion Asset Creation. Our solution puts the ability to produce Hollywood-style animation in everyone's hands, using just a mobile phone.

Overview

As mentioned in the previous section, there are four major blocks. The first block involves full-body capture, where we scan a stationary person, preferably in an 'A' pose or a 'T' pose. This capture acts as a canonical representation from which we can animate all other poses. The captured model is decomposed into its component diffuse and specular albedos and microscopic surface normals. The next block involves learning the weights used to deform the canonical pose towards a target pose, which requires extracting the performance animation; the extracted poses are used to animate the canonical representation and create our target performance. The next block deals with placing our relightable models in the target environment. The target 3D environments are generated either by static scanning or by reusing 3D environments already created and shared by users. The final block involves the creation of digital fashion assets such as virtual costumes, jewellery, and accessories (bags, belts, hats, etc.). This enables people to produce virtual photoshoots, VFX shots, and ads at low cost.

Volumetric Performance Capture

Let's dive a little deeper into the individual blocks and understand the process in detail.

Full Body Capture

[Figures: scanning process, full-body scan]

A full-body scan involves one person standing in an 'A' pose while another person walks around them, capturing LiDAR-based iPhone scans. From the RGB-D (color and depth) images, we use a hybrid implicit-explicit representation (namely DMTet) and a differentiable renderer (namely Nvdiffrast) to recover the model geometry and textures (diffuse and specular albedos and surface normals). Unlike the rest of the body, the deformations caused by facial expressions are complex, so we capture a separate high-resolution canonical pose for the face.
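
As a rough sketch of what the optimization inside this block might look like, the snippet below uses nvdiffrast's PyTorch API to fit per-vertex albedo and geometry to the captured views with a photometric loss. The DMTet step (re-extracting the mesh from an SDF grid every iteration) is omitted to keep the example short, and the names `fit_capture` and `views` are illustrative, not part of our codebase.

```python
# A rough sketch of the fitting loop, assuming nvdiffrast's PyTorch API.
# The DMTet mesh re-extraction is omitted; names here are illustrative.
import torch
import torch.nn.functional as F
import nvdiffrast.torch as dr

def fit_capture(views, verts, faces, albedo, iters=500, res=512):
    """views: list of (mvp 4x4, target HxWx3 image) pairs from the capture.
    verts [V,3] and albedo [V,3] must have requires_grad=True; faces is int32 [F,3]."""
    glctx = dr.RasterizeCudaContext()
    opt = torch.optim.Adam([verts, albedo], lr=1e-2)
    for _ in range(iters):
        for mvp, target in views:
            # Project to clip space and rasterize from this camera.
            v_hom = torch.cat([verts, torch.ones_like(verts[:, :1])], dim=-1)
            v_clip = (v_hom @ mvp.T).unsqueeze(0)                      # [1, V, 4]
            rast, _ = dr.rasterize(glctx, v_clip, faces, resolution=[res, res])
            # Interpolate per-vertex albedo over the raster and antialias.
            color, _ = dr.interpolate(albedo.unsqueeze(0), rast, faces)
            pred = dr.antialias(color, rast, v_clip, faces)            # [1, res, res, 3]
            loss = F.l1_loss(pred[0], target)                          # photometric loss
            opt.zero_grad()
            loss.backward()
            opt.step()
    return verts, albedo
```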

Performance Animation Capture

[Figures: full-body animation rig, face performance capture]

Having captured the canonical mesh representation, the next step is to capture the actual performance. In this stage, we use state-of-the-art pose estimation (namely OpenPose, combined with the depth information) to recover the 3D pose of the performance. With the target performance and the estimated poses, we learn the blending weights for the different poses by minimising the reconstruction error on the rendered images. Here too we run a separate stage to capture the facial performance and learn blending weights anchored on the facial keypoints. With the blending weights and the estimated pose, we can build a rig for our replica models, enabling them to be animated for any performance.
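
The deformation itself can be expressed as standard linear blend skinning, where each canonical vertex is moved by a weighted combination of per-joint transforms. A minimal NumPy sketch (names are illustrative):

```python
# A minimal linear-blend-skinning sketch: each canonical vertex is deformed by
# a weighted sum of per-joint rigid transforms derived from the estimated pose.
import numpy as np

def skin_vertices(canonical_verts, joint_transforms, blend_weights):
    """canonical_verts: [V, 3] rest-pose mesh (the 'A'-pose capture)
    joint_transforms:  [J, 4, 4] per-joint transforms for the target pose
                       (relative to the rest pose)
    blend_weights:     [V, J] learned skinning weights, rows summing to 1"""
    V = canonical_verts.shape[0]
    v_hom = np.concatenate([canonical_verts, np.ones((V, 1))], axis=1)        # [V, 4]
    per_vertex_T = np.einsum('vj,jab->vab', blend_weights, joint_transforms)  # [V, 4, 4]
    posed = np.einsum('vab,vb->va', per_vertex_T, v_hom)                      # blend, then apply
    return posed[:, :3]
```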

Relightable Environment Renders

[Figures: faces in different environments, HDRI environment map]

The final stage involves rendering the performance in the specified environment. For this stage, we create 3D rigid-body models of the world and extract an HDRI environment map of the scanned 3D environment. Using this environment map, we place our animated 3D models in the scene and render them to match the captured 3D world. With this setup in place, we can render our performance at photorealistic quality under varying environments.
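
To make the lighting step concrete, the sketch below shows one common way to use an equirectangular HDRI map for image-based lighting: a direction is mapped to (u, v) coordinates for a radiance lookup, and diffuse irradiance at a surface point is estimated by Monte Carlo sampling over the hemisphere. The coordinate convention and function names are illustrative assumptions, not our exact renderer.

```python
# A sketch of image-based lighting from an equirectangular HDRI map: a direction
# is mapped to (u, v) for a radiance lookup, and diffuse irradiance is estimated
# by Monte Carlo sampling over the hemisphere around the surface normal.
import numpy as np

def sample_hdri(hdri, direction):
    """hdri: [H, W, 3] equirectangular map (linear radiance); direction: unit vector (y up)."""
    x, y, z = direction
    u = np.arctan2(x, -z) / (2 * np.pi) + 0.5          # longitude -> [0, 1]
    v = np.arccos(np.clip(y, -1.0, 1.0)) / np.pi       # latitude  -> [0, 1]
    h, w, _ = hdri.shape
    return hdri[min(int(v * h), h - 1), min(int(u * w), w - 1)]

def diffuse_irradiance(hdri, normal, n_samples=256, rng=np.random.default_rng(0)):
    """Monte Carlo estimate of diffuse irradiance arriving at a surface with this normal."""
    total = np.zeros(3)
    for _ in range(n_samples):
        d = rng.normal(size=3)
        d /= np.linalg.norm(d)                          # uniform direction on the sphere
        cos_theta = float(d @ normal)
        if cos_theta < 0:                               # mirror into the upper hemisphere
            d, cos_theta = -d, -cos_theta
        total += sample_hdri(hdri, d) * cos_theta
    return total * (2 * np.pi / n_samples)              # uniform hemisphere pdf = 1 / (2*pi)
```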

Digital Fashion Asset Creation

[Figures: dress and jewellery assets]

We can create digital assets using existing clothing-design software like clo3d; however, here we plan to scan real-life fashion assets into digital virtual assets. Just as we scan the full body in an 'A' pose, here too we have human models wear the fashion assets, scan them, and create the virtual assets. Instead of scanning individual assets separately, we plan to capture the human model wearing all the accessories at once, from multi-view RGB-D images. We then use human part segmentation to extract individual accessories and generate their 3D virtual assets. This also enables one to generate digital assets from images available on the internet.
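
As a sketch of the extraction step, the snippet below back-projects the depth pixels belonging to one segmented part into a colored point cloud using the camera intrinsics; the label ids and the function name are illustrative.

```python
# A sketch of the extraction step: back-project the depth pixels belonging to
# one segmented accessory into a colored point cloud using the camera intrinsics.
import numpy as np

def accessory_point_cloud(depth, rgb, seg, K, label):
    """depth: [H, W] in metres, rgb: [H, W, 3], seg: [H, W] integer part labels,
    K: 3x3 camera intrinsics, label: integer id of the accessory class to extract."""
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    vs, us = np.nonzero((seg == label) & (depth > 0))   # pixels on the accessory
    z = depth[vs, us]
    x = (us - cx) * z / fx                              # pinhole back-projection
    y = (vs - cy) * z / fy
    points = np.stack([x, y, z], axis=1)                # [N, 3] in the camera frame
    colors = rgb[vs, us]                                # [N, 3] matching colors
    return points, colors
```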

Task Breakdown

  • Learning pose parameters and generating the static mesh using the point cloud [2 weeks]
    • Use OpenPose to estimate facial keypoints
    • Capture RGBD images and try to estimate the head rotation
      • Keypoint-based RANSAC algorithm (see the sketch after this list)
    • Preprocessing to identify frames with the best rotation match
      • Use depth information for unclear regions
      • Frames with the best rotation estimates
      • Least blurry images
      • A minimum of four pictures
    • Canonical Representation
      • Front facing face
  • Learning environment lighting and material properties (Texture Learning) [2 weeks]
    • PBR optimization across all the different pictures
    • Learn environment lighting
    • Learn separate normals for each color channel to render things better
  • Refining the initial static mesh by jointly refining the pose and the DMTet representation [3 weeks, less precise estimate]
    • Refine by assuming smooth surfaces (surface-normal smoothness prior)
    • Refine environment lighting and material properties using the images
  • Learning a riggable model using FaceScape [4 weeks, least precise estimate]
    • Fit a FaceScape model to the learnt mesh
    • Map the FaceScape weights onto the learnt mesh via a Chamfer-distance fit (see the sketch after this list)
      • Gives a basic riggable model
    • Use displacement maps learned from actual motions to capture person-specific deformations
    • Facescape (3DMM/SMPLX) based model fitting (Useful for rigging)
      • 51 expressions controlled by 51 blendshape coefficients. Identify expressions (happy, sad, etc.) using OpenCV-style classifiers and transfer them as coefficient values to rig the facial expression. How to do this seamlessly is not fully clear yet.
      • FACS-based system
      • Identify the mesh weights for deformations based on keypoints
      • Mapping new facial expressions to existing facial expressions
      • Mapping keypoints from another face to this face
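
For reference, the keypoint-based head-rotation step above could be implemented with the Kabsch algorithm wrapped in a small RANSAC loop; the sketch below is a rough illustration under that assumption, not our exact implementation.

```python
# Kabsch recovers the rigid rotation between matched 3D facial keypoints, and a
# small RANSAC loop makes it robust to bad detections. Thresholds are illustrative.
import numpy as np

def kabsch(src, dst):
    """Rotation R and translation t minimizing || (src @ R.T + t) - dst ||."""
    src_c, dst_c = src - src.mean(0), dst - dst.mean(0)
    U, _, Vt = np.linalg.svd(src_c.T @ dst_c)
    d = np.sign(np.linalg.det(Vt.T @ U.T))               # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst.mean(0) - R @ src.mean(0)
    return R, t

def ransac_rotation(src, dst, iters=200, thresh=0.01, rng=np.random.default_rng(0)):
    """src, dst: [N, 3] matched 3D keypoints (e.g. OpenPose face points lifted with depth)."""
    best = (np.eye(3), np.zeros(3), 0)
    for _ in range(iters):
        idx = rng.choice(len(src), size=4, replace=False)     # small random subset
        R, t = kabsch(src[idx], dst[idx])
        err = np.linalg.norm(src @ R.T + t - dst, axis=1)
        inliers = int((err < thresh).sum())
        if inliers > best[2]:
            best = (R, t, inliers)
    return best[0], best[1]
```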
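
Similarly, the Chamfer-distance term mentioned for mapping the FaceScape weights onto the learnt mesh could look like the following brute-force sketch, suitable for small point samples drawn from the two meshes.

```python
# A brute-force Chamfer-distance sketch for the FaceScape fitting step.
import numpy as np

def chamfer_distance(a, b):
    """a: [N, 3], b: [M, 3] point samples from the learnt mesh and the FaceScape fit."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)   # [N, M] pairwise squared distances
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()  # symmetric nearest-neighbor terms
```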

References