To Infinity And Beyond

Capturing the world one voxel at a time


About

We have all grown up watching our superheroes come to life in the Marvel cinematic universe. Don't you think it is time to wear a suit and join them? With the rise of the metaverse and the reality of virtual worlds, we could realize our dreams of being vigilantes of the virtual universe. Our volumetric performance capture pipeline aims to simplify the entire capture process, enabling everyone to live their dreams. Unlike traditional capture systems, which range from high-end light-stage rigs to multi-DSLR camera setups, we aim to create our relightable virtual replicas from single iPhone-based captures. We use 3D deep learning technology to drive our volumetric performance capture and rendering pipeline. Our virtual counterparts are captured with PBR texture components, enabling them to be rendered in different environments, from your 3D-scanned home to the depths of Mordor in LOTR. And with digital fashion assets, your options are boundless: become the next Batman or Superman, or the next diva to walk the ramp in France. Try out high-end fashion without shelling out big bucks. Party with your heroes in their favorite costumes. The models would be compatible with standard software stacks like Unity3D, Unreal Engine, or Blender.

Further, the models would be animated using markerless motion capture (mocap) technology. At a high level, our pipeline can be broken down into four major blocks: (a) Full Body Capture, (b) Animation Performance Capture, (c) Relightable Environment Renders, and (d) Digital Fashion Asset Creation. Our solution puts the ability to produce Hollywood-style animation in everyone's hands, using just a mobile phone.

Overview

As mentioned in the previous section, there are four major blocks. The first block involves full-body capture, where we scan a stationary person, preferably in an 'A' pose or a 'T' pose. This capture acts as a canonical representation from which we can animate all other poses. The captured model is decomposed into its component diffuse and specular albedos and microscopic surface normals. The next block involves learning the weights used to deform the canonical pose towards a target pose, which requires extracting the performance animation; the extracted poses are used to animate the canonical representation and create our target performance. The next block deals with placing our relightable models in the target environment. The target 3D environments are generated either by static scanning or by reusing 3D environments already created and shared by users. The final block involves the creation of digital fashion assets such as virtual costumes, jewellery, and accessories (bags, belts, hats, etc.). This enables people to produce virtual photoshoots, VFX shots, and ads at low cost.

Volumetric Performance Capture

Let's dive a little deeper into the individual blocks and understand the process in detail.

Full Body Capture

[Figures: scanning process, full-body scan]

A full-body scan involves one person standing in an 'A' pose while another person walks around them, capturing LiDAR-based iPhone scans. From the RGB-D (color and depth) images, we use a hybrid implicit-explicit representation (namely DMTet) and a differentiable renderer (namely Nvdiffrast) to recover the model geometry and textures (diffuse and specular albedos and surface normals). Unlike the rest of the body, the deformations caused by facial expressions are complex, so we capture a separate high-resolution canonical pose for the face.
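
As a rough sketch of what the optimization inside this block might look like, the snippet below uses nvdiffrast's PyTorch API to fit per-vertex albedo and geometry to the captured views with a photometric loss. The DMTet step (re-extracting the mesh from an SDF grid every iteration) is omitted to keep the example short, and the names `fit_capture` and `views` are illustrative, not part of our codebase.

```python
# A rough sketch of the fitting loop, assuming nvdiffrast's PyTorch API.
# The DMTet mesh re-extraction is omitted; names here are illustrative.
import torch
import torch.nn.functional as F
import nvdiffrast.torch as dr

def fit_capture(views, verts, faces, albedo, iters=500, res=512):
    """views: list of (mvp 4x4, target HxWx3 image) pairs from the capture.
    verts [V,3] and albedo [V,3] must have requires_grad=True; faces is int32 [F,3]."""
    glctx = dr.RasterizeCudaContext()
    opt = torch.optim.Adam([verts, albedo], lr=1e-2)
    for _ in range(iters):
        for mvp, target in views:
            # Project to clip space and rasterize from this camera.
            v_hom = torch.cat([verts, torch.ones_like(verts[:, :1])], dim=-1)
            v_clip = (v_hom @ mvp.T).unsqueeze(0)                      # [1, V, 4]
            rast, _ = dr.rasterize(glctx, v_clip, faces, resolution=[res, res])
            # Interpolate per-vertex albedo over the raster and antialias.
            color, _ = dr.interpolate(albedo.unsqueeze(0), rast, faces)
            pred = dr.antialias(color, rast, v_clip, faces)            # [1, res, res, 3]
            loss = F.l1_loss(pred[0], target)                          # photometric loss
            opt.zero_grad()
            loss.backward()
            opt.step()
    return verts, albedo
```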

Performance Animation Capture

[Figures: full-body animation rig, face performance capture]

Having captured the canonical mesh representation, the next step is to capture the actual performance. In this stage, we use state-of-the-art pose estimation (namely OpenPose, combined with the depth information) to recover the 3D pose of the performance. With the target performance and the estimated poses, we learn the blending weights for the different poses by minimising the reconstruction error on the rendered images. Here too we run a separate stage to capture the facial performance and learn blending weights anchored on the facial keypoints. With the blending weights and the estimated pose, we can build a rig for our replica models, enabling them to be animated for any performance.
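
The deformation itself can be expressed as standard linear blend skinning, where each canonical vertex is moved by a weighted combination of per-joint transforms. A minimal NumPy sketch (names are illustrative):

```python
# A minimal linear-blend-skinning sketch: each canonical vertex is deformed by
# a weighted sum of per-joint rigid transforms derived from the estimated pose.
import numpy as np

def skin_vertices(canonical_verts, joint_transforms, blend_weights):
    """canonical_verts: [V, 3] rest-pose mesh (the 'A'-pose capture)
    joint_transforms:  [J, 4, 4] per-joint transforms for the target pose
                       (relative to the rest pose)
    blend_weights:     [V, J] learned skinning weights, rows summing to 1"""
    V = canonical_verts.shape[0]
    v_hom = np.concatenate([canonical_verts, np.ones((V, 1))], axis=1)        # [V, 4]
    per_vertex_T = np.einsum('vj,jab->vab', blend_weights, joint_transforms)  # [V, 4, 4]
    posed = np.einsum('vab,vb->va', per_vertex_T, v_hom)                      # blend, then apply
    return posed[:, :3]
```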

Relightable Environment Renders

[Figures: faces in different environments, HDRI environment map]

The final stage involves rendering the performance in the specified environment. For this stage, we create 3D rigid-body models of the world and extract an HDRI environment map of the scanned 3D environment. Using this environment map, we place our animated 3D models in the scene and render them to match the captured 3D world. With this setup in place, we can render our performance at photorealistic quality under varying environments.
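
To make the lighting step concrete, the sketch below shows one common way to use an equirectangular HDRI map for image-based lighting: a direction is mapped to (u, v) coordinates for a radiance lookup, and diffuse irradiance at a surface point is estimated by Monte Carlo sampling over the hemisphere. The coordinate convention and function names are illustrative assumptions, not our exact renderer.

```python
# A sketch of image-based lighting from an equirectangular HDRI map: a direction
# is mapped to (u, v) for a radiance lookup, and diffuse irradiance is estimated
# by Monte Carlo sampling over the hemisphere around the surface normal.
import numpy as np

def sample_hdri(hdri, direction):
    """hdri: [H, W, 3] equirectangular map (linear radiance); direction: unit vector (y up)."""
    x, y, z = direction
    u = np.arctan2(x, -z) / (2 * np.pi) + 0.5          # longitude -> [0, 1]
    v = np.arccos(np.clip(y, -1.0, 1.0)) / np.pi       # latitude  -> [0, 1]
    h, w, _ = hdri.shape
    return hdri[min(int(v * h), h - 1), min(int(u * w), w - 1)]

def diffuse_irradiance(hdri, normal, n_samples=256, rng=np.random.default_rng(0)):
    """Monte Carlo estimate of diffuse irradiance arriving at a surface with this normal."""
    total = np.zeros(3)
    for _ in range(n_samples):
        d = rng.normal(size=3)
        d /= np.linalg.norm(d)                          # uniform direction on the sphere
        cos_theta = float(d @ normal)
        if cos_theta < 0:                               # mirror into the upper hemisphere
            d, cos_theta = -d, -cos_theta
        total += sample_hdri(hdri, d) * cos_theta
    return total * (2 * np.pi / n_samples)              # uniform hemisphere pdf = 1 / (2*pi)
```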

Digital Fashion Asset Creation

[Figures: dress and jewellery assets]

We can create digital assets using existing clothing-design software like clo3d; however, here we plan to scan real-life fashion assets into digital virtual assets. Just as we scan the full body in an 'A' pose, here too we have human models wear the fashion assets, scan them, and create the virtual assets. Instead of scanning individual assets separately, we plan to capture the human model wearing all the accessories at once, from multi-view RGB-D images. We then use human part segmentation to extract individual accessories and generate their 3D virtual assets. This also enables one to generate digital assets from images available on the internet.
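
As a sketch of the extraction step, the snippet below back-projects the depth pixels belonging to one segmented part into a colored point cloud using the camera intrinsics; the label ids and the function name are illustrative.

```python
# A sketch of the extraction step: back-project the depth pixels belonging to
# one segmented accessory into a colored point cloud using the camera intrinsics.
import numpy as np

def accessory_point_cloud(depth, rgb, seg, K, label):
    """depth: [H, W] in metres, rgb: [H, W, 3], seg: [H, W] integer part labels,
    K: 3x3 camera intrinsics, label: integer id of the accessory class to extract."""
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    vs, us = np.nonzero((seg == label) & (depth > 0))   # pixels on the accessory
    z = depth[vs, us]
    x = (us - cx) * z / fx                              # pinhole back-projection
    y = (vs - cy) * z / fy
    points = np.stack([x, y, z], axis=1)                # [N, 3] in the camera frame
    colors = rgb[vs, us]                                # [N, 3] matching colors
    return points, colors
```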

Task Breakdown

  • Learning pose parameters and generating the static mesh using the point cloud [2 weeks]
    • Use OpenPose to estimate facial keypoints
    • Capture RGBD images and try to estimate the head rotation
      • Keypoint-based RANSAC algorithm (see the sketch after this list)
    • Preprocessing to identify frames with the best rotation match
      • Use depth information for unclear regions
      • Frames with the best rotation estimates
      • Least blurry images
      • A minimum of four pictures
    • Canonical Representation
      • Front facing face
  • Learning environment lighting and material properties (Texture Learning) [2 weeks]
    • PBR optimization across all the different pictures
    • Learn environment lighting
    • Learn separate normals for each color channel to render things better
  • Refining the initial static mesh by jointly refining the pose and the DMTet representation [3 weeks, less precise estimate]
    • Refine by assuming smooth surfaces (surface-normal smoothness prior)
    • Refine environment lighting and material properties using the images
  • Learning a riggable model using FaceScape [4 weeks, least precise estimate]
    • Fit a FaceScape model to the learnt mesh
    • Map the FaceScape weights onto the learnt mesh via a Chamfer-distance fit (see the sketch after this list)
      • Gives a basic riggable model
    • Use displacement maps learned from actual motions to capture person-specific deformations
    • Facescape (3DMM/SMPLX) based model fitting (Useful for rigging)
      • 51 expressions controlled by 51 blendshape coefficients. Identify expressions (happy, sad, etc.) using OpenCV-style classifiers and transfer them as coefficient values to rig the facial expression. How to do this seamlessly is not fully clear yet.
      • FACS-based system
      • Identify the mesh weights for deformations based on keypoints
      • Mapping new facial expressions to existing facial expressions
      • Mapping keypoints from another face to this face
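
For reference, the keypoint-based head-rotation step above could be implemented with the Kabsch algorithm wrapped in a small RANSAC loop; the sketch below is a rough illustration under that assumption, not our exact implementation.

```python
# Kabsch recovers the rigid rotation between matched 3D facial keypoints, and a
# small RANSAC loop makes it robust to bad detections. Thresholds are illustrative.
import numpy as np

def kabsch(src, dst):
    """Rotation R and translation t minimizing || (src @ R.T + t) - dst ||."""
    src_c, dst_c = src - src.mean(0), dst - dst.mean(0)
    U, _, Vt = np.linalg.svd(src_c.T @ dst_c)
    d = np.sign(np.linalg.det(Vt.T @ U.T))               # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst.mean(0) - R @ src.mean(0)
    return R, t

def ransac_rotation(src, dst, iters=200, thresh=0.01, rng=np.random.default_rng(0)):
    """src, dst: [N, 3] matched 3D keypoints (e.g. OpenPose face points lifted with depth)."""
    best = (np.eye(3), np.zeros(3), 0)
    for _ in range(iters):
        idx = rng.choice(len(src), size=4, replace=False)     # small random subset
        R, t = kabsch(src[idx], dst[idx])
        err = np.linalg.norm(src @ R.T + t - dst, axis=1)
        inliers = int((err < thresh).sum())
        if inliers > best[2]:
            best = (R, t, inliers)
    return best[0], best[1]
```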
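
Similarly, the Chamfer-distance term mentioned for mapping the FaceScape weights onto the learnt mesh could look like the following brute-force sketch, suitable for small point samples drawn from the two meshes.

```python
# A brute-force Chamfer-distance sketch for the FaceScape fitting step.
import numpy as np

def chamfer_distance(a, b):
    """a: [N, 3], b: [M, 3] point samples from the learnt mesh and the FaceScape fit."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)   # [N, M] pairwise squared distances
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()  # symmetric nearest-neighbor terms
```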

References