3D Volumetric Video Capture

Capturing the world one voxel at a time

[ABOUT] - [OVERVIEW] - [ARCHITECTURE] - [NEURAL RENDERING PIPELINE] - [PROGRESS] - [REFERENCES] - [CODE (under dev)] - [OTHER PROBLEMS]

About

Videos and photos capture the happy, memorable moments we share with our friends and family. They take us back to the place and time of the event, helping us reminisce. But what if we could capture these fleeting moments in 3D? Store and share them with your friends and family the way you share photos and videos? Invite them into that moment so we could all relive it? We are building a system that recreates an immersive experience to bring your memories to life in VR. Want to ride along? Mail me at nitthilan@gmail.com

Overview

To give an overview of our solution: a user scans the 3D environment they want to store as a video, capturing it from different viewing angles. They can use multiple cameras to capture more information about the environment, such as actions. We use artificial intelligence to reconstruct and encode the 3D environment from the different videos captured by the user. Unlike other methods that use laser or depth sensors to estimate point clouds, we use only the scanned RGB monocular videos to reconstruct the environment. Finally, the captured 3D environment is viewed in a VR headset as a 360-degree stereoscopic panorama. A stereoscopic photo is a pair of images taken simultaneously with two lenses placed like our eyes, about 65 mm apart and looking in the same direction. When presented to the two eyes by a stereoscope, these images give most people the impression of seeing a 3D space. A stereoscopic panorama is a pair of 360-degree images which, when viewed with synchronized pano viewers (like a VR headset), presents a stereo pair. Further, as the user moves around in the environment, we generate the corresponding stereo pair based on the user's position to give them an immersive 3D experience.


Deep Neural Radiance Field-based 3D Volumetric Video Capture, Rendering, and Streaming

Let's dive into the details of our solution. To provide a user with an immersive experience, the whole media pipeline, from capture through storage, rendering, and streaming, should be capable of handling 3D volumetric video data. Volumetric video capture is a technique that digitizes a three-dimensional space (i.e., a volume of space), object, or environment. The captured object can be digitized and transferred to the web, mobile, or virtual worlds and viewed in 3D. It does not have a set viewpoint, so the end user can watch and interact with it from all angles, enhancing their experience and heightening their sense of immersion and engagement. The difference between 360-degree video and volumetric video is the depth that comes with volume: in a 360-degree video, users can only view the action from a single, fixed position, whereas with volumetric video the end user can play director and control how far in or out they want to explore the scene.


First, to capture 3D volumetric data, we require a multi-camera system that captures the action from different views. After capturing the data as a set of videos, we process the videos to break the scene down into a less important background region and a center-stage foreground area where the actual action happens. This lets us encode the background with less detail and the foreground with higher precision. To further enhance quality, we separate the objects in the foreground region into rigid bodies and non-rigid human bodies. In effect, we split the scene into three prominent regions: (a) foreground rigid bodies, (b) foreground non-rigid (human) bodies, and (c) the background region, and process them separately as three different pipelines. Unlike other solutions, which encode objects as meshes and textures, we encode each region using a deep neural radiance field-based solution. Next, to provide an immersive visualization, the three regions are rendered separately into a stereo 360-degree panoramic image for each time instant and streamed to a Unity VR app running on a VR headset. Thus we break our system down into the following components:

(a) Multi-view capture system

We design a capture system that ranges from 1 to N monocular RGB cameras. This could be as simple as one or more mobile cameras capturing a performance or scene from different angles. Since our setup is simple, it enables anyone to create content that can be consumed immersively. Each capture is timestamped so that captures from different cameras can be correlated for a particular time instant. These video streams are then uploaded to a cloud server.
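As an illustration of how the timestamped captures could be correlated across cameras, here is a minimal sketch; the data layout (per-camera lists of (timestamp, frame path)) and the tolerance value are assumptions, not part of the implemented system.

```python
# Sketch: group frames from N cameras into time-aligned sets (assumed data layout).
from bisect import bisect_left

def align_frames(streams, tolerance_ms=20):
    """streams: dict camera_id -> sorted list of (timestamp_ms, frame_path).
    Returns a list of dicts mapping camera_id -> frame_path, one per aligned instant."""
    ref_id, ref = next(iter(streams.items()))
    aligned = []
    for t, ref_frame in ref:
        group = {ref_id: ref_frame}
        for cam_id, frames in streams.items():
            if cam_id == ref_id or not frames:
                continue
            times = [ts for ts, _ in frames]
            i = bisect_left(times, t)
            # pick the closest neighbour among times[i-1] and times[i]
            candidates = [j for j in (i - 1, i) if 0 <= j < len(times)]
            j = min(candidates, key=lambda k: abs(times[k] - t))
            if abs(times[j] - t) <= tolerance_ms:
                group[cam_id] = frames[j][1]
        aligned.append(group)
    return aligned
```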

(b) Split neural encoder, renderer and compositor

The next stage splits the data into foreground and background, and the foreground into rigid and non-rigid (human) bodies. Each of these components is then encoded separately into a deep neural network. The background region is encoded using a NeRF++-based network. The foreground requires optimized rendering (or inference), and several solution options remain to be investigated, among them KiloNeRF and PlenOctrees. Finally, the non-rigid bodies are encoded using Neural Body-based network solutions. The scene is composited by rendering the individual neural networks encoded for the three separate regions, based on the position and viewing-angle feedback from the user. The composited stereo 360-degree panorama image is then passed on to the next stage to be compressed and streamed.
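As a rough illustration of the compositing step, the sketch below alpha-composites separately rendered region layers front to back; the RGBA-plus-depth layer format and the median-depth ordering are simplifying assumptions (the rendering of each neural network is assumed to have happened upstream).

```python
# Sketch: alpha-"over" compositing of separately rendered region layers.
import numpy as np

def composite_layers(layers):
    """layers: list of (rgb HxWx3, alpha HxW, depth HxW), one per region
    (foreground rigid, foreground non-rigid, background). Composites near-to-far per pixel."""
    h, w, _ = layers[0][0].shape
    out = np.zeros((h, w, 3), dtype=np.float32)
    transmittance = np.ones((h, w, 1), dtype=np.float32)
    # order layers front-to-back by median depth (a simplification; a real
    # compositor would resolve depth per pixel or per sample)
    for rgb, alpha, _ in sorted(layers, key=lambda l: np.median(l[2])):
        a = alpha[..., None]
        out += transmittance * a * rgb
        transmittance *= (1.0 - a)
    return out
```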

(c) 360 Stereo video encoder

The final stage encodes the composited stereo image, compresses it into a video stream (H.264, MPEG, MJPEG), and streams it to the VR headset. A Unity-based VR app displays the decoded stream on a skybox shader to provide an immersive visualization of the captured sequence to the end user.
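A minimal sketch of the packing step, assuming composited top-bottom stereo frames are available as arrays; it uses OpenCV's VideoWriter, whereas the real pipeline could equally call a dedicated H.264/MPEG encoder and a streaming server.

```python
# Sketch: pack a sequence of composited top-bottom stereo panoramas into an MP4 file.
import cv2

def encode_stereo_stream(frames, out_path="stereo360.mp4", fps=30):
    """frames: iterable of HxWx3 uint8 BGR images (top-bottom stereo panoramas)."""
    writer = None
    for frame in frames:
        if writer is None:
            h, w = frame.shape[:2]
            fourcc = cv2.VideoWriter_fourcc(*"avc1")  # use "mp4v" if no H.264 encoder is available
            writer = cv2.VideoWriter(out_path, fourcc, fps, (w, h))
        writer.write(frame)
    if writer is not None:
        writer.release()
```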

Progress

Stage 2

  • Check the detailed architecture for the current implementation

  • Dataloader and feature extractor module based on depth - 1 day - Done

  • Data generation using iPhone/iPad Pro - 0.5 day - Done

  • Inverse depth space nerf++ renderer - 1 day

  • Training loop for non-rigid neural renderer - 2 days - Done

    • 3D sparse convolution
    • Ray-casting-based rendering using a volumetric prediction head
    • Dataloader and cost function implementation
  • Testing and Debugging (Completion of basic pipeline) - 2 days

  • Experiment with surface prediction head - 2 days

  • Non-rigid volume renderer - 3 days

    • Pose estimation using LiDAR
    • Volume feature generator using ray triangle intersection
    • Testing and Debugging
  • Unity VR application using hand gestures and locomotion - 3 days

Stage 1

  • Individual baseline modules are available for each stage of the pipeline
  • Implemented an end-to-end pipeline from capturing a video to rendering it and visualizing the output on a VR headset
  • Baseline modules used:
    • Colmap: To learn camera parameters
    • NeRF++: Encoding unbounded scenes using an inverted sphere parameterization
    • Unity Oculus VR App: Rendering stereo 360-degree panorama images at different camera positions
  • Learnings:
    • Colmap takes exponentially more time to identify camera parameters as the number of images to be registered grows. The mapper module spends about 17 minutes on roughly 200 images, and this increases to 1-2 hours with 500 images at larger image dimensions
    • Training time for NeRF++: 18-24 hours on a 4-GPU system
    • Slow inference times: 14 seconds per 4000x4000 image (top-bottom stereo)
  • Using 360-degree panorama stereo may not work, so initially try plain planar stereo like the demo videos
  • End-to-end pipeline with the best qualities picked from different NeRF/MVS implementations
  • Reduce the time across the whole pipeline. Major bottlenecks:
    • Colmap based structure from motion (SfM) estimation
    • Overfitting the NeRF module with excessive input views
    • Depth estimation for feature mapping and dense voxel processing over the whole volume
  • The whole pipeline is implemented as a neural radiance field-based solution

Stage 0

  • The encoding (or storage) is time-consuming since it depends on the training time which is at present in the order of hours
  • The streaming delay would still be significant since it is done through the cloud

Neural Encoder [Architecture and Pipeline]

In this section we explain the proposed neural encoder architecture, where we take the best aspects of the various solutions proposed in the literature and combine them into a pipeline that addresses the pitfalls of the individual solutions. The architecture is generic, addressing both rigid and non-rigid body captures in an unbounded environment. Further, it simplifies the capture mechanism to a single iPhone video with LiDAR depth information. The major pitfall of the earlier solutions used in Stage 1 was the long training time needed to represent the 3D volumetric data using neural networks. This long training time was due to (a) estimation of intrinsic and extrinsic camera parameters using the Structure-from-Motion approach in Colmap; we instead utilize the IMU data available on iPhones to estimate them, (b) estimation of the depth of surfaces for mapping voxel features; we utilize the LiDAR depth information to generate a feature point cloud that we can map to particular voxels, and (c) overfitting of NeRF-based variants that use excessive input images to approximate the 3D volume; we utilize ResNet convolution features extracted from the images to bootstrap the neural rendering process, in turn reducing the training time. Further, we use Bayesian optimization techniques to choose an appropriate subset of input images to reduce the training time.


Overview

The neural encoder takes as input a series of images captured from different angles, their corresponding camera parameters (intrinsic and extrinsic), and the LiDAR depth information. The pipeline splits the 3D world into two different spaces: (a) a unit depth volume space and (b) an inverse depth volume space. The region of interest, defined by where the action happens, is normalized to lie within a unit cube; this region is called the unit volume space. Using a unit cube instead of a sphere helps in splitting the region into voxels, thereby speeding up inference in the neural rendering pipeline. We use a voxel-based neural encoding approach to learn the 3D volume/surface. The region outside the normalized unit cube is mapped to an inverse depth volume, and we use a single NeRF MLP to encode this region. Finally, when a user requests an output image from a new camera position, we render values from both regions and combine them for display to the user.
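A minimal sketch of this split, assuming the region of interest is given as an axis-aligned bounding box; the helper names are hypothetical.

```python
# Sketch: normalize the region of interest to a unit cube and route sample points to the
# two branches (voxel encoder inside, inverse-depth NeRF outside).
import numpy as np

def normalize_to_unit_cube(points, roi_min, roi_max):
    """Map the region of interest [roi_min, roi_max] to [0, 1]^3."""
    return (points - roi_min) / (roi_max - roi_min)

def split_points(points, roi_min, roi_max):
    p = normalize_to_unit_cube(points, roi_min, roi_max)
    inside = np.all((p >= 0.0) & (p <= 1.0), axis=-1)
    return p[inside], points[~inside]   # unit-volume samples, far-field samples
```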

Pipeline

Inverse Depth Volume Space

In the inverse depth space, as proposed by NeRF++, we convert every 3D point outside the unit region to its inverse and parameterize the NeRF MLP with (x', y', z', r') as shown in the figure. This maps all radii from 1 to infinity to values between 1 and 0.
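A minimal sketch of this reparameterization for a point with radius r > 1; the function name is ours.

```python
# Sketch of the inverted-sphere reparameterization described above: a point with radius
# r > 1 is represented as (x/r, y/r, z/r, 1/r), so the last coordinate lies in (0, 1].
import numpy as np

def invert_point(p, eps=1e-8):
    """p: ...x3 points outside the unit region. Returns ...x4 (x', y', z', r')."""
    r = np.maximum(np.linalg.norm(p, axis=-1, keepdims=True), eps)
    return np.concatenate([p / r, 1.0 / r], axis=-1)  # fed to the far-field NeRF MLP
```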

Depth Volume Space

The depth volume space is more complex, since we capture both rigid and non-rigid (human action) sequences with this pipeline. We split it into two stages: (a) Volume Feature Generator and (b) Neural Voxel Renderer.

Volume Feature Generator

The volume feature generator module takes images captured from multiple camera directions, passes them through a ResNet-based convolutional feature generator, and then, using the LiDAR depth information, maps the features to voxels inside the unit cube. Features mapped from different images to the same voxel are averaged to estimate the voxel feature, as sketched below.
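A minimal sketch of this mapping, assuming per-pixel features at image resolution, metric LiDAR depth, and known intrinsics/extrinsics; a running sum and count are kept so that features from multiple views can be averaged.

```python
# Sketch: unproject pixel features into the unit-cube voxel grid using depth and average
# features that land in the same voxel.
import numpy as np

def map_features_to_voxels(feat, depth, K, cam2world, grid_res=128):
    """feat: HxWxC per-pixel features, depth: HxW metric depth, K: 3x3 intrinsics,
    cam2world: 4x4 pose. Returns (sum_grid, count_grid) for running averages across views."""
    h, w, c = feat.shape
    v, u = np.mgrid[0:h, 0:w]
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(np.float32)
    rays = pix @ np.linalg.inv(K).T                      # camera-space directions at depth 1
    pts_cam = rays * depth.reshape(-1, 1)                # back-project with LiDAR depth
    pts_h = np.concatenate([pts_cam, np.ones((len(pts_cam), 1))], axis=-1)
    pts_world = (pts_h @ cam2world.T)[:, :3]             # assumed already normalized to [0,1]^3
    idx = np.clip((pts_world * grid_res).astype(int), 0, grid_res - 1)
    sum_grid = np.zeros((grid_res,) * 3 + (c,), dtype=np.float32)
    count_grid = np.zeros((grid_res,) * 3 + (1,), dtype=np.float32)
    np.add.at(sum_grid, (idx[:, 0], idx[:, 1], idx[:, 2]), feat.reshape(-1, c))
    np.add.at(count_grid, (idx[:, 0], idx[:, 1], idx[:, 2]), 1.0)
    return sum_grid, count_grid   # voxel feature = sum_grid / max(count_grid, 1)
```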

Rigid body volume feature [Depth Voxel Feature Mapper]

For rigid bodies we do not do anything special, since they have no deformations.

Non-Rigid body volume feature [SmplX Feature Mapper]

In the case of non-rigid bodies, however, the 3D geometry is deformed by the motion and gets mapped to different regions of the 3D voxel grid as the pose changes. We adapt the ideas from the Neural Body paper: we first estimate the 3D pose of the SMPL mesh from the images, then map the convolution features to the SMPL vertices, which are in turn mapped to the corresponding voxels.
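A minimal sketch of the per-vertex feature lookup, assuming posed SMPL vertices in world coordinates and ignoring occlusion; the voxelization of the vertex features then proceeds as in the rigid-body case.

```python
# Sketch: attach image features to SMPL(-X) vertices by projecting each posed vertex into a
# view and sampling the nearest feature pixel.
import numpy as np

def vertex_features(vertices, feat, K, world2cam):
    """vertices: Vx3 posed SMPL vertices (world), feat: HxWxC feature map, K: 3x3, world2cam: 4x4."""
    v_h = np.concatenate([vertices, np.ones((len(vertices), 1))], axis=-1)
    v_cam = (v_h @ world2cam.T)[:, :3]
    uvz = v_cam @ K.T
    uv = uvz[:, :2] / np.clip(uvz[:, 2:3], 1e-6, None)        # pixel coordinates
    h, w, _ = feat.shape
    u = np.clip(np.round(uv[:, 0]).astype(int), 0, w - 1)
    v = np.clip(np.round(uv[:, 1]).astype(int), 0, h - 1)
    visible = uvz[:, 2] > 0                                    # crude visibility test, no occlusion check
    return feat[v, u] * visible[:, None]                       # VxC per-vertex features
```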

Neural Voxel Renderer

The feature volume generated by the previous stage is taken as input by the Neural Voxel Renderer. The voxel features are passed through a sparse 3D UNet architecture to spread the features along the surfaces of the 3D objects in the scene. These features are then used as input to the ray-casting module, which uses an MLP as its head. What the MLP predicts determines the kind of renderer we learn.

Volumetric Prediction Head

If the MLP head predicts only transmittance and RGB values, it acts as a volumetric prediction head.
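A minimal sketch of the compositing performed with such a head, following the standard NeRF-style volume rendering equations.

```python
# Sketch: alpha-composite per-sample density and colour along one ray.
import torch

def composite_ray(sigma, rgb, deltas):
    """sigma: [S] densities, rgb: [S,3] colours, deltas: [S] distances between samples."""
    alpha = 1.0 - torch.exp(-sigma * deltas)                                  # per-sample opacity
    trans = torch.cumprod(torch.cat([torch.ones_like(alpha[:1]),
                                     1.0 - alpha + 1e-10]), dim=0)[:-1]       # accumulated transmittance
    weights = trans * alpha                                                   # contribution of each sample
    color = (weights[:, None] * rgb).sum(dim=0)
    return color, weights
```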

Surface Prediction Head

However, if it predicts surface normals along with the transmittance and RGB, it tries to approximate a smooth-surface prior, thereby acting as a surface prediction head.

Future Problems

This section discusses possible future problems that need to be addressed when certain data is not available. If the camera parameters cannot be estimated from the phone logs, how do we go about estimating them, and what issues does that raise? Similarly, if LiDAR depth information is not available, how would the complexity of the pipeline change?

Non-rigid body neural rendering:

Extract 2D features and map them to SMPL-X. We can adapt this to other non-rigid bodies such as animals, since they too have a mesh prior; like SMPL, there are meshes for babies and animals (Facebook has such a representation). We can add a clothing-related mesh prior, assuming clothes of a particular form (free-flowing, etc.), and then apply this idea [there are clothing priors, e.g. https://qianlim.github.io/SCALE, https://cape.is.tue.mpg.de/]. Also, when using a single video, first register the person in their A-pose or T-pose and then use this information to map the person as they move.

Camera Parameter Estimation:

The current estimation method using Colmap is time-consuming, and the time increases exponentially with the number of input images. In particular, the mapper stage dominates the runtime, as observed in the Stage 1 learnings.

Splitting Voxel Grid:

What are the different ways a voxel grid can be created? Can an (r, theta, phi) representation be used instead of Cartesian voxels? Like NeRF++, split nearby regions into voxels over the range 0-1 and handle farther regions over the range 1-inf. The splitting mechanism could be: num parts 10 => 1000 voxels, [1/r, 1/r^2, 1/r^3], or [10, 100, 1000]/r. Also, can an octree be tried in the 1/r region?
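A minimal sketch of one such split, with uniform bins inside the unit region and uniform inverse-depth bins outside; the bin counts are arbitrary choices for illustration.

```python
# Sketch: bin a sample by radius, uniform inside r <= 1 and uniform in 1/r beyond it.
import numpy as np

def radial_bin(r, n_near=10, n_far=10):
    r = np.asarray(r, dtype=np.float32)
    near = r <= 1.0
    idx = np.empty(r.shape, dtype=int)
    idx[near] = np.clip((r[near] * n_near).astype(int), 0, n_near - 1)
    inv = 1.0 / np.maximum(r[~near], 1.0)                        # maps (1, inf) -> (0, 1)
    idx[~near] = n_near + np.clip(((1.0 - inv) * n_far).astype(int), 0, n_far - 1)
    return idx   # 0..n_near-1 for the unit region, n_near..n_near+n_far-1 beyond it
```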

Bayesian Optimization:

Calculate inference from different camera views and choose the right set of images to learn from [Bayesian optimization for choosing the right subset of images]. We use Bayesian optimization to find areas that need more learning. The camera position (x, y, z plus an axis-angle representation) is the input (at a finer level, regions within the image could also be used). The error between predicted and actual values obtained during training could serve as the objective. Sampling random areas within an image, or using a downsampled image, could estimate the overall error within an inference budget, since inference is costly and we cannot evaluate all positions. As training progresses the old error values become stale and have to be re-evaluated. Could this be demonstrated with plain NeRF, assuming an infinite budget, by training on images selected from inference feedback evaluated at all positions? Then, if we make inference faster, would this improve?
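A minimal sketch of one possible formulation, using a Gaussian process over camera positions and a UCB-style acquisition; the candidate set, kernel, and acquisition rule are all assumptions to be experimented with.

```python
# Sketch: fit a GP over camera position -> observed reconstruction error and pick the
# candidate pose with the largest predicted error + uncertainty.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def pick_next_view(cam_positions, errors, candidates, kappa=1.0):
    """cam_positions: Nx3 (or Nx6 with axis-angle), errors: N per-view losses, candidates: Mx3."""
    gp = GaussianProcessRegressor(normalize_y=True).fit(cam_positions, errors)
    mean, std = gp.predict(candidates, return_std=True)
    score = mean + kappa * std                                   # UCB-style acquisition
    return candidates[np.argmax(score)]
```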

Nerf Heads:

PlenOctree

512x512 - NeRF-SH (PlenOctree) - octree representation, with a single set of values stored at every tree node. Since we learn spherical harmonics at each voxel, do we just need trilinear interpolation to render the value at a position, evaluating the color from the ray direction?
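A minimal sketch of evaluating stored spherical-harmonic color for a view direction, limited to degree 1 for brevity; PlenOctree stores higher-degree coefficients per octree leaf.

```python
# Sketch: evaluate degree-0/1 spherical-harmonic colour coefficients for a view direction d.
import numpy as np

SH_C0 = 0.28209479177387814      # Y_0^0
SH_C1 = 0.4886025119029199       # |Y_1^m| coefficient

def sh_color(k, d):
    """k: [3, 4] SH coefficients per RGB channel (degrees 0 and 1), d: unit view direction."""
    x, y, z = d
    basis = np.array([SH_C0, -SH_C1 * y, SH_C1 * z, -SH_C1 * x])
    return np.clip(k @ basis, 0.0, 1.0)     # RGB for this direction
```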

Baking NeRF [Sparse Neural Radiance Grid (SNeRG)]

A grid of N^3 voxels, with data stored as macroblocks of size B^3 for efficient access. Each voxel stores opacity, diffuse color (Cd), and specular features (Vs). Direction-dependent effects are evaluated with a single neural network call, which takes the features summed along the ray and produces a color. To understand: specular features in SNeRG, spherical harmonics and spherical Gaussians in PlenOctree??

NexMex

Uses a set of planes (plane sweep) that models the transparency and the basis coefficients k0, k1, ..., kN. It is a mixed implicit and explicit representation: although a neural network predicts the k0, k1, ... and alpha values, they are computed at fixed depths and stored in an array instead of querying the neural network directly at render time. Further, the neural network regularizes the k0, k1, and alpha values so that they do not overfit the data.

The nice part here is that they learn their own basis functions instead of using a standard basis such as Fourier, spherical harmonics, or spherical Gaussians.

MVSNerf:

Uses neighboring images to extract convolutional features: to generate a view for a particular camera position, it identifies three (M) nearby images, homographically warps them to the reference camera position, and builds a cost volume. All processing is done from the reference point of view (there is no global view of the object). It then uses the color information appended from the neighboring views to regress the density and view-dependent radiance.

Mechanisms for faster inference:

  • FastNeRF: caching techniques, i.e., precalculating partial outputs for the required positions and directions and using them to optimize rendering (see the sketch after this list)
  • Decoupling the view-dependent and view-independent parts of the network, precomputing the view-independent part, and evaluating only the view-dependent part at render time
  • Use SIREN for faster convergence
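A minimal sketch of the factorized decomposition mentioned in the first two bullets (FastNeRF-style); pos_net and dir_net are hypothetical networks, and in practice their outputs would be baked into dense caches.

```python
# Sketch: a position network emits density and D RGB basis vectors, a direction network
# emits D weights, and the final colour is their inner product, so both can be cached.
import torch

def factored_color(pos_net, dir_net, p, d):
    """pos_net(p) -> (sigma [N], basis [N, D, 3]); dir_net(d) -> weights [N, D]."""
    sigma, basis = pos_net(p)            # cacheable on a dense position grid
    beta = dir_net(d)                    # cacheable on a direction grid
    rgb = torch.einsum('nd,ndc->nc', beta, basis)
    return sigma, torch.sigmoid(rgb)
```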

Non-visible voxels identification:

We render alpha maps for all the training views using this voxel grid, keeping track of the maximum ray weight 1 − exp(−σ_i δ_i) at each voxel. Compared to naively thresholding by σ at each point, this method eliminates non-visible voxels.

A module that evaluates the whole scene voxel by voxel and identifies voxels that have not been seen by any of the captured camera images. This gives feedback on which angles to capture from; GANs or interpolation could try to fill these areas [GAN or super-resolution approaches for predicting intermediate unknown regions].
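A minimal sketch of the visibility bookkeeping described above; voxel indices and weights per ray sample are assumed to be available from the renderer.

```python
# Sketch: track the maximum ray weight each voxel receives over all training rays and keep
# only voxels whose maximum weight exceeds a threshold.
import numpy as np

def visible_voxel_mask(ray_voxel_idx, ray_weights, grid_res=128, threshold=1e-2):
    """ray_voxel_idx: Nx3 voxel indices hit by ray samples,
    ray_weights: N weights of the form 1 - exp(-sigma_i * delta_i)."""
    max_w = np.zeros((grid_res,) * 3, dtype=np.float32)
    np.maximum.at(max_w, tuple(ray_voxel_idx.T), ray_weights)
    return max_w > threshold     # False entries are never-visible voxels that can be pruned
```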

References