TAPVid-3D: A Benchmark for Tracking Any Point in 3D

Google DeepMind, University College London, University of Oxford

The TAPVid-3D dataset: real-world videos annotated with metric 3D point trajectories.

What is the dataset?

TAPVid-3D is a dataset and benchmark for evaluating the task of long-range Tracking Any Point in 3D (TAP-3D). The dataset consists of 4,000+ real-world videos and 2.1 million metric 3D point trajectories, spanning a variety of object types, motion patterns, and indoor and outdoor environments.

While point tracking in two dimensions (TAP-2D) has many benchmarks measuring performance on real-world videos, such as TAPVid-DAVIS [2], comparable benchmarks for three-dimensional point tracking on real-world videos have been lacking. To fill this gap, we built a new benchmark for 3D point tracking that leverages existing footage.

To measure performance on the TAP-3D task, we formulated a Jaccard-based metric to handle the complexities of ambiguous depth scales across models, occlusions, and multi-track spatio-temporal smoothness.
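
As a rough illustration (a minimal sketch, not the official evaluation code), a single-threshold 3D Jaccard score could be computed as below. The median-based depth rescaling shown here is one plausible way to handle depth-scale ambiguity, not necessarily the paper's exact protocol.

```python
import numpy as np

def jaccard_3d(gt_xyz, gt_vis, pred_xyz, pred_vis, thresh):
    """Single-threshold 3D Jaccard sketch.

    gt_xyz, pred_xyz: (tracks, frames, 3) metric camera-space points.
    gt_vis, pred_vis: (tracks, frames) boolean visibility flags.
    thresh: distance threshold in meters.
    """
    # Resolve per-model depth-scale ambiguity with a global median rescaling
    # (illustrative; the paper defines the exact rescaling protocol).
    scale = np.median(gt_xyz[..., 2][gt_vis]) / np.median(pred_xyz[..., 2][pred_vis])
    close = np.linalg.norm(pred_xyz * scale - gt_xyz, axis=-1) < thresh

    tp = (gt_vis & pred_vis & close).sum()     # visible and correctly localized
    fp = (pred_vis & ~(gt_vis & close)).sum()  # predicted visible, but occluded or far off
    fn = (gt_vis & ~(pred_vis & close)).sum()  # visible in GT, but missed by the prediction
    return tp / (tp + fp + fn)
```

The reported metrics aggregate quantities like this over several depth-scaled distance thresholds and over tracks; see the paper for the exact definitions.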

In the paper, we assess the current state of the TAP-3D task by constructing competitive baselines using existing tracking models, such as SpatialTracker. You can read more and find out how to download and generate the data using the GitHub link above. We hope you'll find the benchmark useful!
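
To give a feel for the released format, below is a minimal loading sketch. The filename is hypothetical, and the .npz field names (images_jpeg_bytes, tracks_XYZ, visibility, fx_fy_cx_cy) are taken from the repository documentation but should be double-checked against the README.

```python
import io

import numpy as np
from PIL import Image

# Hypothetical filename; released clips are organized per data source
# (Aria Digital Twin, Waymo Open, Panoptic Studio).
clip = np.load("tapvid3d_example.npz", allow_pickle=True)

# Decode the per-frame JPEG bytes into RGB arrays.
frames = [np.asarray(Image.open(io.BytesIO(b))) for b in clip["images_jpeg_bytes"]]

tracks_xyz = clip["tracks_XYZ"]   # metric 3D trajectories in camera coordinates
visibility = clip["visibility"]   # per-frame, per-track visibility flags
intrinsics = clip["fx_fy_cx_cy"]  # pinhole intrinsics, for reprojecting points to pixels
```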


Statistics Overview

| #clips | #trajs per clip | #frames per clip | #videos | #scenes | resolution | fps |
| --- | --- | --- | --- | --- | --- | --- |
| 4569 | 50–1024 | 25–300 | 2828 | 255 | Multiple | 10 / 30 |

Licensing

The annotations and code to generate TAPVid-3D are released under a slightly modified Apache 2.0 license, as described in the LICENSE file on GitHub. In particular, to use the code and annotations for a given data subset (Waymo Open, Aria Digital Twin, or Panoptic Studio), you must agree to and adhere to the license and terms of use of the corresponding source data and annotations.

Related Links

1. Kubric: A scalable dataset generator is a data generation pipeline for creating semi-realistic synthetic multi-object videos with rich annotations such as instance segmentation masks, depth maps, and optical flow.

2. TAP-Vid: A Benchmark for Tracking Any Point in a Video builds an evaluation dataset with 2D points tracked across real videos.

3. Tracking Everything Everywhere All at Once presents a test-time optimization method for estimating dense and long-range motion from a video sequence.

4. PointOdyssey: A Large-Scale Synthetic Dataset for Long-Term Point Tracking proposes a large-scale synthetic dataset and data generation framework for 3D dynamic scenes.

5. SpatialTracker: Tracking Any 2D Pixels in 3D Space estimates point trajectories in 3D space and evaluates both 2D and 3D tracking on synthetic videos.