Depth Pro Estimator Block

A custom Modular Diffusers block for monocular depth estimation using Apple's Depth Pro model. Supports both images and videos.

Features

  • Metric depth estimation in real-world meters using Depth Pro
  • Image and video input support
  • Grayscale or turbo colormap visualization
  • Inverse depth normalization (following Apple's reference implementation) for robust handling of outdoor/sky scenes

Installation

# Using uv
uv sync

# Using pip
pip install -r requirements.txt

Quick Start

Load the block

from diffusers import ModularPipelineBlocks
import torch

blocks = ModularPipelineBlocks.from_pretrained(
    "your-username/depth-pro-estimator",  # or local path "."
    trust_remote_code=True,
)
pipeline = blocks.init_pipeline()
pipeline.load_components(torch_dtype=torch.float16)
pipeline.to("cuda")

Single image - grayscale depth

from PIL import Image

image = Image.open("photo.jpg")
output = pipeline(image=image)

# Save depth map
output.depth_image.save("photo_depth.png")

# Access raw metric depth tensor (in meters)
print(output.predicted_depth.shape)  # (H, W)
print(output.field_of_view)          # estimated FOV
print(output.focal_length)           # estimated focal length

Single image - turbo colormap

output = pipeline(image=image, colormap="turbo")
output.depth_image.save("photo_depth_turbo.png")

Video - grayscale depth

from block import save_video

output = pipeline(video_path="input.mp4", colormap="grayscale")
save_video(output.depth_frames, output.fps, "output_depth.mp4")

Video - turbo colormap

output = pipeline(video_path="input.mp4", colormap="turbo")
save_video(output.depth_frames, output.fps, "output_depth_turbo.mp4")

Inputs

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `image` | `PIL.Image` | - | Image to estimate depth for |
| `video_path` | `str` | - | Path to input video. When provided, `image` is ignored |
| `colormap` | `str` | `"grayscale"` | `"grayscale"` or `"turbo"` (colormapped) |

Outputs

Image mode

| Output | Type | Description |
| --- | --- | --- |
| `depth_image` | `PIL.Image` | Normalized depth visualization |
| `predicted_depth` | `torch.Tensor` | Raw metric depth in meters (H x W) |
| `field_of_view` | `float` | Estimated horizontal FOV |
| `focal_length` | `float` | Estimated focal length |
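Because `predicted_depth` is metric and a focal length estimate is returned, the depth map can be lifted to a 3D point cloud. The sketch below is illustrative, not part of the block's API: it assumes a pinhole camera with the principal point at the image center, a focal length in pixels, and uses `depth`/`focal_length` as stand-ins for `output.predicted_depth` (as a NumPy array) and `output.focal_length`.

```python
import numpy as np

def depth_to_points(depth: np.ndarray, focal_length: float) -> np.ndarray:
    """Back-project a metric depth map to (H, W, 3) points in meters.

    Hypothetical helper: assumes a pinhole camera with the principal
    point at the image center and focal_length in pixels.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - w / 2) * depth / focal_length
    y = (v - h / 2) * depth / focal_length
    return np.stack([x, y, depth], axis=-1)

# Toy example: a flat 4x6 depth map at 2 m with an assumed 500 px focal length
points = depth_to_points(np.full((4, 6), 2.0), focal_length=500.0)
```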

Video mode

| Output | Type | Description |
| --- | --- | --- |
| `depth_frames` | `List[PIL.Image]` | Per-frame depth visualizations |
| `fps` | `float` | Source video frame rate |

Depth Normalization

Depth visualization uses inverse depth clipped to [0.1m, 250m], following Apple's reference implementation. This prevents sky/infinity values (clamped at 10,000m by the model) from crushing near-field detail into a binary mask.

  • Bright = close, dark = far (grayscale)
  • Warm (red/yellow) = close, cool (blue) = far (turbo)
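The normalization above can be sketched in a few lines. This is a minimal illustration of the scheme described, not the block's internal code: the clip range [0.1 m, 250 m] comes from the description, while the function name and the uint8 conversion are assumptions.

```python
import numpy as np

def normalize_inverse_depth(depth: np.ndarray) -> np.ndarray:
    """Map metric depth (meters) to an 8-bit grayscale visualization.

    Illustrative sketch: clip to [0.1 m, 250 m], invert so near
    objects get large values, then rescale to [0, 255].
    """
    inv = 1.0 / np.clip(depth, 0.1, 250.0)          # near -> large, far -> small
    inv = (inv - inv.min()) / max(np.ptp(inv), 1e-8)  # rescale to [0, 1]
    return (inv * 255).astype(np.uint8)              # bright = close, dark = far

# A sky pixel clamped at 10,000 m by the model lands at the clip ceiling
# instead of stretching the value range and crushing near-field detail:
vis = normalize_inverse_depth(np.array([[0.5, 5.0, 10_000.0]]))
```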