Depth Pro Estimator Block

A custom Modular Diffusers block for monocular depth estimation using Apple's Depth Pro model. Supports both images and videos.

Features

  • Metric depth estimation in real-world meters using Depth Pro
  • Image and video input support
  • Grayscale or turbo colormap visualization
  • Inverse depth normalization (following Apple's reference implementation) for robust handling of outdoor/sky scenes

Installation

# Using uv
uv sync

# Using pip
pip install -r requirements.txt

Quick Start

Load the block

from diffusers import ModularPipelineBlocks
import torch

blocks = ModularPipelineBlocks.from_pretrained(
    "your-username/depth-pro-estimator",  # or local path "."
    trust_remote_code=True,
)
pipeline = blocks.init_pipeline()
pipeline.load_components(torch_dtype=torch.float16)
pipeline.to("cuda")

Single image - grayscale depth

from PIL import Image

image = Image.open("photo.jpg")
output = pipeline(image=image)

# Save depth map
output.depth_image.save("photo_depth.png")

# Access raw metric depth tensor (in meters)
print(output.predicted_depth.shape)  # (H, W)
print(output.field_of_view)          # estimated FOV
print(output.focal_length)           # estimated focal length

Single image - turbo colormap

output = pipeline(image=image, colormap="turbo")
output.depth_image.save("photo_depth_turbo.png")

Video - grayscale depth

from block import save_video

output = pipeline(video_path="input.mp4", colormap="grayscale")
save_video(output.depth_frames, output.fps, "output_depth.mp4")

Video - turbo colormap

output = pipeline(video_path="input.mp4", colormap="turbo")
save_video(output.depth_frames, output.fps, "output_depth_turbo.mp4")

Inputs

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `image` | `PIL.Image` | - | Image to estimate depth for |
| `video_path` | `str` | - | Path to input video. When provided, `image` is ignored |
| `colormap` | `str` | `"grayscale"` | `"grayscale"` or `"turbo"` (colormapped) |

Outputs

Image mode

| Output | Type | Description |
| --- | --- | --- |
| `depth_image` | `PIL.Image` | Normalized depth visualization |
| `predicted_depth` | `torch.Tensor` | Raw metric depth in meters (H x W) |
| `field_of_view` | `float` | Estimated horizontal FOV |
| `focal_length` | `float` | Estimated focal length |
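Because `predicted_depth` is metric and a focal length estimate is returned, the depth map can be lifted to a 3D point cloud. The sketch below is illustrative, not part of the block's API: it assumes a pinhole camera with the principal point at the image center, a focal length in pixels, and uses `depth`/`focal_length` as stand-ins for `output.predicted_depth` (as a NumPy array) and `output.focal_length`.

```python
import numpy as np

def depth_to_points(depth: np.ndarray, focal_length: float) -> np.ndarray:
    """Back-project a metric depth map to (H, W, 3) points in meters.

    Hypothetical helper: assumes a pinhole camera with the principal
    point at the image center and focal_length in pixels.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - w / 2) * depth / focal_length
    y = (v - h / 2) * depth / focal_length
    return np.stack([x, y, depth], axis=-1)

# Toy example: a flat 4x6 depth map at 2 m with an assumed 500 px focal length
points = depth_to_points(np.full((4, 6), 2.0), focal_length=500.0)
```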

Video mode

| Output | Type | Description |
| --- | --- | --- |
| `depth_frames` | `List[PIL.Image]` | Per-frame depth visualizations |
| `fps` | `float` | Source video frame rate |

Depth Normalization

Depth visualization uses inverse depth clipped to [0.1m, 250m], following Apple's reference implementation. This prevents sky/infinity values (clamped at 10,000m by the model) from crushing near-field detail into a binary mask.

  • Bright = close, dark = far (grayscale)
  • Warm (red/yellow) = close, cool (blue) = far (turbo)
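The normalization above can be sketched in a few lines. This is a minimal illustration of the scheme described, not the block's internal code: the clip range [0.1 m, 250 m] comes from the description, while the function name and the uint8 conversion are assumptions.

```python
import numpy as np

def normalize_inverse_depth(depth: np.ndarray) -> np.ndarray:
    """Map metric depth (meters) to an 8-bit grayscale visualization.

    Illustrative sketch: clip to [0.1 m, 250 m], invert so near
    objects get large values, then rescale to [0, 255].
    """
    inv = 1.0 / np.clip(depth, 0.1, 250.0)          # near -> large, far -> small
    inv = (inv - inv.min()) / max(np.ptp(inv), 1e-8)  # rescale to [0, 1]
    return (inv * 255).astype(np.uint8)              # bright = close, dark = far

# A sky pixel clamped at 10,000 m by the model lands at the clip ceiling
# instead of stretching the value range and crushing near-field detail:
vis = normalize_inverse_depth(np.array([[0.5, 5.0, 10_000.0]]))
```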