# Depth Pro Estimator Block

A custom Modular Diffusers block for monocular depth estimation using Apple's Depth Pro model. Supports both images and videos.
## Features
- Metric depth estimation in real-world meters using Depth Pro
- Image and video input support
- Grayscale or turbo colormap visualization
- Inverse depth normalization (following Apple's reference implementation) for robust handling of outdoor/sky scenes
## Installation

```bash
# Using uv
uv sync

# Using pip
pip install -r requirements.txt
```
## Quick Start

### Load the block

```python
import torch
from diffusers import ModularPipelineBlocks

blocks = ModularPipelineBlocks.from_pretrained(
    "your-username/depth-pro-estimator",  # or a local path such as "."
    trust_remote_code=True,
)
pipeline = blocks.init_pipeline()
pipeline.load_components(torch_dtype=torch.float16)
pipeline.to("cuda")
```
### Single image - grayscale depth

```python
from PIL import Image

image = Image.open("photo.jpg")
output = pipeline(image=image)

# Save the depth map
output.depth_image.save("photo_depth.png")

# Access the raw metric depth tensor (in meters)
print(output.predicted_depth.shape)  # (H, W)
print(output.field_of_view)          # estimated FOV
print(output.focal_length)           # estimated focal length
```
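Because `predicted_depth` is metric, pixel values can be read directly as distances in meters. A minimal sketch (the tensor below is a synthetic stand-in; a real run would index `output.predicted_depth` instead):

```python
import torch

# Synthetic stand-in for output.predicted_depth:
# metric depth in meters, shape (H, W)
depth = torch.full((480, 640), 3.5)

h, w = depth.shape
center_distance_m = depth[h // 2, w // 2].item()
print(f"Distance at image center: {center_distance_m:.2f} m")
```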
### Single image - turbo colormap

```python
output = pipeline(image=image, colormap="turbo")
output.depth_image.save("photo_depth_turbo.png")
```
### Video - grayscale depth

```python
from block import save_video

output = pipeline(video_path="input.mp4", colormap="grayscale")
save_video(output.depth_frames, output.fps, "output_depth.mp4")
```
### Video - turbo colormap

```python
output = pipeline(video_path="input.mp4", colormap="turbo")
save_video(output.depth_frames, output.fps, "output_depth_turbo.mp4")
```
## Inputs

| Parameter | Type | Default | Description |
|---|---|---|---|
| `image` | `PIL.Image` | - | Image to estimate depth for |
| `video_path` | `str` | - | Path to the input video. When provided, `image` is ignored |
| `colormap` | `str` | `"grayscale"` | `"grayscale"` or `"turbo"` (colormapped) |
## Outputs

### Image mode

| Output | Type | Description |
|---|---|---|
| `depth_image` | `PIL.Image` | Normalized depth visualization |
| `predicted_depth` | `torch.Tensor` | Raw metric depth in meters (H × W) |
| `field_of_view` | `float` | Estimated horizontal FOV |
| `focal_length` | `float` | Estimated focal length |
### Video mode

| Output | Type | Description |
|---|---|---|
| `depth_frames` | `List[PIL.Image]` | Per-frame depth visualizations |
| `fps` | `float` | Source video frame rate |
## Depth Normalization
Depth visualization uses inverse depth clipped to [0.1m, 250m], following Apple's reference implementation. This prevents sky/infinity values (clamped at 10,000m by the model) from crushing near-field detail into a binary mask.
- Bright = close, dark = far (grayscale)
- Warm (red/yellow) = close, cool (blue) = far (turbo)
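The normalization above can be sketched as follows. This is a minimal illustration of inverse-depth normalization with the stated clip range, not the block's actual implementation; the function name and constants are assumptions:

```python
import numpy as np

# Clip range from the description above (assumed constants)
NEAR_M, FAR_M = 0.1, 250.0

def normalize_inverse_depth(depth_m: np.ndarray) -> np.ndarray:
    """Map metric depth (meters) to [0, 1]: bright = close, dark = far.

    Working in inverse depth keeps near-field detail even when the model
    emits very large values (e.g. 10,000 m for sky), which would otherwise
    dominate a linear normalization.
    """
    inv = 1.0 / np.clip(depth_m, NEAR_M, FAR_M)
    inv_min, inv_max = 1.0 / FAR_M, 1.0 / NEAR_M
    return (inv - inv_min) / (inv_max - inv_min)
```

With this mapping, a sky pixel clamped at 10,000 m normalizes to the same value as 250 m (0.0), while nearby structure keeps its contrast.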