---
tags:
- neural-architecture-search
- evolutionary-computation
- computer-vision
- depth-estimation
- object-detection
- semantic-segmentation
- 3d-gaussian-splatting
- mamba
- vision-transformer
- multi-objective-optimization
datasets:
- imagenet-1k
- detection-datasets/coco
- scene_parse_150
- kitti
- nyu_depth_v2
- RealEstate10K
metrics:
- mAP
- miou
- abs_rel
- psnr
- ssim
pipeline_tag: depth-estimation
library_name: pytorch
---
# EvoNAS: Dual-Domain Representation Alignment for Geometry-Aware Architecture Search

## Overview
EvoNAS is a multi-objective evolutionary neural architecture search framework that discovers Pareto-optimal vision backbones bridging 2D dense prediction and 3D rendering. It features:
- **Hybrid VSS-ViT Search Space**: Combines Vision State Space (Mamba) blocks with Vision Transformers
- **CA-DDKD**: Cross-Architecture Dual-Domain Knowledge Distillation via DCT constraints
- **DMMPE**: Hardware-isolated distributed evaluation engine for unbiased latency measurement
- **Progressive Supernet Training (PST)**: Curriculum-based weight-sharing optimization
The discovered EvoNets achieve state-of-the-art accuracy-efficiency trade-offs across object detection, semantic segmentation, monocular depth estimation, and novel view synthesis.
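The multi-objective search behind these trade-offs can be pictured as a Pareto-archive evolutionary loop. The snippet below is a minimal, self-contained illustration only, not the EvoNAS implementation: the integer genotype, the proxy error/latency objectives, and the single-point mutation operator are all hypothetical stand-ins.

```python
import random

def dominates(f_a, f_b):
    """True if objective vector f_a Pareto-dominates f_b (both minimized)."""
    return all(a <= b for a, b in zip(f_a, f_b)) and any(a < b for a, b in zip(f_a, f_b))

def evaluate(genotype):
    """Hypothetical proxies: more capacity -> lower error but higher latency."""
    capacity = sum(genotype)
    error = 1.0 / (1.0 + capacity)          # stands in for task error
    latency = sum(g * g for g in genotype)  # stands in for measured latency
    return (error, latency)

def mutate(genotype, choices=(0, 1, 2, 3)):
    """Resample one block's operator choice."""
    g = list(genotype)
    g[random.randrange(len(g))] = random.choice(choices)
    return tuple(g)

def pareto_search(n_blocks=4, pop_size=16, generations=30, seed=0):
    random.seed(seed)
    population = [tuple(random.choice((0, 1, 2, 3)) for _ in range(n_blocks))
                  for _ in range(pop_size)]
    archive = {}
    for _ in range(generations):
        # Evaluate parents and mutated offspring, cache by genotype.
        for g in population + [mutate(p) for p in population]:
            archive.setdefault(g, evaluate(g))
        # Keep only non-dominated genotypes (the current Pareto front).
        front = [g for g, f in archive.items()
                 if not any(dominates(f2, f)
                            for g2, f2 in archive.items() if g2 != g)]
        archive = {g: archive[g] for g in front}
        population = [mutate(random.choice(front)) for _ in range(pop_size)]
    return sorted(archive.items(), key=lambda kv: kv[1][1])

front = pareto_search()
```

The returned archive is the analogue of the EvoNet-x1/x2/x3 families below: a set of architectures none of which is strictly better than another on both objectives.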
## Model Zoo

### Searched Architectures (EvoNets)

#### Object Detection on COCO (Mask R-CNN)

| Model | Params | MACs | AP<sup>b</sup> | Latency | Throughput | NID | Weight |
|---|---|---|---|---|---|---|---|
| EvoNet-C1 | 33M | 190G | 45.4 | 50.2 ms | 26 FPS | 1.39 | Download |
| EvoNet-C2 | 36M | 202G | 47.1 | 55.4 ms | 23 FPS | 1.29 | Download |
| EvoNet-C3 | 42M | 228G | 48.5 | 66.9 ms | 18 FPS | 1.15 | Download |
#### Semantic Segmentation on ADE20K (UPerNet)

| Model | Params | MACs | mIoU | Latency | Throughput | NID | Weight |
|---|---|---|---|---|---|---|---|
| EvoNet-A1 | 23M | 711G | 44.1 | 77.3 ms | 14 FPS | 1.93 | Download |
| EvoNet-A2 | 26M | 724G | 47.3 | 81.0 ms | 13 FPS | 1.79 | Download |
| EvoNet-A3 | 32M | 754G | 49.7 | 94.8 ms | 12 FPS | 1.57 | Download |
#### Monocular Depth Estimation on KITTI

| Model | Params | MACs | Abs Rel ↓ | δ₁ ↑ | Latency | Throughput | NID | Weight |
|---|---|---|---|---|---|---|---|---|
| EvoNet-K1 | 18.0M | 27.3G | 0.060 | 0.960 | 18.6 ms | 117 FPS | 5.34 | Download |
| EvoNet-K2 | 22.6M | 36.2G | 0.056 | 0.966 | 24.6 ms | 83 FPS | 4.28 | Download |
| EvoNet-K3 | 26.3M | 45.0G | 0.054 | 0.969 | 28.0 ms | 65 FPS | 3.68 | Download |
#### Monocular Depth Estimation on NYU Depth v2

| Model | Params | MACs | Abs Rel ↓ | δ₁ ↑ | Latency | Throughput | NID | Weight |
|---|---|---|---|---|---|---|---|---|
| EvoNet-N1 | 19.1M | 21.7G | 0.095 | 0.912 | 21.8 ms | 138 FPS | 4.77 | Download |
| EvoNet-N2 | 24.1M | 27.1G | 0.089 | 0.926 | 25.9 ms | 107 FPS | 3.85 | Download |
| EvoNet-N3 | 30.3M | 33.9G | 0.085 | 0.932 | 30.8 ms | 88 FPS | 3.08 | Download |
#### Novel View Synthesis on RealEstate10K (3DGS)

| Model | Params | PSNR ↑ | SSIM ↑ | LPIPS ↓ | Latency | Throughput | Weight |
|---|---|---|---|---|---|---|---|
| EvoNet-D | 44M | 26.41 | 0.871 | 0.127 | 88 ms | 27 FPS | Download |
### Supernet Checkpoints

| Checkpoint | Description | Weight |
|---|---|---|
| `supernet_imagenet_1k` | Stage 1: ImageNet-1K pretrained VSS-ViT supernet | Download |
| `supernet_nyu` | Stage 2: Fine-tuned on NYU Depth v2 with CA-DDKD | Download |
| `supernet_kitti` | Stage 2: Fine-tuned on KITTI with CA-DDKD | Download |
| `supernet_ade20k` | Stage 2: Fine-tuned on ADE20K with CA-DDKD | Download |
| `supernet_coco` | Stage 2: Fine-tuned on COCO with CA-DDKD | Download |
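The Stage 1 / Stage 2 split above reflects Progressive Supernet Training, which widens the per-block choice space as training advances. A minimal sketch of curriculum-gated subnet sampling follows; the stage schedule and the choice values are illustrative assumptions, not the actual EvoNAS configuration.

```python
import random

# Hypothetical curriculum: each stage unlocks a larger per-block choice space.
PST_SCHEDULE = {
    0: [2],        # warm-up: a single mid-sized option per block
    1: [1, 2],     # add a smaller option
    2: [1, 2, 3],  # full choice space
}

def sample_subnet(num_blocks, stage):
    """Sample one weight-sharing subnet from the choices unlocked so far."""
    choices = PST_SCHEDULE[min(stage, max(PST_SCHEDULE))]
    return [random.choice(choices) for _ in range(num_blocks)]

random.seed(0)
warmup_net = sample_subnet(num_blocks=8, stage=0)  # deterministic: all blocks = 2
final_net = sample_subnet(num_blocks=8, stage=2)   # drawn from the full space
```

Each sampled subnet is trained with shared supernet weights; restricting early stages stabilizes the shared weights before the full space is explored.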
### Teacher Models (Depth Anything)

| Checkpoint | Description | Weight |
|---|---|---|
| `nyu_depth_anything` | Depth Anything metric indoor teacher | Download |
| `kitti_depth_anything` | Depth Anything metric outdoor teacher | Download |
| `ade20k_vitl` | ViT-L teacher for ADE20K segmentation | Download |
| `coco_dinov2` | DINOv2 teacher for COCO detection | Download |
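These teachers drive CA-DDKD, which constrains student and teacher features in both the spatial and the DCT frequency domain. The sketch below shows one way such a dual-domain loss can look on 1-D feature vectors; the naive DCT-II, the low-band cutoff, and the equal weighting of the two terms are illustrative assumptions, not the paper's exact formulation.

```python
import math

def dct2(x):
    """Naive DCT-II of a 1-D sequence (normalization constants omitted)."""
    n = len(x)
    return [sum(x[i] * math.cos(math.pi / n * (i + 0.5) * k) for i in range(n))
            for k in range(n)]

def dual_domain_loss(student_feat, teacher_feat, low_bands=4):
    """Spatial MSE plus an MSE over the lowest DCT frequency bands."""
    # Spatial-domain term: plain MSE between feature vectors.
    spatial = sum((s - t) ** 2
                  for s, t in zip(student_feat, teacher_feat)) / len(student_feat)
    # Frequency-domain term: match only the low-frequency DCT coefficients,
    # which carry the coarse geometric structure of the features.
    s_freq, t_freq = dct2(student_feat), dct2(teacher_feat)
    freq = sum((s - t) ** 2
               for s, t in zip(s_freq[:low_bands], t_freq[:low_bands])) / low_bands
    return spatial + freq
```

A loss of this shape lets a teacher with a different architecture supervise the student without requiring feature maps to match element-wise at high frequencies.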
## Quick Start

Download individual checkpoints or the full repository from the Hugging Face Hub:

```python
from huggingface_hub import hf_hub_download, snapshot_download

# Download a single searched checkpoint
ckpt_path = hf_hub_download(
    repo_id="YOUR_USERNAME/EvoNAS",
    filename="EvoNAS/evonet_n3_best_abs_rel_0.08475",
)

# Download the ImageNet-1K pretrained supernet
supernet_path = hf_hub_download(
    repo_id="YOUR_USERNAME/EvoNAS",
    filename="supernet_imagenet_1k.pth",
)

# Alternatively, fetch every checkpoint at once
snapshot_download(
    repo_id="YOUR_USERNAME/EvoNAS",
    local_dir="./evonas_checkpoints",
)
```
## Usage
Please refer to our GitHub repository for full training, search, and evaluation instructions.
### Inference Example (Monocular Depth Estimation)

```python
import torch

from networks.EvoMambaDepthNet import EvoMambaDepthNet

# Genotype of the searched EvoNet-N3 architecture (per-block choices;
# see the GitHub repository for the full values)
evonet_n3_genotype = {
    "d_state": [...],
    "ssm_expand": [...],
    "mlp_ratio": [...],
    "depth": [...],
}

model = EvoMambaDepthNet(genotype=evonet_n3_genotype)

# Load the searched checkpoint (downloaded as shown in Quick Start)
checkpoint = torch.load("evonet_n3_best_abs_rel_0.08475", map_location="cpu")
model.load_state_dict(checkpoint["model"])
model.eval()

with torch.no_grad():
    # image_tensor: a preprocessed input batch, e.g. shape (1, 3, H, W)
    depth = model(image_tensor)
```
## File Structure

```
.
├── EvoNAS/                      # Searched EvoNet checkpoints
│   ├── evonet_c{1,2,3}_*        # COCO object detection
│   ├── evonet_a{1,2,3}_*        # ADE20K semantic segmentation
│   ├── evonet_k{1,2,3}_*        # KITTI depth estimation
│   ├── evonet_n{1,2,3}_*        # NYU v2 depth estimation
│   └── logs/                    # Training logs
├── NVS/                         # Novel view synthesis checkpoint
│   └── epoch_9-step_150000.ckpt
├── SuperNet_FT/                 # Fine-tuned supernet checkpoints
│   ├── supernet_ade20k.pth
│   ├── supernet_coco.pth
│   ├── supernet_kitti
│   └── supernet_nyu
├── pre_DA/                      # Teacher model checkpoints
│   ├── ade20k_vitl_mIoU_59.4.pth
│   ├── coco_dinov2_epoch_12.pth
│   ├── kitti_depth_anything_metric_depth_outdoor.pt
│   └── nyu_depth_anything_metric_depth_indoor.pt
└── supernet_imagenet_1k.pth     # ImageNet-1K pretrained supernet
```
## Citation

```bibtex
@article{zhang2025evonas,
  title={Dual-Domain Representation Alignment: Bridging 2D and 3D Vision via Geometry-Aware Architecture Search},
  author={Zhang, Haoyu and Yu, Zhihao and Wang, Rui and Jin, Yaochu and Liu, Qiqi and Cheng, Ran},
  journal={arXiv preprint arXiv:2603.19563},
  year={2025}
}
```
## Acknowledgements
We thank the open-source community behind PyTorch, Mamba SSM, Spatial-Mamba, MMDetection, MMSegmentation, Depth Anything, pymoo, and timm.