Thank you!
Ideally yes, but I'm unsure when I'll be able to get to it.
iliass ayaou
datalyes
AI & ML interests
information retrieval, patent retrieval, knowledge management, data engineering and architecture, NLP
Recent Activity
new activity about 7 hours ago
datalyes/DAPFAM_patent: How exactly are the IN and OUT values calculated?
replied to their post about 7 hours ago
posted an update 2 months ago
Post
I am happy to share that three models from PatenTEB are now publicly available: patembed-large, patembed-base, and patembed-base_long_4096.
The models can be found here:
https://huggingface.co/collections/datalyes/patembed-models-collection
Feedback, issues, and use cases are very welcome.
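For anyone who wants to try the models, here is a minimal sketch using Sentence Transformers. The repo id `datalyes/patembed-base` is an assumption based on the collection name, and the two example texts are invented for illustration:

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two 1-D vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def embed_and_compare(texts, model_name="datalyes/patembed-base"):
    """Encode a pair of patent texts and return their cosine similarity.
    Requires the sentence-transformers package and a network connection;
    the model_name default is an assumed repo id."""
    from sentence_transformers import SentenceTransformer
    model = SentenceTransformer(model_name)
    emb = model.encode(texts)
    return cosine_sim(emb[0], emb[1])

# Example (downloads the model on first use):
# embed_and_compare([
#     "A lithium-ion battery electrode comprising a silicon-carbon composite.",
#     "An anode material with silicon particles embedded in graphite.",
# ])
```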
reacted to nouamanetazi's post 5 months ago
Post
After training SmolLM3 on 384 H100s for nearly a month, I've come to realize something most people overlook: infrastructure is the make-or-break factor in LLM training. 🔥
Everyone talks about model architecture and data quality. And yes, those matter immensely. But here's what nobody tells you: when your training run fails at 2 AM because of mysterious NCCL errors, or when your expensive GPU cluster is running at 60% efficiency, the problem isn't your model. It's most probably a misuse of the hardware. 🛠️
Questions that seemed simple but had no clear answers: Why is MoE training slower than dense models? Which NCCL flags should we actually set? How often should we checkpoint without killing throughput?
That's why we built The Smol Training Playbook 📖: a complete guide covering everything from model architecture and data curation to the SmolLM3 training marathon, post-training techniques, and crucially, the infrastructure layer that most teams get wrong.
We validated real vs theoretical bandwidth across the entire stack: HBM3 hitting 3 TB/s, NVLink 4.0 reaching 786 GB/s, EFA at 42.5 GB/s. Then we ran collective operations across 128 GPUs (16 nodes, 8xH100s each) and measured how performance degrades at scale: all-reduce drops from 480 GB/s on a single node to 320-350 GB/s across 16 nodes.
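Figures like these can be sanity-checked with a small helper. This sketch follows the bus-bandwidth convention that nccl-tests documents for ring all-reduce (busbw = algbw x 2(n-1)/n, with algbw = size / time); the 1 GiB / 5 ms / 8-GPU example values are illustrative, not measurements from the post:

```python
def allreduce_busbw_gbps(size_bytes, seconds, n_ranks):
    """Bus bandwidth (GB/s) for a ring all-reduce, using the formula that
    nccl-tests reports: busbw = algbw * 2*(n-1)/n, algbw = size / time."""
    algbw = size_bytes / seconds                     # bytes per second
    return algbw * 2 * (n_ranks - 1) / n_ranks / 1e9

# Illustrative numbers (not from the post): a 1 GiB all-reduce across
# 8 GPUs completing in 5 ms sustains about 375.8 GB/s of bus bandwidth.
print(round(allreduce_busbw_gbps(1 << 30, 5e-3, 8), 1))  # 375.8
```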
If you've ever wondered why your training runs are slower than they should be, or you're planning to scale up and want to avoid expensive mistakes, this guide might save you weeks of debugging.
The Smol Training Playbook: https://lnkd.in/e5MKXUHS
Shared with ❤️ by the HuggingFace team
reacted to piercus's post 5 months ago
Post
Starts erasing!
This is made with a one-step SD1.5 LBM [1] eraser!
Data is open. Data pipeline is open. Training code is open.
On our LBM fork: https://github.com/finegrain-ai/LBM
[1] LBM: Latent Bridge Matching for Fast Image-to-Image Translation (2503.07535)
reacted to sourceoftruthdata's post with ❤️ 5 months ago
posted an update 5 months ago
Post
# PatenTEB: A Comprehensive Benchmark for Patent Text Embeddings 🎯
Very excited to finally be able to announce the (partial) release of **PatenTEB**, the first comprehensive benchmark specifically designed for evaluating text embedding models on patent-specific tasks!
## What's Released
### 📦 15 Benchmark Datasets (NEW to MTEB)
All tasks are **completely new** and not previously available in MTEB or other benchmarks:
- **3 Classification tasks**: Patent citation timing, NLI directionality, IPC3 technology classification
- **2 Clustering tasks**: IPC-based and inventor-based patent grouping
- **8 Retrieval tasks**: 3 symmetric (IN/MIXED/OUT domain) + 5 asymmetric (fragment-to-document matching)
- **2 Paraphrase tasks**: Problem and solution semantic similarity detection
**All datasets**: [huggingface.co/datalyes](https://huggingface.co/datalyes)
### 🤗 12 Trained Models
The **patembed model family** (67M-344M parameters):
- 6 core models (large, base, base_small, small, mini, nano)
- 3 long-context variants (1024, 2048, 4096 tokens)
- 3 ablation models (no prompts, retrieval-only, no classification)
**All models**: [huggingface.co/datalyes](https://huggingface.co/datalyes)
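Since the benchmark tasks are contributed to MTEB, a hedged sketch of how an evaluation might look with the `mteb` library; the task id `PatentTaskName` is a placeholder (real task names must be taken from the PatenTEB repo), and `datalyes/patembed-base` is an assumed repo id:

```python
# Task families and counts as stated in the release notes above.
TASK_FAMILIES = {"classification": 3, "clustering": 2, "retrieval": 8, "paraphrase": 2}

def total_tasks(families=TASK_FAMILIES):
    """Total number of benchmark tasks across all families."""
    return sum(families.values())

def evaluate_on_mteb(task_names, model_name="datalyes/patembed-base"):
    """Run an MTEB evaluation over the given task names. Requires the mteb
    and sentence-transformers packages plus a network connection; task_names
    and model_name are placeholders to be replaced with real identifiers."""
    import mteb
    from sentence_transformers import SentenceTransformer
    model = SentenceTransformer(model_name)
    tasks = mteb.get_tasks(tasks=task_names)
    return mteb.MTEB(tasks=tasks).run(model, output_folder="results")

print(total_tasks())  # 3 + 2 + 8 + 2 = 15
```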
## Resources
- **Paper**: [arXiv:2510.22264](https://arxiv.org/abs/2510.22264)
- **Datasets**: [huggingface.co/datalyes](https://huggingface.co/datalyes) (15 tasks)
- **Models**: [huggingface.co/datalyes](https://huggingface.co/datalyes)
- **GitHub**: [github.com/iliass-y/patenteb](https://github.com/iliass-y/patenteb)
- **License**: CC BY-NC-SA 4.0 (non-commercial research use)
## Acknowledgments
Big thanks to:
- **Lens.org** for providing access to raw patent data at a cost reasonable for small labs like ours
- **MTEB community** for the excellent benchmark framework and the inspiration
- **Sentence Transformers** team for the powerful embedding library
#patent #nlp #embeddings #benchmark #retrieval #classification #mteb #sentence-transformers