Thank you!
Ideally yes, but I'm unsure when I'll be able to get to it.
iliass ayaou
datalyes
AI & ML interests
information retrieval, patent retrieval, knowledge management, data engineering and architecture, NLP
Recent Activity
new activity about 7 hours ago
datalyes/DAPFAM_patent: How exactly are the IN and OUT values calculated?
replied to their post about 7 hours ago
posted an update 2 months ago
Post
I am happy to share that three models from PatenTEB are now publicly available: patembed-large, patembed-base, and patembed-base_long_4096.
The models can be found here:
https://huggingface.co/collections/datalyes/patembed-models-collection
Feedback, issues, and use cases are very welcome.
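For anyone who wants to try the models, here is a minimal sketch using Sentence Transformers. The repo id `datalyes/patembed-base` is an assumption based on the collection name, and the two example texts are invented for illustration:

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two 1-D vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def embed_and_compare(texts, model_name="datalyes/patembed-base"):
    """Encode a pair of patent texts and return their cosine similarity.
    Requires the sentence-transformers package and a network connection;
    the model_name default is an assumed repo id."""
    from sentence_transformers import SentenceTransformer
    model = SentenceTransformer(model_name)
    emb = model.encode(texts)
    return cosine_sim(emb[0], emb[1])

# Example (downloads the model on first use):
# embed_and_compare([
#     "A lithium-ion battery electrode comprising a silicon-carbon composite.",
#     "An anode material with silicon particles embedded in graphite.",
# ])
```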
reacted to nouamanetazi's post 5 months ago
Post
After training SmolLM3 on 384 H100s for nearly a month, I've come to realize something most people overlook: infrastructure is the make-or-break factor in LLM training. 🔥
Everyone talks about model architecture and data quality. And yes, those matter immensely. But here's what nobody tells you: when your training run fails at 2 AM because of mysterious NCCL errors, or when your expensive GPU cluster is running at 60% efficiency, the problem isn't your model. It's most probably a misuse of the hardware. 🛠️
Questions that seemed simple but had no clear answers: Why is MoE training slower than dense models? Which NCCL flags should we actually set? How often should we checkpoint without killing throughput?
That's why we built The Smol Training Playbook 📖: a complete guide covering everything from model architecture and data curation to the SmolLM3 training marathon, post-training techniques, and crucially, the infrastructure layer that most teams get wrong.
We validated real vs theoretical bandwidth across the entire stack: HBM3 hitting 3 TB/s, NVLink 4.0 reaching 786 GB/s, EFA at 42.5 GB/s. Then we ran collective operations across 128 GPUs (16 nodes, 8xH100s each) and measured how performance degrades at scale: all-reduce drops from 480 GB/s on a single node to 320-350 GB/s across 16 nodes.
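Figures like these can be sanity-checked with a small helper. This sketch follows the bus-bandwidth convention that nccl-tests documents for ring all-reduce (busbw = algbw x 2(n-1)/n, with algbw = size / time); the 1 GiB / 5 ms / 8-GPU example values are illustrative, not measurements from the post:

```python
def allreduce_busbw_gbps(size_bytes, seconds, n_ranks):
    """Bus bandwidth (GB/s) for a ring all-reduce, using the formula that
    nccl-tests reports: busbw = algbw * 2*(n-1)/n, algbw = size / time."""
    algbw = size_bytes / seconds                     # bytes per second
    return algbw * 2 * (n_ranks - 1) / n_ranks / 1e9

# Illustrative numbers (not from the post): a 1 GiB all-reduce across
# 8 GPUs completing in 5 ms sustains about 375.8 GB/s of bus bandwidth.
print(round(allreduce_busbw_gbps(1 << 30, 5e-3, 8), 1))  # 375.8
```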
If you've ever wondered why your training runs are slower than they should be, or you're planning to scale up and want to avoid expensive mistakes, this guide might save you weeks of debugging.
The Smol Training Playbook: https://lnkd.in/e5MKXUHS
Shared with ❤️ by the HuggingFace team
reacted to piercus's post 5 months ago
Post
Starts erasing!
This is made with a one-step SD1.5 LBM [1] eraser!
Data is open. Data pipeline is open. Training code is open.
On our LBM fork: https://github.com/finegrain-ai/LBM
[1] LBM: Latent Bridge Matching for Fast Image-to-Image Translation (2503.07535)
reacted to sourceoftruthdata's post with ❤️ 5 months ago
posted an update 5 months ago
Post
# PatenTEB: A Comprehensive Benchmark for Patent Text Embeddings 🎯
Very excited to finally be able to announce the (partial) release of **PatenTEB**, the first comprehensive benchmark specifically designed for evaluating text embedding models on patent-specific tasks!
## What's Released
### 📦 15 Benchmark Datasets (NEW to MTEB)
All tasks are **completely new** and not previously available in MTEB or other benchmarks:
- **3 Classification tasks**: Patent citation timing, NLI directionality, IPC3 technology classification
- **2 Clustering tasks**: IPC-based and inventor-based patent grouping
- **8 Retrieval tasks**: 3 symmetric (IN/MIXED/OUT domain) + 5 asymmetric (fragment-to-document matching)
- **2 Paraphrase tasks**: Problem and solution semantic similarity detection
**All datasets**: [huggingface.co/datalyes](https://huggingface.co/datalyes)
### 🤗 12 Trained Models
The **patembed model family** (67M-344M parameters):
- 6 core models (large, base, base_small, small, mini, nano)
- 3 long-context variants (1024, 2048, 4096 tokens)
- 3 ablation models (no prompts, retrieval-only, no classification)
**All models**: [huggingface.co/datalyes](https://huggingface.co/datalyes)
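Since the benchmark tasks are contributed to MTEB, a hedged sketch of how an evaluation might look with the `mteb` library; the task id `PatentTaskName` is a placeholder (real task names must be taken from the PatenTEB repo), and `datalyes/patembed-base` is an assumed repo id:

```python
# Task families and counts as stated in the release notes above.
TASK_FAMILIES = {"classification": 3, "clustering": 2, "retrieval": 8, "paraphrase": 2}

def total_tasks(families=TASK_FAMILIES):
    """Total number of benchmark tasks across all families."""
    return sum(families.values())

def evaluate_on_mteb(task_names, model_name="datalyes/patembed-base"):
    """Run an MTEB evaluation over the given task names. Requires the mteb
    and sentence-transformers packages plus a network connection; task_names
    and model_name are placeholders to be replaced with real identifiers."""
    import mteb
    from sentence_transformers import SentenceTransformer
    model = SentenceTransformer(model_name)
    tasks = mteb.get_tasks(tasks=task_names)
    return mteb.MTEB(tasks=tasks).run(model, output_folder="results")

print(total_tasks())  # 3 + 2 + 8 + 2 = 15
```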
## Resources
- **Paper**: [arXiv:2510.22264](https://arxiv.org/abs/2510.22264)
- **Datasets**: [huggingface.co/datalyes](https://huggingface.co/datalyes) (15 tasks)
- **Models**: [huggingface.co/datalyes](https://huggingface.co/datalyes)
- **GitHub**: [github.com/iliass-y/patenteb](https://github.com/iliass-y/patenteb)
- **License**: CC BY-NC-SA 4.0 (non-commercial research use)
## Acknowledgments
Big thanks to:
- **Lens.org** for providing access to raw patent data at a cost reasonable for small labs like ours
- **MTEB community** for the excellent benchmark framework and the inspiration
- **Sentence Transformers** team for the powerful embedding library
#patent #nlp #embeddings #benchmark #retrieval #classification #mteb #sentence-transformers