Gaia: Soil Microbiome Transformer
Gaia is a transformer language model for soil microbiome abundance profiles. It treats microbial genera as tokens and learns representations that transfer to downstream tasks such as soil pH, soil organic carbon, and crop yield prediction.
Model family
This repository contains multiple checkpoint lines from the Gaia project:
| Checkpoint | Description |
|---|---|
| gaia_v4/best/ | Final recommended checkpoint |
| gaia_v3/best/ | v3 release |
| gaia_v2/best/ | v2 release |
| gaia_expanded/best/ | Expanded vocabulary variant |
| mgm_soil/, mgm_soil_3k/ | Initial MGM-based soil pre-training runs |
Intermediate checkpoint-* folders are also included for full reproducibility.
Input / Output
- Input: Genus-level relative abundances of a soil microbial community (~1000 genera), tokenized into a sequence of length 512 with genera ordered by decreasing abundance.
- Output: Embeddings suitable for linear probing / fine-tuning on soil property prediction (pH, carbon, yield, cross-site OOD, etc.).
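
The input encoding described above can be sketched in a few lines of Python. This is a minimal illustration, not the repository's tokenizer: the function name `abundance_to_tokens`, the `<pad>` token, the dropping of zero-abundance genera, and the alphabetical tie-breaking are all assumptions made for the example. The actual tokenizer lives in the code repository.

```python
from typing import Dict, List

PAD_TOKEN = "<pad>"  # hypothetical padding token; the real tokenizer may use another
MAX_LEN = 512        # sequence length stated in the Input description


def abundance_to_tokens(profile: Dict[str, float], max_len: int = MAX_LEN) -> List[str]:
    """Turn a genus -> relative abundance mapping into a fixed-length token list."""
    # Assumption: genera with zero abundance are not emitted as tokens.
    present = [(genus, abundance) for genus, abundance in profile.items() if abundance > 0]
    # Order by decreasing abundance; break ties alphabetically so output is deterministic.
    present.sort(key=lambda item: (-item[1], item[0]))
    # Keep only the most abundant genera that fit in the sequence.
    tokens = [genus for genus, _ in present[:max_len]]
    # Pad short profiles up to the fixed length.
    tokens += [PAD_TOKEN] * (max_len - len(tokens))
    return tokens


sample = {
    "Bacillus": 0.12,
    "Streptomyces": 0.30,
    "Pseudomonas": 0.05,
    "Rhizobium": 0.0,
}
tokens = abundance_to_tokens(sample, max_len=6)
print(tokens)
```

With the default `max_len`, the same call returns a list of 512 tokens, most of them padding for a profile this small. Mapping genus names to integer IDs would be a further step that depends on the vocabulary shipped with each checkpoint.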
Training data
Trained on the companion dataset Kimchikilla/gaia-corpus,
which aggregates publicly available soil metagenomic studies (MGnify, EMP,
Naylor, Bernburg long-term, USDA tillage, NEON, and more).
Code
The full training and evaluation pipeline, including the tokenizer, preprocessing, and benchmark scripts, is available at: https://github.com/Kimchikilla/gaia
Citation
A paper draft is available in the repository (docs/paper/). Please cite it
when using these weights.