Gaia: Soil Microbiome Transformer

Gaia is a transformer language model for soil microbiome abundance profiles. It treats microbial genera as tokens and learns representations that transfer to downstream tasks such as soil pH, soil organic carbon, and crop yield prediction.

Model family

This repository contains several checkpoint lineages from the Gaia project:

Checkpoint	Description
`gaia_v4/best/`	Final recommended checkpoint
`gaia_v3/best/`	v3 release
`gaia_v2/best/`	v2 release
`gaia_expanded/best/`	Expanded vocabulary variant
`mgm_soil/`, `mgm_soil_3k/`	Initial MGM-based soil pre-training runs

Intermediate `checkpoint-*` folders are also included for reproducibility.
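Because the repository holds several checkpoint folders, it can be convenient to fetch only one of them. The following is a minimal sketch using `snapshot_download` from `huggingface_hub` with an `allow_patterns` filter. `REPO_ID` is a placeholder (this card does not state the Hub id of the model repository), and `checkpoint_patterns` and `download_checkpoint` are hypothetical helper names, not part of the Gaia codebase.

```python
from typing import List

# Placeholder: replace with the Hub id of this model repository (not stated in this card).
REPO_ID = "your-namespace/your-gaia-model-repo"


def checkpoint_patterns(checkpoint_dir: str) -> List[str]:
    """Build a glob pattern that matches every file under one checkpoint folder."""
    return [checkpoint_dir.strip("/") + "/*"]


def download_checkpoint(checkpoint_dir: str = "gaia_v4/best/", repo_id: str = REPO_ID) -> str:
    """Download a single checkpoint folder and return the local snapshot root."""
    # Imported lazily so the pattern helper works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=repo_id,
        allow_patterns=checkpoint_patterns(checkpoint_dir),
    )


print(checkpoint_patterns("gaia_v4/best/"))
```

Calling `download_checkpoint()` with a real repository id should return the local snapshot directory, with the requested files under `gaia_v4/best/` inside it. How the weights are then loaded depends on the model code in the GitHub repository.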

Input / Output

Input: Genus-level relative abundances of a soil microbial community (~1000 genera), tokenized into a sequence of length 512 in which genera are ordered by decreasing abundance.
Output: Embeddings suitable for linear probing or fine-tuning on soil property prediction (pH, organic carbon, crop yield), including cross-site out-of-distribution (OOD) evaluation.
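The input ordering described above can be illustrated with a small sketch. This is not the project's tokenizer (that lives in the GitHub repository). The padding token, the choice to drop zero-abundance genera, the alphabetical tie-breaking, and the function name `rank_tokenize` are all assumptions made for illustration.

```python
from typing import Dict, List

PAD_TOKEN = "<pad>"  # hypothetical padding token; the real tokenizer may differ
MAX_LEN = 512        # sequence length stated in this card


def rank_tokenize(abundances: Dict[str, float], max_len: int = MAX_LEN) -> List[str]:
    """Order genera by decreasing relative abundance, then truncate or pad."""
    # Assumption: genera with zero abundance are left out of the sequence.
    present = [(genus, value) for genus, value in abundances.items() if value > 0]
    # Sort by abundance (descending); break ties alphabetically for determinism.
    present.sort(key=lambda item: (-item[1], item[0]))
    tokens = [genus for genus, _ in present[:max_len]]
    # Right-pad short samples so every sequence has the same length.
    tokens += [PAD_TOKEN] * (max_len - len(tokens))
    return tokens


sample = {"Bacillus": 0.12, "Streptomyces": 0.30, "Pseudomonas": 0.05, "Nitrospira": 0.0}
tokens = rank_tokenize(sample, max_len=4)
print(tokens)  # ['Streptomyces', 'Bacillus', 'Pseudomonas', '<pad>']
```

In this sketch the abundance values themselves are discarded and only the rank order survives in the sequence. Whether Gaia also encodes abundance magnitudes is not stated here.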

Training data

Trained on the companion dataset Kimchikilla/gaia-corpus, which aggregates publicly available soil metagenomic studies (MGnify, EMP, Naylor, Bernburg long-term, USDA tillage, NEON, and more).

Code

The full training and evaluation pipeline, including the tokenizer, preprocessing, and benchmark scripts, is available at https://github.com/Kimchikilla/gaia

Citation

A paper draft is available in the repository (`docs/paper/`). Please cite it when using these weights.
