Gaia: Soil Microbiome Transformer

Gaia is a transformer language model for soil microbiome abundance profiles. It treats microbial genera as tokens and learns representations that transfer to downstream tasks such as soil pH, soil organic carbon, and crop yield prediction.

Model family

This repository contains multiple checkpoint lines from the Gaia project:

Checkpoint                  Description
gaia_v4/best/               Final recommended checkpoint
gaia_v3/best/               v3 release
gaia_v2/best/               v2 release
gaia_expanded/best/         Expanded vocabulary variant
mgm_soil/, mgm_soil_3k/     Initial MGM-based soil pre-training runs

Intermediate checkpoint-* folders are also included for full reproducibility.

Input / Output

  • Input: Genus-level relative abundances of a soil microbial community (~1000 genera), tokenized into a sequence of length 512 with genera ordered by decreasing abundance.
  • Output: Embeddings suitable for linear probing or fine-tuning on soil property prediction (pH, organic carbon, crop yield, cross-site out-of-distribution evaluation, etc.).
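
The input encoding described above can be illustrated with a minimal sketch. It is not the repository's tokenizer: the function name `rank_tokenize`, the `<pad>` token, the choice to drop zero-abundance genera, and the alphabetical tie-breaking are all assumptions made for illustration. Only the ordering by decreasing abundance and the sequence length of 512 come from the description above.

```python
from typing import Dict, List

SEQ_LEN = 512        # sequence length stated in the model card
PAD_TOKEN = "<pad>"  # hypothetical padding token; the real tokenizer may differ


def rank_tokenize(abundances: Dict[str, float],
                  seq_len: int = SEQ_LEN,
                  pad_token: str = PAD_TOKEN) -> List[str]:
    """Order genera by decreasing relative abundance, then truncate or pad."""
    # Assumption: genera absent from the sample (abundance 0) are not emitted.
    present = [(genus, value) for genus, value in abundances.items() if value > 0]
    # Sort by abundance, highest first; ties are broken by name for determinism.
    present.sort(key=lambda item: (-item[1], item[0]))
    # Keep at most seq_len genera (the most abundant ones).
    tokens = [genus for genus, _ in present[:seq_len]]
    # Right-pad short samples so every sequence has the same length.
    tokens += [pad_token] * (seq_len - len(tokens))
    return tokens


# Toy sample with made-up abundances, shortened to 5 tokens for readability.
sample = {"Bacillus": 0.12, "Streptomyces": 0.30, "Pseudomonas": 0.05, "Nitrospira": 0.0}
tokens = rank_tokenize(sample, seq_len=5)
print(tokens)
# → ['Streptomyces', 'Bacillus', 'Pseudomonas', '<pad>', '<pad>']
```

In this sketch, position in the sequence encodes rank, so the abundance values themselves are discarded after sorting. Whether Gaia's tokenizer also encodes abundance magnitude, or adds special tokens, should be checked against the code in the repository.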

Training data

Gaia was trained on the companion dataset Kimchikilla/gaia-corpus, which aggregates publicly available soil metagenomic studies (MGnify, EMP, Naylor, Bernburg long-term, USDA tillage, NEON, and more).

Code

The full training and evaluation pipeline, including the tokenizer, preprocessing, and benchmark scripts, is available at: https://github.com/Kimchikilla/gaia

Citation

A paper draft is available in the repository (docs/paper/). Please cite it when using these weights.
