# Qwen3.5-27B-Text2SQL

Fine-tuned Qwen/Qwen3.5-27B for text-to-SQL generation. Given a database schema and a natural-language question, the model outputs a clean SQL query.

## Key Results

Improvements are absolute percentage points (pp).

| Metric | Base Model | This Model | Improvement |
| --- | --- | --- | --- |
| Gretel Execution Accuracy (3,492 samples) | 26.5% | 66.7% | +40.2 pp |
| Gretel Valid SQL | 37.8% | 89.4% | +51.6 pp |
| Spider Execution Accuracy (1,032, gold std) | 46.6% | 55.1% | +8.5 pp |
| Spider Exact Match | 0.0% | 22.2% | +22.2 pp |
| Spider Keyword Score | 45.5% | 85.4% | +39.9 pp |

## Regression (standard benchmarks via lm-eval-harness)

| Benchmark | Base | This Model | Delta |
| --- | --- | --- | --- |
| MMLU (humanities) | 81.9% | 83.5% | +1.6 pp (no regression) |
| MMLU (STEM) | 87.2% | 86.7% | -0.5 pp (no regression) |
| MMLU (social sciences) | 92.0% | 92.0% | 0 pp (no regression) |
| MMLU (other) | 87.5% | 87.5% | 0 pp (no regression) |
| GSM8K (math, strict) | 60.4% | 35.4% | -25.0 pp (regression) |
| HellaSwag (common sense) | 79.6% | 84.1% | +4.5 pp (improved) |
| ARC-Challenge (reasoning) | 69.3% | 71.3% | +2.0 pp (improved) |

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "mahernaija/Qwen3.5-27B-Text2SQL",
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "mahernaija/Qwen3.5-27B-Text2SQL",
    trust_remote_code=True,
)

schema = "CREATE TABLE employees (id INTEGER, name TEXT, department TEXT, salary REAL);"
question = "Find all employees in Engineering with salary above 90000."

prompt = (
    "<|im_start|>system\n"
    "You are a SQL expert. Given a database schema and a natural language question, "
    "write the correct SQL query.<|im_end|>\n"
    "<|im_start|>user\n"
    f"Schema: {schema}\nQuestion: {question}<|im_end|>\n"
    "<|im_start|>assistant\n"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# Greedy decoding; in transformers, pass do_sample=False rather than temperature=0
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)
response = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(response)
# SELECT * FROM employees WHERE department = 'Engineering' AND salary > 90000
```
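Since the model emits plain SQL, a generated query can be sanity-checked by executing it against the schema in an in-memory SQLite database before use. A minimal sketch (the `run_query` helper is hypothetical, not part of this repo):

```python
import sqlite3

def run_query(schema: str, query: str):
    """Execute `query` against a fresh in-memory database built from `schema`."""
    con = sqlite3.connect(":memory:")
    try:
        con.executescript(schema)  # create the tables (and any seed rows)
        return con.execute(query).fetchall()
    finally:
        con.close()

schema = "CREATE TABLE employees (id INTEGER, name TEXT, department TEXT, salary REAL);"
rows = run_query(
    schema,
    "SELECT * FROM employees WHERE department = 'Engineering' AND salary > 90000",
)
print(rows)  # [] -- no rows inserted, but the query parsed and executed
```

If the query raises `sqlite3.Error`, the generation is invalid SQL and can be retried or rejected.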

## With vLLM

Start the server:

```shell
vllm serve mahernaija/Qwen3.5-27B-Text2SQL --tensor-parallel-size 2
```

Then query it with the OpenAI-compatible client:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="token")

response = client.chat.completions.create(
    model="mahernaija/Qwen3.5-27B-Text2SQL",
    messages=[
        {"role": "system", "content": "You are a SQL expert. Given a database schema and a natural language question, write the correct SQL query."},
        {"role": "user", "content": "Schema: CREATE TABLE products (id INT, name TEXT, price REAL);\nQuestion: What are the 3 most expensive products?"},
    ],
    max_tokens=256,
    temperature=0,
)
print(response.choices[0].message.content)
# SELECT name FROM products ORDER BY price DESC LIMIT 3
```

## Training Details

| Parameter | Value |
| --- | --- |
| Base model | Qwen/Qwen3.5-27B (26.9B params) |
| Method | Full fine-tuning with FSDP |
| Hardware | 16× NVIDIA H200 (2 nodes, Nebius AI Cloud) |
| Training time | 3 hours 6 minutes |
| Dataset | sql-create-context (78K) + Gretel synthetic (100K) = 178K samples |
| Epochs | 1 |
| Batch size | 2 per GPU × 8 grad accum × 16 GPUs = 256 global |
| Learning rate | 2e-5 (cosine decay, 5% warmup) |
| Sequence length | 512 (SQL samples P99 = 373 tokens) |
| Precision | BF16 |
| GPU utilization | 98-100% |
| Final train loss | 0.144 |
| Final eval loss | 0.176 |

## Data Preprocessing

- Schemas stripped of INSERT data (the model learns SQL from schema structure, not memorized answers)
- Manual chat format (bypasses Qwen3.5 `<think>` tag injection)
- Label masking: loss computed only on the SQL output; prompt tokens masked with -100
- Deduplication + contamination check between train and eval splits
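The label-masking step above can be sketched as follows (an assumed implementation, not the actual training code; it relies on the HF convention that positions labeled -100 are ignored by the cross-entropy loss):

```python
# Sketch: mask prompt tokens so loss is computed only on the SQL completion.
IGNORE_INDEX = -100  # HF Trainer / PyTorch CrossEntropyLoss ignore_index

def build_example(prompt_ids: list, sql_ids: list):
    """Concatenate prompt and SQL token ids; mask prompt positions in labels."""
    input_ids = prompt_ids + sql_ids
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(sql_ids)
    return input_ids, labels

input_ids, labels = build_example([11, 12, 13], [7, 8])
print(labels)  # [-100, -100, -100, 7, 8]
```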

## Evaluation

Gretel Execution Accuracy (gold standard: runs the SQL in SQLite and compares result sets):

| Complexity | Base | This Model |
| --- | --- | --- |
| Basic SQL | 23.8% | 71.4% |
| Aggregation | 18.2% | 54.5% |
| Single JOIN | 25.0% | 75.0% |
| Window functions | 0.0% | 33.3% |
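The execution-accuracy check can be sketched as: run the predicted and gold queries against the same database and compare result sets. This is an illustrative reimplementation, not the actual harness (assumptions: SQLite dialect, order-insensitive comparison, invalid SQL counted as a miss):

```python
import sqlite3

def execution_match(setup_sql: str, predicted_sql: str, gold_sql: str) -> bool:
    """True if predicted and gold queries return the same rows (order-insensitive)."""
    con = sqlite3.connect(":memory:")
    try:
        con.executescript(setup_sql)  # schema + seed data
        try:
            pred = con.execute(predicted_sql).fetchall()
        except sqlite3.Error:
            return False  # invalid SQL counts as a miss
        gold = con.execute(gold_sql).fetchall()
        return sorted(pred) == sorted(gold)
    finally:
        con.close()

setup = """
CREATE TABLE t (x INTEGER);
INSERT INTO t VALUES (1), (2), (3);
"""
print(execution_match(setup, "SELECT x FROM t WHERE x > 1", "SELECT x FROM t WHERE x >= 2"))  # True
```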

Spider Benchmark (1,034 dev questions, public standard):

- Exact match: 22.2%
- Keyword score: 85.4%

## Known Limitations

Catastrophic forgetting: full fine-tuning on SQL-only data caused a regression in general capabilities. The model tends to answer non-SQL questions with SQL (56% SQL contamination on general prompts). For production use with mixed tasks, consider:

- LoRA fine-tuning instead of full fine-tuning
- Mixed training data (SQL + general chat)
- Using this model only in SQL-specific pipelines
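A lightweight mitigation for mixed-task pipelines (a hypothetical routing heuristic, not part of this release) is to detect when the model answers a general prompt with SQL and fall back to the base model:

```python
import re

# Heuristic: treat output that starts with a SQL statement keyword as SQL.
SQL_PATTERN = re.compile(r"^\s*(SELECT|INSERT|UPDATE|DELETE|CREATE|WITH)\b", re.IGNORECASE)

def looks_like_sql(text: str) -> bool:
    return bool(SQL_PATTERN.match(text))

print(looks_like_sql("SELECT name FROM products"))            # True
print(looks_like_sql("Paris is the capital of France."))      # False
```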

## Regression Test

| Category | Base | This Model | SQL Contamination |
| --- | --- | --- | --- |
| General Knowledge | 84% | 44% | 2/5 |
| Math | 100% | 40% | 3/5 |
| Code | 48% | 47% | 5/5 |
| Language | 90% | 78% | 4/5 |
| Common Sense | 82% | 93% | 0/5 |

## Architecture

- Architecture: `Qwen3_5ForConditionalGeneration` (VLM with text + vision)
- Text backbone: 64 layers, hidden_size=5120, 24 attention heads, 4 KV heads
- Attention: hybrid GDN (linear_attention + full_attention)
- Context: 262K tokens
- Vision: built-in 27-layer ViT (weights from the base model, not fine-tuned)

## Files

| File | Description |
| --- | --- |
| model-00001-of-00003.safetensors | Text backbone weights (shard 1/3) |
| model-00002-of-00003.safetensors | Text backbone weights (shard 2/3) |
| model-00003-of-00003.safetensors | Text backbone weights (shard 3/3) |
| model-visual.safetensors | Vision encoder weights (from base model) |
| config.json | Full VLM config (required by vLLM) |
| tokenizer.json | Tokenizer |

## Citation

```bibtex
@misc{naija2026qwen35text2sql,
  title={Qwen3.5-27B-Text2SQL: Fine-tuned Qwen 3.5 for Text-to-SQL},
  author={Maher Naija},
  year={2026},
  url={https://huggingface.co/mahernaija/Qwen3.5-27B-Text2SQL}
}
```

## License

Apache 2.0 (same as the base model, Qwen/Qwen3.5-27B)
